
ST446 Assessment 2: LDA Clustering



Dataset

We use an English-language Wikipedia dump dataset in this assessment, as in Assignment 1. You must use the dump file available for download from here. This is a bzip2-compressed XML file.

Cluster configuration

Each problem requires a different cluster configuration (see below). You can submit separate notebooks for each question.

Remember to adjust the project name, bucket, and other parameters to your own setup.

For P1, use the following configuration:

gcloud dataproc clusters create st446-cluster --properties=^#^spark:spark.jars.packages=graphframes:graphframes:0.8.2-spark3.1-s_2.12,com.databricks:spark-xml_2.11:0.4.1 --enable-component-gateway --region europe-west2 --zone europe-west2-c --single-node --master-machine-type n2-standard-4 --master-boot-disk-size 500 --image-version 2.0-debian10 --optional-components JUPYTER --project st446-lt2023 --metadata 'PIP_PACKAGES=sklearn nltk pandas graphframes'

For P2, use the following configuration:

BUCKET="st446-w9-bucket"

gcloud beta dataproc clusters create st446-cluster --project st446-lt2023 --bucket ${BUCKET} --region europe-west3 --image-version=1.4-debian10 --optional-components=ANACONDA,JUPYTER --enable-component-gateway --initialization-actions gs://goog-dataproc-initialization-actions-europe-west2/python/pip-install.sh,gs://${BUCKET}/my_actions.sh --metadata 'PIP_PACKAGES=sklearn nltk pandas numpy'

P1 Graph data processing

In this exercise, the task is to perform graph data processing using the PySpark GraphFrames API. You should use graph queries, not dataframe/SQL queries.

P1.1 Creating a Vertex dataframe

You need to create a Vertex dataframe by first creating three Vertex dataframes and then creating the final Vertex dataframe as the union of the three (union by column name). The specification for the three Vertex dataframes is as follows.

  • Collaborator Vertex dataframe vco
    • Vertex ID is md5 hash of the concatenation of username and contributor id strings
    • Attribute column name type, String type, column values = "contributor"
    • Attribute column name contributorID, String type, column values are contributor id values
    • Attribute column name name, String type, column values are username values
  • Page Vertex dataframe vpa
    • Vertex ID is md5 hash of the concatenation of page id and page title
    • Attribute column name type, String type, column values = "page"
    • Attribute column name pageID, String type, column values are page id values
    • Attribute column name title, String type, column values are page titles
  • Category Vertex dataframe vca
    • Vertex ID is md5 hash of category name
    • Attribute column name type, String type, column values = "category"
    • Attribute column name category, String type, column values are category names

The final Vertex dataframe, v, must be the union of Vertex dataframes vco, vpa and vca (union by column name).

Show the schema and top 5 rows for each of the four Vertex dataframes.

Note: all the md5 hash values must be encoded in hexadecimal format.

P1.2 Creating an Edge dataframe

You need to create an Edge dataframe, by first creating two Edge dataframes and then creating one final Edge dataframe as the union of the two (union by column name). One of the two (contributor-page) contains edges with a contributor as the source vertex and a page as the destination vertex; the other (page-category) contains edges with a page as the source vertex and a category as the destination vertex. The specification for the two Edge dataframes is as follows:

  • Contributor-page Edge dataframe ecp
    • Attribute column name type, String type, column values = "revision"
    • Attribute column name timestamp, Timestamp type, column values are revision times
  • Page-category Edge dataframe epc
    • Attribute column name type, String type, column values = "is of category"

The final Edge dataframe, e, must be the union of Edge dataframes ecp and epc (union by column name).

Show the schema and top 5 rows for each of the three Edge dataframes.

P1.3 Basic graph queries

Create a graphframe g with the Vertex dataframe v and Edge dataframe e defined in P1.1 and P1.2, respectively. Use g and the GraphFrames API to perform the following queries.

  • Show the counts for the number of contributors, pages, and categories.
  • Show the counts for the number of edges connecting contributor and page vertices, and edges connecting page and category vertices.
  • Compute and show a plot for the cumulative number of revisions versus time.
  • Show the top 5 contributors with respect to the number of page revisions. The output must contain name of the contributor and the number of revisions.
  • Create a subgraph defined as the subgraph of g that is the bipartite graph with page and category vertices.

P1.4 Contributions over categories

Using graphframe g, show:

  • top 10 categories with respect to the number of revisions of pages in each category; your output must contain the category name and the number of revisions;
  • top 10 categories with respect to the number of distinct contributors per category; your output must contain the category name and the corresponding number of distinct contributors.

P1.5 A graph for pages

Using graphframe g, create a graph whose vertices are pages, with an edge between two page vertices if, and only if, the two pages have at least one category in common.

Compute PageRank scores for the vertices of the created graph. Show the top 10 vertices in decreasing order of PageRank score. Your output must contain the page title and the page's PageRank score.

Repeat the above, but define the graph such that there is an edge between two page vertices if, and only if, the two pages have at least 5 categories in common.

P2 Topic modelling

In this exercise, the task is to perform topic modelling based on Wikipedia data. You can use the seminar materials as a reference and apply any topic modelling model available in PySpark.

P2.1 Creating a document corpus

You need to create a corpus (a set of documents) based on Wikipedia pages. First, create a dataframe with the page id, title, text (limited to 3,000 words) and category list retrieved from each Wikipedia page. You have some freedom to decide which sections to use when extracting the 3,000 words (for instance, you can skip the History section that most Wikipedia pages have). Keep only pages that have at least one category.

You must: i) show the top 10 rows of the dataframe; and ii) apply pre-processing steps to parse the data, remove stop words, and create the feature vectors and vocabulary. Show the first 20 feature vectors and the corresponding vocabulary entries.

P2.2 Perform topic modelling

You should apply the LDA model (from PySpark) to the dataset to classify each document into a distribution over topics. You should aim to identify 15 topics across all Wikipedia pages, with no fewer than 10 words describing each topic. For each topic, show the top 10 words and their weights. For the first 15 pages in the dataframe, print the title, topic and corresponding words. Your task is to analyse and discuss whether the topic and words actually represent/approximate the title. You can increase the sample size to better support your analysis, if necessary.

Tip for training: you can use any batch approach to train your model. The online version of LDA, for instance, takes a long time to finish, so you may want to avoid it.

P3 Your graph data analytics question

Formulate a graph data analytics question. Your question should be summarised in one sentence. You may elaborate with some additional text, but the question itself should be well summarised in one sentence. The only input for your analysis must be the graphframe g defined in P1.3. Implement and evaluate your data analysis. Comment on the results.

Marking scheme

Problem breakdown                                            Max points
P1.1 - Create a Vertex dataframe                             5
P1.2 - Create an Edge dataframe                              5
P1.3 - Basic graph queries                                   5
P1.4 - Contributions over categories                         10
P1.5 - A graph for pages                                     15
P2.1 - Corpus and data pre-processing                        10
P2.2 - Topic modelling                                       15
P3   - Use of graph queries and correctness of the solution  10
P3   - Complexity of the graph data analytics question       10
P3   - Originality of the question                           15
Total                                                        100
