ST446 Assessment 2
Dataset
As in Assignment 1, we use an English-language Wikipedia dump dataset in this assessment. You must use the dump file available for download from here; it is a bzip2-compressed XML file.
Cluster configuration
Each problem requires a different cluster configuration (see below). You can submit separate notebooks for each question.
Remember to adjust the project name, bucket and other parameters to your own settings.
For P1, use the following configuration:
```
gcloud dataproc clusters create st446-cluster \
  --properties=^#^spark:spark.jars.packages=graphframes:graphframes:0.8.2-spark3.1-s_2.12,com.databricks:spark-xml_2.12:0.13.0 \
  --enable-component-gateway \
  --region europe-west2 --zone europe-west2-c \
  --single-node --master-machine-type n2-standard-4 --master-boot-disk-size 500 \
  --image-version 2.0-debian10 --optional-components JUPYTER \
  --project st446-lt2023 \
  --metadata 'PIP_PACKAGES=sklearn nltk pandas graphframes'
```
For P2, use the following configuration:
```
BUCKET="st446-w9-bucket"
gcloud beta dataproc clusters create st446-cluster \
  --project st446-lt2023 --bucket ${BUCKET} \
  --region europe-west3 \
  --image-version=1.4-debian10 \
  --optional-components=ANACONDA,JUPYTER \
  --enable-component-gateway \
  --initialization-actions gs://goog-dataproc-initialization-actions-europe-west2/python/pip-install.sh,gs://${BUCKET}/my_actions.sh \
  --metadata 'PIP_PACKAGES=sklearn nltk pandas numpy'
```
P1 Graph data processing
In this exercise, the task is to perform graph data processing using the PySpark GraphFrames API. You should use graph queries, not dataframe/SQL queries.
P1.1 Creating a Vertex dataframe
You need to create a Vertex dataframe by first creating three Vertex dataframes and then forming the final Vertex dataframe as their union (union by column name). The specification for the three Vertex dataframes is as follows.

- Contributor Vertex dataframe `vco`
  - Vertex ID is the md5 hash of the concatenation of the `username` and `contributor id` strings
  - Attribute column name `type`, String type, column values = "contributor"
  - Attribute column name `contributorID`, String type, column values are `contributor id` values
  - Attribute column name `name`, String type, column values are `username` values
- Page Vertex dataframe `vpa`
  - Vertex ID is the md5 hash of the concatenation of `page id` and `page title`
  - Attribute column name `type`, String type, column values = "page"
  - Attribute column name `pageID`, String type, column values are `page id` values
  - Attribute column name `title`, String type, column values are page titles
- Category Vertex dataframe `vca`
  - Vertex ID is the md5 hash of the category name
  - Attribute column name `type`, String type, column values = "category"
  - Attribute column name `category`, String type, column values are category names

The final Vertex dataframe, `v`, must be the union of Vertex dataframes `vco`, `vpa` and `vca` (union by column name).
Show the schema and top 5 rows for each of the four Vertex dataframes.
Note: all the md5 hash values must be encoded in hexadecimal format.
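A minimal sketch of how the three Vertex dataframes and their union might be built, assuming dataframes `contributors`, `pages` and `categories` have already been parsed from the XML dump (those names and their column names are assumptions, not part of the specification):

```python
from pyspark.sql import functions as F

# Contributor vertices: ID = md5 of username concatenated with contributor id.
# F.md5 returns a 32-character hex string, satisfying the hexadecimal requirement.
vco = contributors.select(
    F.md5(F.concat_ws("", F.col("username"), F.col("contributor_id").cast("string"))).alias("id"),
    F.lit("contributor").alias("type"),
    F.col("contributor_id").cast("string").alias("contributorID"),
    F.col("username").alias("name"),
).distinct()

# Page vertices: ID = md5 of page id concatenated with page title
vpa = pages.select(
    F.md5(F.concat_ws("", F.col("page_id").cast("string"), F.col("title"))).alias("id"),
    F.lit("page").alias("type"),
    F.col("page_id").cast("string").alias("pageID"),
    F.col("title"),
).distinct()

# Category vertices: ID = md5 of the category name
vca = categories.select(
    F.md5("category").alias("id"),
    F.lit("category").alias("type"),
    F.col("category"),
).distinct()

# Union by column name; allowMissingColumns requires Spark >= 3.1 (image 2.0)
v = vco.unionByName(vpa, allowMissingColumns=True) \
       .unionByName(vca, allowMissingColumns=True)

for df in (vco, vpa, vca, v):
    df.printSchema()
    df.show(5)
```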
P1.2 Creating an Edge dataframe
You need to create an Edge dataframe by first creating two Edge dataframes and then forming the final Edge dataframe as their union (union by column name). The first (contributor-page) contains edges connecting a contributor as the source vertex to a page as the destination vertex; the second (page-category) contains edges connecting a page as the source vertex to a category as the destination vertex. The specification for the two Edge dataframes is as follows:

- Contributor-page Edge dataframe `ecp`
  - Attribute column name `type`, String type, column values = "revision"
  - Attribute column name `timestamp`, Timestamp type, column values are revision times
- Page-category Edge dataframe `epc`
  - Attribute column name `type`, String type, column values = "is of category"

The final Edge dataframe, `e`, must be the union of Edge dataframes `ecp` and `epc` (union by column name).
Show the schema and top 5 rows for each of the three Edge dataframes.
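A sketch along the same lines, assuming a `revisions` dataframe (one row per revision, with the contributor and page fields used in P1.1 plus a revision time) and a `page_categories` dataframe of page-category pairs; both names are assumptions:

```python
from pyspark.sql import functions as F

# Contributor -> page edges, one per revision
ecp = revisions.select(
    F.md5(F.concat_ws("", F.col("username"), F.col("contributor_id").cast("string"))).alias("src"),
    F.md5(F.concat_ws("", F.col("page_id").cast("string"), F.col("title"))).alias("dst"),
    F.lit("revision").alias("type"),
    F.to_timestamp("revision_time").alias("timestamp"),
)

# Page -> category edges
epc = page_categories.select(
    F.md5(F.concat_ws("", F.col("page_id").cast("string"), F.col("title"))).alias("src"),
    F.md5("category").alias("dst"),
    F.lit("is of category").alias("type"),
)

# Union by column name; epc has no timestamp column, so allow missing columns
e = ecp.unionByName(epc, allowMissingColumns=True)

for df in (ecp, epc, e):
    df.printSchema()
    df.show(5)
```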
P1.3 Basic graph queries
Create a graphframe `g` with the Vertex dataframe `v` and Edge dataframe `e` defined in P1.1 and P1.2, respectively. Use `g` and the GraphFrames API to perform the following queries (a setup sketch follows the list).
- Show the counts for the number of contributors, pages, and categories.
- Show the counts for the number of edges connecting contributor and page vertices, and of edges connecting page and category vertices.
- Compute and show a plot of the cumulative number of revisions versus time.
- Show the top 5 contributors with respect to the number of page revisions. The output must contain the name of the contributor and the number of revisions.
- Create the subgraph of `g` that is the bipartite graph with page and category vertices.
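A minimal sketch of the graphframe construction, the type-based counts, and the bipartite subgraph, using `filterVertices`/`filterEdges` from the GraphFrames API:

```python
from graphframes import GraphFrame

g = GraphFrame(v, e)

# Vertex counts by type: contributors, pages, categories
g.vertices.groupBy("type").count().show()

# Edge counts by type: contributor-page ("revision") and page-category edges
g.edges.groupBy("type").count().show()

# Bipartite page-category subgraph
g_pc = (g.filterEdges("type = 'is of category'")
         .filterVertices("type IN ('page', 'category')")
         .dropIsolatedVertices())
```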
P1.4 Contributions over categories
Using graphframe `g`, show:

- the top 10 categories with respect to the number of revisions of pages in each category. Your output must contain the category name and the count of the number of revisions;
- the top 10 categories with respect to the number of distinct contributors per category. Your output must contain the category name and the corresponding number of distinct contributors.
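One way to answer both queries is a single motif that joins contributors, pages, and categories through `g`; a minimal sketch:

```python
from pyspark.sql import functions as F

# Motif: contributor -[revision]-> page -[is of category]-> category
triples = (g.find("(c)-[r]->(p); (p)-[m]->(cat)")
            .filter("r.type = 'revision' AND m.type = 'is of category'"))

# Top 10 categories by number of revisions
(triples.groupBy("cat.category").count()
        .orderBy(F.desc("count")).show(10))

# Top 10 categories by number of distinct contributors
(triples.groupBy("cat.category")
        .agg(F.countDistinct("c.id").alias("num_contributors"))
        .orderBy(F.desc("num_contributors")).show(10))
```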
P1.5 A graph for pages
Using graphframe `g`, create a graph whose vertices are pages and in which there is an edge between two page vertices if, and only if, the two pages have at least one category in common.
Compute PageRank scores for the vertices of the created graph. Show the top 10 vertices in decreasing order of their PageRank scores. Your output must contain the page title and the page's PageRank score.
Do the same as above, but define the graph such that there is an edge between two page vertices if, and only if, the two pages have at least 5 categories in common.
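A sketch of the page-page graph via a self-join of the page-category edges; the threshold on shared categories is a parameter, so the same code covers both variants:

```python
from pyspark.sql import functions as F
from graphframes import GraphFrame

def pages_graph(g, min_common=1):
    # Page -> category edges as a plain dataframe
    pc = (g.edges.filter("type = 'is of category'")
           .select(F.col("src").alias("page"), F.col("dst").alias("cat")))

    # Self-join on category; count shared categories per page pair
    pairs = (pc.alias("a").join(pc.alias("b"), "cat")
               .filter(F.col("a.page") != F.col("b.page"))
               .groupBy(F.col("a.page").alias("src"), F.col("b.page").alias("dst"))
               .count()
               .filter(F.col("count") >= min_common)
               .select("src", "dst"))

    # The join keeps both directions of each pair, so the graph is
    # effectively undirected for PageRank
    return GraphFrame(g.vertices.filter("type = 'page'"), pairs)

for k in (1, 5):
    gp = pages_graph(g, min_common=k)
    ranks = gp.pageRank(resetProbability=0.15, maxIter=10)
    (ranks.vertices.orderBy(F.desc("pagerank"))
          .select("title", "pagerank").show(10))
```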
P2 Topic modelling
In this exercise, the task is to perform topic modelling based on Wikipedia data. You can use the seminar materials as a reference and apply any topic modelling model available in PySpark.
P2.1 Creating a document corpus
You need to create a corpus (a set of documents) based on Wikipedia pages. First, create a dataframe with the page id, title, text (limited to 3,000 words), and the category list retrieved from each Wikipedia page. You have some freedom to decide which sections to use when extracting the 3,000 words (for instance, you can skip the History section that most Wikipedia pages have). Keep only pages that have at least one category.
You must: i) show the top 10 rows of the dataframe; and ii) apply pre-processing steps to parse the data, remove stop words, and create the feature vectors and vocabulary. Show the first 20 feature vectors and the corresponding vocabulary entries.
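A sketch of a standard pre-processing pipeline using `pyspark.ml.feature`, assuming a `docs` dataframe with the columns described above (the `vocabSize` and `minDF` values are assumptions):

```python
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer

# Tokenise the page text on non-word characters
tokens = (RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
          .transform(docs))

# Drop English stop words
words = (StopWordsRemover(inputCol="tokens", outputCol="words")
         .transform(tokens))

# Term-count feature vectors plus the vocabulary
cv_model = CountVectorizer(inputCol="words", outputCol="features",
                           vocabSize=10000, minDF=5.0).fit(words)
corpus = cv_model.transform(words)

corpus.select("title", "features").show(20, truncate=False)
print(cv_model.vocabulary[:20])
```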
P2.2 Perform topic modelling
You should apply the LDA model (from PySpark) over the dataset to classify each document into a distribution of topics. Aim to identify 15 topics across all Wikipedia pages, with at least 10 words describing each topic. For each topic, show the top 10 words and their corresponding weights. For the first 15 pages in the dataframe, print the title, the most likely topic, and that topic's words. Your task is to analyse and discuss whether the topic and words actually represent or approximate the title. You can increase the sample size to better support your analysis, if necessary.
Tip for training: you can use any batch approach to train your model. The online version of LDA, for instance, can take a long time to finish, so you may want to avoid it.
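A sketch of batch LDA training with the EM optimiser (one batch approach, per the tip above), reusing `corpus` and `cv_model` from P2.1:

```python
from pyspark.ml.clustering import LDA

# 15 topics; the EM optimiser trains in batch mode
lda = LDA(k=15, maxIter=20, optimizer="em", featuresCol="features")
model = lda.fit(corpus)

# Top 10 words and weights per topic
vocab = cv_model.vocabulary
for row in model.describeTopics(maxTermsPerTopic=10).collect():
    top_words = [vocab[i] for i in row.termIndices]
    print(row.topic, list(zip(top_words, row.termWeights)))

# Per-document topic distributions, for matching titles against topics
model.transform(corpus).select("title", "topicDistribution").show(15, truncate=False)
```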
P3 Your graph data analytics question
Formulate a graph data analytics question. Your question should be summarised in one sentence. You may elaborate with some additional text, but your question should be well summarised in one sentence. The only input for your analysis must be the graphframe `g` defined in P1.3. Implement and evaluate your data analysis. Comment on the results.
Marking scheme
| Problem breakdown | Max points |
|---|---|
| P1-1 - Create a Vertex dataframe | 5 |
| P1-2 - Create an Edge dataframe | 5 |
| P1-3 - Basic graph queries | 5 |
| P1-4 - Contributions over categories | 10 |
| P1-5 - A graph of pages | 15 |
| P2-1 - Corpus and data pre-processing | 10 |
| P2-2 - Topic modelling | 15 |
| P3 | |
| - Use of graph queries and correctness of the solution | 10 |
| - Complexity of the graph data analytics question | 10 |
| - Originality of the question | 15 |
| Total | 100 |