Assignment
Simple Graph Structure-Based Search Engine
Prerequisite Knowledge
- Abstract Data Types
- Linked Lists (carry-over from COMP1511)
- Trees & Binary Search Trees
- Graphs
- Graph Basics
- Graph Representations
- Graph ADT
- Directed Graphs
- Sorting (not absolutely necessary, but helpful)
Background
In this assignment, your task is to implement a simple search engine using the well-known PageRank algorithm (simplified for this assignment, of course!). You should start by reading the Wikipedia article on PageRank (up to the section Damping Factor). These topics will also be discussed in lectures.
The main focus of this assignment is to build a graph structure, calculate PageRanks, and rank pages based on these values.
Mock web pages
To make things easier for you, you don't need to spend time crawling, collecting and parsing web pages for this assignment. Instead, you will be provided with a collection of mock web pages in the form of plain text files.
Each page has two sections:
- Section 1 contains URLs representing outgoing links. The URLs are separated by whitespace, and may be spread across multiple lines.
- Section 2 contains the actual content of the web page, and consists of zero or more words. Words are separated by whitespace, and may be spread across multiple lines.
Here is an example page:
    #start Section-1
    url2 url34 url1 url26 url52 url21
    url74 url6 url82
    #end Section-1

    #start Section-2
    Mars has long been the subject of human interest. Early telescopic
    observations revealed color changes on the surface that were attributed
    to seasonal vegetation and apparent linear features were ascribed to
    intelligent design.
    #end Section-2
Sections will always be delimited by lines containing `#start Section-1`, `#end Section-1`, `#start Section-2` and `#end Section-2`. There will be no other characters on these lines (no leading or trailing whitespace). There may be lines containing whitespace before or after any section.
Note that it is completely valid for Section 1 to be empty - this would mean that the page does not have any outgoing links.
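Since the delimiter lines contain no other characters, a line-by-line reader can detect them with a plain `strcmp` once the trailing newline is stripped. The sketch below is one way to track which section the parser is in; the enum and function name are illustrative, not required by the assignment.

```c
#include <stdio.h>
#include <string.h>

// Possible states while scanning a page file.
enum Section { NONE, SECTION_1, SECTION_2 };

// Returns the section that `line` switches the parser into,
// or the current section if the line is not a delimiter.
enum Section nextSection(enum Section curr, char *line) {
    line[strcspn(line, "\n")] = '\0';   // strip trailing newline
    if (strcmp(line, "#start Section-1") == 0) return SECTION_1;
    if (strcmp(line, "#end Section-1") == 0) return NONE;
    if (strcmp(line, "#start Section-2") == 0) return SECTION_2;
    if (strcmp(line, "#end Section-2") == 0) return NONE;
    return curr;
}
```

A caller would feed each `fgets` line through `nextSection` and treat the returned state as "where the following content belongs".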
Setting Up
Change into the directory you created for the assignment and run the following command:
unzip /web/cs2521/23T0/ass/downloads/files.zip
If you're working at home, download `files.zip` by clicking on the above link and then run the `unzip` command on the downloaded file.
If you've done the above correctly, you should now have the following files:
- `Makefile` - a set of dependencies used to control compilation
- `pagerank.c` - a stub file for Part 1
- `invertedIndex.c` - a stub file for Part 2
- `searchPagerank.c` - a stub file for Part 3
- `test1`, `test2` and `test3` - directories containing test data
Note that in this assignment, you are permitted to create as many supporting files as you like. This allows you to compartmentalise your solution into different files. For example, you could implement a Graph ADT in `Graph.c` and `Graph.h`. You are encouraged to use a `Makefile` for this assignment, to make compilation faster and easier for you. The provided `Makefile` should be sufficient for most cases.
File Reading
You will be required to read files in this assignment. In case you are unfamiliar with file IO, we have provided some sample programs in the `file_io_demo` directory that demonstrate the usage of the main file IO functions: `fopen`, `fclose`, `fscanf` and `fgets`. These are the only file IO functions you will need to use in this assignment.
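As a minimal illustration of the `fopen`/`fgets`/`fclose` pattern, the helper below reads a file line by line; `countLines` is just an illustrative name, not part of the assignment.

```c
#include <stdio.h>
#include <string.h>

// Reads the named file line by line and returns how many lines it
// contains, or -1 if the file cannot be opened.
// Note that fgets keeps the trailing '\n' in the buffer, so callers
// that inspect line content usually strip it first.
int countLines(const char *filename) {
    FILE *fp = fopen(filename, "r");
    if (fp == NULL) return -1;
    char line[1024];
    int count = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        count++;
    }
    fclose(fp);
    return count;
}
```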
Part 1 - PageRank Calculation
Your task is to implement a program in `pagerank.c` that reads data from a given collection of pages in the file `collection.txt`, builds a graph structure using either an adjacency list or adjacency matrix from this data, and calculates the PageRank for every URL in the collection using the algorithm described below.
The URLs in `collection.txt` are separated by whitespace, and may be spread across multiple lines. Here is an example `collection.txt`:
url11 url21 url22 url23 url31 url24 url34
For each URL in the collection, there will be a text file that contains the associated page. To get the name of the file, simply append `.txt` to the URL. For example, the page associated with `url24` is contained in `url24.txt`.
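Putting the two previous points together, a reader can `fscanf` each URL out of `collection.txt` and build the corresponding filename with `snprintf`. The sketch below only prints each derived filename; a real solution would open and parse each file instead. The function name, buffer sizes, and `MAX_URL` are illustrative assumptions.

```c
#include <stdio.h>
#include <string.h>

#define MAX_URL 100

// Reads every URL in the given collection file, derives each page's
// filename by appending ".txt", and prints it. Returns the number of
// URLs read, or -1 if the collection file cannot be opened.
int listPages(const char *collectionFile) {
    FILE *fp = fopen(collectionFile, "r");
    if (fp == NULL) return -1;
    char url[MAX_URL];
    int n = 0;
    while (fscanf(fp, "%99s", url) == 1) {
        char filename[MAX_URL + 8];
        snprintf(filename, sizeof filename, "%s.txt", url);
        printf("%s\n", filename);
        n++;
    }
    fclose(fp);
    return n;   // number of URLs in the collection
}
```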
You need to read all the pages and build a graph structure using a graph representation of your choice. Then, you must use the algorithm below to calculate the PageRank for each page.
- Read pages from the collection in file `collection.txt` and build a graph structure using a graph representation of your choice
- N = number of URLs in the collection
- For each URL p_i in the collection:
    - PR(p_i, 0) = 1/N
- iteration = 0, diff = diffPR
- While iteration < maxIterations and diff >= diffPR:
    - For each URL p_i in the collection:
        - PR(p_i, t + 1) = (1 - d)/N + d × Σ (PR(p_j, t) / L(p_j)) over all p_j in M(p_i)
    - diff = Σ |PR(p_i, t + 1) - PR(p_i, t)| over all URLs p_i in the collection
    - iteration++
In the above algorithm:
- M(p_i) is the set of pages with outgoing links to p_i (ignore self-loops and parallel edges)
- L(p_j) is the out-degree of page p_j
- t and t + 1 correspond to iteration numbers. For example, PR(p_i, 2) means "the PageRank of p_i after iteration 2".
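The loop above could be sketched in C as follows. This assumes an adjacency-matrix representation where `adj[j][i]` is non-zero if page j links to page i (self-loops and parallel edges already removed) and `outdeg[j]` is the out-degree of page j; the function and parameter names are illustrative, and a real solution would use whichever graph representation you chose.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Fills pr[0..n-1] with PageRanks and returns the number of
// iterations performed.
int pagerank(int n, int **adj, int *outdeg, double *pr,
             double d, double diffPR, int maxIterations) {
    for (int i = 0; i < n; i++) pr[i] = 1.0 / n;   // PR(p_i, 0) = 1/N

    double *next = malloc(n * sizeof(double));
    int iteration = 0;
    double diff = diffPR;                          // to enter the loop
    while (iteration < maxIterations && diff >= diffPR) {
        diff = 0.0;
        for (int i = 0; i < n; i++) {
            // sum PR(p_j, t) / L(p_j) over pages p_j linking to p_i
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                if (adj[j][i] && outdeg[j] > 0) {
                    sum += pr[j] / outdeg[j];
                }
            }
            next[i] = (1.0 - d) / n + d * sum;
            diff += fabs(next[i] - pr[i]);
        }
        for (int i = 0; i < n; i++) pr[i] = next[i];
        iteration++;
    }
    free(next);
    return iteration;
}
```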
`pagerank.c` must take three command-line arguments: d (damping factor), diffPR (sum of PageRank differences) and maxIterations (maximum number of iterations). Using the algorithm described in this section, it must calculate the PageRank for every page in the collection.
Here is an example of how the program is run:
./pagerank 0.85 0.00001 1000
The program should output (to a file named `pagerankList.txt`) a list of pages, one per line, along with some information about each page. Each line should begin with the URL of a page, followed by its out-degree and PageRank (using the format string `"%.7lf"`). The values should be comma-separated. Lines should be sorted in descending order by PageRank.
Here is an example `pagerankList.txt`:

    url31, 3, 0.2623546
    url21, 1, 0.1843112
    url34, 6, 0.1576851
    url22, 4, 0.1520093
    url32, 6, 0.0925755
    url23, 4, 0.0776758
    url11, 3, 0.0733884
Use the format string `"%.7lf"` to output PageRank values. Please note that your PageRank values might be slightly different to those provided in the sample tests, due to the way you carry out calculations. However, make sure that your PageRank values match the expected values to the first 6 decimal places. For example, if the expected value was 0.0776758, your value could be 0.077675x, where x could be any digit.
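A small helper keeps the output format in one place, so the comma separators and `"%.7lf"` precision can't drift between lines; the function name is illustrative.

```c
#include <stdio.h>
#include <string.h>

// Writes one line of pagerankList.txt in the required format:
// URL, out-degree, PageRank (with "%.7lf"), comma-separated.
void writeLine(FILE *out, const char *url, int outdeg, double pr) {
    fprintf(out, "%s, %d, %.7lf\n", url, outdeg, pr);
}
```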
All the sample files were generated using the following command:
./pagerank 0.85 0.00001 1000
Part 2 - Inverted Index
Your task is to implement a program in `invertedIndex.c` that reads data from a given collection of pages in `collection.txt` and generates an "inverted index" that provides a sorted list (set) of URLs for every word in the collection.
One possible (and recommended) implementation of an inverted index is a binary search tree where each node contains a word and a corresponding file list, which is a linked list. Each node of the file list contains the filename of a file that contains the word. The inverted index tree must be ordered alphabetically (ascending) by word, and each file list must be sorted alphabetically (ascending) by filename.
The full inverted index will contain many inverted index nodes arranged in a binary search tree (ordered alphabetically by word), and each inverted index node will contain a file list. For example:
In this diagram, each of the blue linked lists contains the URLs of the pages which contain the corresponding word. Note that this is only a suggestion for how to implement this part. As long as your implementation works, and doesn't exceed the time limit of the autotests, you can receive full marks.
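Under the suggested BST-plus-linked-list layout, the types and word insertion might be sketched as follows. All names are illustrative, and the file-list insertion (which must keep filenames sorted and duplicate-free) is left out for brevity.

```c
#include <stdlib.h>
#include <string.h>

// A node of the sorted file list attached to each word.
struct FileNode {
    char *filename;
    struct FileNode *next;
};

// A node of the inverted index BST, ordered alphabetically by word.
struct IndexNode {
    char *word;
    struct FileNode *files;            // sorted ascending by strcmp
    struct IndexNode *left, *right;
};

// Inserts a word into the BST (creating a node if it is not already
// present) and returns the possibly-new subtree root.
struct IndexNode *indexInsert(struct IndexNode *root, const char *word) {
    if (root == NULL) {
        struct IndexNode *node = malloc(sizeof(struct IndexNode));
        node->word = malloc(strlen(word) + 1);
        strcpy(node->word, word);
        node->files = NULL;
        node->left = node->right = NULL;
        return node;
    }
    int cmp = strcmp(word, root->word);
    if (cmp < 0) root->left = indexInsert(root->left, word);
    else if (cmp > 0) root->right = indexInsert(root->right, word);
    return root;
}
```

An in-order traversal of this tree visits the words in ascending `strcmp` order, which is exactly the order required for the output file.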
Before inserting words into your inverted index, you must normalise them by converting all letters to lowercase and removing the following punctuation marks from the end of the words:
- `.` (dot)
- `,` (comma)
- `:` (colon)
- `;` (semicolon)
- `?` (question mark)
- `*` (asterisk)
If there are multiple of the above punctuation marks at the end of a word, you should remove all of them. Punctuation marks at the start of a word or in the middle of a word should not be removed. Other punctuation marks should not be removed. If a word becomes empty after removing punctuation marks, then it should not be inserted into the inverted index.
Here are some examples of normalisation:
| Word | Normalised word |
|---|---|
| Data | data |
| BSTs | bsts |
| algorithms. | algorithms |
| Why? | why |
| graphs*. | graphs |
| .NET | .net |
| unsw.edu.au. | unsw.edu.au |
| Sydney! | sydney! |
| .,!.,:; | .,! |
| new...........s | new...........s |
| * | (empty word) |
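The normalisation rules above could be implemented as a small in-place helper; the function name is illustrative. Lowercasing happens first, then the listed punctuation marks are stripped from the end only, stopping at the first character that is not one of them.

```c
#include <ctype.h>
#include <string.h>

// Normalises a word in place: converts every letter to lowercase,
// then removes '.', ',', ':', ';', '?' and '*' from the END of the
// word only. Punctuation elsewhere in the word is untouched.
void normalise(char *word) {
    for (char *p = word; *p != '\0'; p++) {
        *p = tolower((unsigned char)*p);
    }
    size_t len = strlen(word);
    while (len > 0 && strchr(".,:;?*", word[len - 1]) != NULL) {
        word[--len] = '\0';
    }
}
```

After calling this, a caller should check for the empty string and skip insertion in that case.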
Here is an example of how the program is run:
./invertedIndex
Your program should output an inverted index to a file named `invertedIndex.txt`. Each line of the output should begin with a word from the inverted index, followed by a space-separated list of filenames. Lines should be ordered alphabetically (ascending) by the initial word, and each list of files should be ordered alphabetically (ascending) by filename.
Here is an example to demonstrate the expected format:
    design url2 url25 url31 url61
    mars url101 url25 url31
    vegetation url31 url61
On each line, a word and all the URLs should be separated by one or more spaces. The testing program will ignore additional spaces.
Assumptions/Constraints
- Relevant assumptions from previous parts apply.
- When we say "alphabetical order", we mean as determined by `strcmp`.
- All files required by the program (i.e., the collection file and all the files listed in the collection file) are valid readable/openable text files, and will be in the current working directory when the program is executed.
- Words are at most 1000 characters in length.
Part 3 - Simple Search Engine
Your task is to implement a program in `searchPagerank.c` that takes one or more search terms and outputs relevant URLs.
The program must make use of the data available in two files: `invertedIndex.txt` and `pagerankList.txt`. All other text files (both the collection file and all the individual page files) are not relevant for this task and will not be present during testing.
`invertedIndex.txt` contains data about which pages contain which words, as generated in Part 2. The format of this file is the same as the expected format of the output produced by your `invertedIndex` program in Part 2.
`pagerankList.txt` contains information about pages, including their URL, out-degree and PageRank. The format of this file is the same as the expected format of the output produced by your `pagerank` program in Part 1. To simplify parsing, you may assume that the URL, out-degree and PageRank are separated by `", "` (that is, a comma followed by one space).
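Given that separator guarantee, a single `sscanf` call can parse one line of `pagerankList.txt`; the helper name is illustrative, and the URL buffer is assumed to hold at least 100 characters.

```c
#include <stdio.h>
#include <string.h>

// Parses one line of pagerankList.txt, e.g. "url31, 3, 0.2623546".
// %99[^,] reads the URL up to the first comma; the literal ", " in
// the format then consumes the separators. Returns 1 on success.
int parsePagerankLine(const char *line, char *url, int *outdeg, double *pr) {
    return sscanf(line, "%99[^,], %d, %lf", url, outdeg, pr) == 3;
}
```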
The program should take search terms as command-line arguments, find pages with one or more matching search terms, and output the top 30 pages in descending order of the number of matching search terms to stdout. Matching search terms are counted without duplication: if a URL has `mars` appear in Section 2 five times, and nothing else, searching for `mars` and `design` would count as one matched search term, not five. (You will not have access to any individual URL file in Part 3, so you won't be able to determine how many occurrences there are regardless.) Pages with the same number of matching search terms should be sorted in descending order by their PageRank.
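This two-level ordering (match count descending, then PageRank descending) maps naturally onto a `qsort` comparator. The struct and names below are illustrative, not a required design.

```c
#include <stdlib.h>
#include <string.h>

// A candidate result page: how many distinct search terms matched it,
// and its PageRank read from pagerankList.txt.
struct Result {
    char *url;
    int matches;
    double pagerank;
};

// qsort comparator: descending by matches, then descending by PageRank.
int cmpResult(const void *a, const void *b) {
    const struct Result *ra = a;
    const struct Result *rb = b;
    if (ra->matches != rb->matches) return rb->matches - ra->matches;
    if (ra->pagerank < rb->pagerank) return 1;
    if (ra->pagerank > rb->pagerank) return -1;
    return 0;
}
```

After sorting, printing the first 30 entries (or fewer, if there aren't 30 matches) completes the output.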
Here is an example output:
    ./searchPagerank mars design
    url31
    url25
Note that unlike more sophisticated search engines, your simple search engine program should not treat variants of a word as the same word. For example, "observe" and "observing" should be treated as different words. Words and their plural forms, such as "planet" and "planets," should also be treated as different words.
We will test this part independently from your solutions to Part 1 and Part 2. This means that we will use a reference solution with a correct `pagerank.c` and `invertedIndex.c` to generate the `pagerankList.txt` and `invertedIndex.txt` used with your program in this part.
Testing
We have provided three test cases for you to use. These test cases are very basic, but will give you a general idea whether your solution is working or not. You are expected to do your own testing.
To run these tests, a bash script is provided, which will change into each `test*` directory and run the test. To use it, assuming the script `runtests` is in your current directory, do the following:
    ./runtests
    ========== Running test 1 ==========
    ========== Test 1: ./pagerank 0.85 0.00001 1000 ==========
    Outputs match!
    ========== Test 1: ./invertedIndex ==========
    Outputs match!
    ========== Test 1: ./searchPagerank mars design ==========
    Outputs match!
    ========== Running test 2 ==========
    ========== Test 2: ./pagerank 0.85 0.00001 1000 ==========
    Outputs match!
    ========== Test 2: ./invertedIndex ==========
    Outputs match!
    ========== Test 2: ./searchPagerank volcano pretzel hello design ==========
    Outputs match!
    ========== Running test 3 ==========
    ========== Test 3: ./pagerank 0.85 0.00001 1000 ==========
    Outputs match!
    ========== Test 3: ./invertedIndex ==========
    Outputs match!
    ========== Test 3: ./searchPagerank volcano nasa ==========
    Outputs match!
    Total tests: 9
    Passed tests: 9
    Failed tests: 0
    Good job!
If you would only like to run specific tests, you can list them as arguments to the script. For example, to run only test 1 and test 3:
    ./runtests 1 3
    ========== Running test 1 ==========
    ========== Test 1: ./pagerank 0.85 0.00001 1000 ==========
    Outputs match!
    ========== Test 1: ./invertedIndex ==========
    Outputs match!
    ========== Test 1: ./searchPagerank mars design ==========
    Outputs match!
    ========== Running test 3 ==========
    ========== Test 3: ./pagerank 0.85 0.00001 1000 ==========
    Outputs match!
    ========== Test 3: ./invertedIndex ==========
    Outputs match!
    ========== Test 3: ./searchPagerank volcano nasa ==========
    Outputs match!
    Total tests: 6
    Passed tests: 6
    Failed tests: 0
    Good job!
You may also make your own tests, by creating a directory starting with `test`. Any directory starting with `test` will be included as a test when you run the `runtests` script. Testing Part 1 manually may be difficult, as the PageRanks aren't easy to calculate by hand, but you may still want to test Parts 2 and 3 using your own test directories. If a `*.exp` file is missing, the script will simply skip that part for that test. A typical test directory will look like this:
    test4/
        collection.txt
        (url files)
        pagerankList.exp
        invertedIndex.exp
        searchTerms.txt
        searchPagerank.exp
Testing Part 3 requires an extra file beyond the expected output: the `searchTerms.txt` file contains the terms to search for, separated by spaces, on a single line. Take a look at the three provided test cases to see how this works.
Debugging
Debugging is an important and inevitable aspect of programming. We expect everyone to have an understanding of basic debugging methods, including using print statements, knowing basic GDB commands and running Valgrind. In the forum and in help sessions, tutors will expect you to have made a reasonable attempt to debug your program yourself using these methods before asking for help, and you should be able to explain your debugging attempts.
You can learn about GDB and Valgrind in the Debugging with GDB and Valgrind lab exercise.