Final Exam: Inverted index and information retrieval with Spark
Published Date: May 12, 2024
Due Date: May 20, 2024 @ 5:00pm
Description:
************************************************ This is an individual assignment. ************************************************
Overview and Assignment Goals:
The objectives of this assignment are the following:
- Use Apache Spark to build an inverted index.
- Retrieve a list of 5 documents per query, ranked by cosine similarity to the query.
- Perform the above tasks on at least 2 Spark workers.
Detailed Description:
Build an inverted index and retrieve relevant documents for the queries.
Information retrieval is the science of searching for information in a document or collection of documents. In this assignment you are given a collection of documents and a set of queries. The main tasks for this assignment are:
- Use Apache Spark to pre-process the documents in parallel.
- Build an inverted index for the collection of documents.
- Try adding more information to the inverted index to support weighting schemes such as TF-IDF (see the sketch below).
- Use precision and recall measures to evaluate the relevance of query results.
- Fine-tune your approach on the training set of documents and queries.
- Submit your ranking lists for the test set of queries.
You can use any libraries in this assignment, except for building and querying the inverted index.
The goal of this assignment is to allow you to develop an inverted index and information retrieval methods that return a set of relevant documents to a query.
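For illustration only, below is a minimal PySpark sketch of one way to build an inverted index with TF-IDF weights from docs.dat. It is not a required design: the file name and tab-separated format follow the Data Description section, while the naive lowercase/whitespace tokenizer, the TF-IDF formula, and all variable names are assumptions you are free to change.

    # Minimal sketch, not a required design. Assumes docs.dat lines look like
    # "DocumentID<TAB>Document text" (see Data Description) and uses a naive
    # lowercase/whitespace tokenizer; a real solution would add stop-word
    # removal, stemming, etc.
    import math
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("inverted-index").getOrCreate()
    sc = spark.sparkContext

    # (doc_id, [token, token, ...])
    docs = (sc.textFile("docs.dat")
              .map(lambda line: line.split("\t", 1))
              .map(lambda pair: (pair[0], pair[1].lower().split())))
    num_docs = docs.count()

    # Term frequencies keyed by (term, doc_id)
    tf = (docs.flatMap(lambda pair: [((term, pair[0]), 1) for term in pair[1]])
              .reduceByKey(lambda a, b: a + b))

    # Document frequencies keyed by term
    df = tf.map(lambda kv: (kv[0][0], 1)).reduceByKey(lambda a, b: a + b)

    # Inverted index with TF-IDF weights: term -> [(doc_id, weight), ...]
    postings = (tf.map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
                  .join(df)
                  .mapValues(lambda v: (v[0][0], v[0][1] * math.log(num_docs / v[1])))
                  .groupByKey()
                  .mapValues(list))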
To evaluate the performance of your results we will use the precision@5 and recall@5 metrics (i.e., precision and recall computed over the five top-ranked documents you return for each query).
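As a concrete example of how these metrics are computed for a single query (function and variable names are illustrative only):

    # Hedged sketch: precision@k / recall@k for one query, where `retrieved`
    # is your ranked list of document IDs and `relevant` is the set of
    # relevant document IDs from train.rel.
    def precision_recall_at_k(retrieved, relevant, k=5):
        hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
        precision = hits / k
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # If 2 of your top-5 results are relevant and the query has 4 relevant
    # documents overall: precision@5 = 2/5 = 0.4, recall@5 = 2/4 = 0.5.
    p, r = precision_recall_at_k(["d3", "d7", "d1", "d9", "d2"],
                                 {"d3", "d2", "d5", "d8"})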
Caveats:
+ Use the methods and data-structure knowledge you have gained so far, wisely, to optimize your results.
+ The default memory assigned to the Spark runtime may not be enough to process this data file, depending on how you write your algorithm. If your program fails with
java.lang.OutOfMemoryError: Java heap space
then you'll need to increase the memory assigned to the Spark runtime. In general, Spark's driver memory is set with the --driver-memory option. On the HPC we will be running Spark on top of Slurm, so Spark can only get as much memory as Slurm allocates to the job.
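For example, if you create the SparkSession yourself (the memory values below are placeholders, not required settings; keep them within your Slurm allocation):

    # Hypothetical example of requesting more memory when building the
    # SparkSession in your own script. If you launch through spark-submit
    # instead, pass --driver-memory 8g (and --executor-memory as needed) on
    # the command line, since driver memory cannot always be changed after
    # the driver JVM has started.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("final-exam")
             .config("spark.driver.memory", "8g")
             .config("spark.executor.memory", "8g")
             .getOrCreate())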
Data Description:
The document dataset consists of 3612 documents (docs.dat). You are provided with a set of training queries (train.queries) and the relevant document IDs for each training query (train.rel), and a set of test queries (test.queries). The data are provided as text in docs.dat, which should be processed appropriately.
docs.dat: Document set (Document ID <tab separator> Document text).
train.queries: Training query set (Query ID <tab separator> Query text).
test.queries: Test query set (Query ID <tab separator> Query text).
train.rel: Relevant documents for the training queries (Training Query ID <tab separator> Document ID).
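For illustration, the query and relevance files can be read the same way as docs.dat. This sketch assumes the SparkContext `sc` from the earlier sketch and naive tab splitting; the variable names are illustrative only.

    # (query_id, query_text)
    queries = (sc.textFile("train.queries")
                 .map(lambda line: tuple(line.split("\t", 1))))

    # (query_id, {doc_id, ...}) - relevant documents per training query
    relevance = (sc.textFile("train.rel")
                   .map(lambda line: tuple(line.split("\t")))
                   .groupByKey()
                   .mapValues(set))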
-------------------------
TO SUBMIT OUTPUT FILE: <YourSJSU_ID>.rel: Relevant documents for the test queries
(Test Query ID <tab separator> Document ID), five documents per query.
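For illustration, a minimal driver-side sketch of ranking documents by cosine similarity and writing this file follows. It assumes the `postings` RDD and `sc` from the earlier sketches; the simplified query weighting, the collect-to-driver approach, and all names are assumptions, not a prescribed solution.

    import math
    from collections import defaultdict

    index = dict(postings.collect())          # term -> [(doc_id, weight), ...]

    # Pre-compute document vector norms for the cosine denominator.
    doc_norm = defaultdict(float)
    for plist in index.values():
        for doc_id, w in plist:
            doc_norm[doc_id] += w * w
    doc_norm = {d: math.sqrt(s) for d, s in doc_norm.items()}

    def top5(query_text):
        # Dot product with unit query-term weights; dividing by the query
        # norm is omitted because it does not change the per-query ranking.
        scores = defaultdict(float)
        for term in query_text.lower().split():
            for doc_id, w in index.get(term, []):
                scores[doc_id] += w
        ranked = sorted(scores,
                        key=lambda d: scores[d] / max(doc_norm[d], 1e-9),
                        reverse=True)
        return ranked[:5]

    test_queries = (sc.textFile("test.queries")
                      .map(lambda line: tuple(line.split("\t", 1)))
                      .collect())

    with open("123456789.rel", "w") as out:   # use your own SJSU ID
        for qid, qtext in test_queries:
            for doc_id in top5(qtext):
                out.write(qid + "\t" + doc_id + "\n")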
Deliverables:
- Valid submissions (report, code and output) to Canvas. Zip all files in a .zip archive.
- Canvas submission of the report:
- The report should describe the steps you followed to develop your final solution. Be sure to include the following in the report (no more than 2 pages):
o Name, SJSU ID
o Data pre-processing steps and other approaches.
o A description of the algorithm(s) you used as well as any associated parameters.
- Failure to follow the above guidelines (missing files, etc.) will cost you points.
Rules:
- This is an individual assignment. Discussions with colleagues are not allowed, and any copying of prediction files and source code will result in an honor code violation.
- The inverted index should be implemented in Spark. Feel free to use the programming language of your choice for this assignment (Python, Scala, or Java).
- No late submissions will be allowed.
Grading:
- Grading for the assignment will be split between your implementation (70%) and your report (30%).
- Implementation:
o Preprocessing (10)
o Building of inverted index (15)
o Weighting document vectors (15)
o Similarity calculation (15)
o Output generation (15)
- Report:
o Data pre-processing steps and other approaches. (15)
o A description of the algorithm(s) you used as well as any associated parameters. (15)
Attachment Files: On Canvas