Final Exam: Inverted index and information retrieval with Spark
Published Date: May 12, 2024
Due Date: May 20, 2024 @ 5:00pm
Description:
************************************************ This is an individual assignment. ************************************************
Overview and Assignment Goals:
The objectives of this assignment are the following:
- Use Apache Spark to build an inverted index.
- Retrieve a list of 5 documents per query, ranked by cosine similarity to the query.
- Perform the above tasks on at least 2 Spark workers.
Detailed Description:
Build an inverted index and retrieve relevant documents for the queries.
Information retrieval is the science of searching for information in a document or collection of documents. In this assignment you are given a collection of documents and a set of queries. The main tasks for this assignment are:
- Use Apache Spark to pre-process the documents in parallel.
- Build an inverted index for the collection of documents.
- Try adding more information to the inverted index to support weighting schemes such as TF-IDF (see the sketch below).
- Use precision and recall measures to evaluate the relevance of query results.
- Fine-tune your approach on the training set of documents and queries.
- Submit your ranking lists for the test set of queries.
You can use any libraries in this assignment, except for building and querying the inverted index.
The goal of this assignment is to allow you to develop an inverted index and information retrieval methods that return a set of relevant documents to a query.
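For illustration only, below is a minimal PySpark sketch of one way to build an inverted index with TF-IDF weights from docs.dat. It is not a required design: the file name and tab-separated format follow the Data Description section, while the naive lowercase/whitespace tokenizer, the TF-IDF formula, and all variable names are assumptions you are free to change.

    # Minimal sketch, not a required design. Assumes docs.dat lines look like
    # "DocumentID<TAB>Document text" (see Data Description) and uses a naive
    # lowercase/whitespace tokenizer; a real solution would add stop-word
    # removal, stemming, etc.
    import math
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("inverted-index").getOrCreate()
    sc = spark.sparkContext

    # (doc_id, [token, token, ...])
    docs = (sc.textFile("docs.dat")
              .map(lambda line: line.split("\t", 1))
              .map(lambda pair: (pair[0], pair[1].lower().split())))
    num_docs = docs.count()

    # Term frequencies keyed by (term, doc_id)
    tf = (docs.flatMap(lambda pair: [((term, pair[0]), 1) for term in pair[1]])
              .reduceByKey(lambda a, b: a + b))

    # Document frequencies keyed by term
    df = tf.map(lambda kv: (kv[0][0], 1)).reduceByKey(lambda a, b: a + b)

    # Inverted index with TF-IDF weights: term -> [(doc_id, weight), ...]
    postings = (tf.map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
                  .join(df)
                  .mapValues(lambda v: (v[0][0], v[0][1] * math.log(num_docs / v[1])))
                  .groupByKey()
                  .mapValues(list))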
To evaluate the performance of your results we will use the precision@5 and recall@5 metrics (i.e., precision and recall computed over the five top-ranked documents you return for each query).
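As a concrete example of how these metrics are computed for a single query (function and variable names are illustrative only):

    # Hedged sketch: precision@k / recall@k for one query, where `retrieved`
    # is your ranked list of document IDs and `relevant` is the set of
    # relevant document IDs from train.rel.
    def precision_recall_at_k(retrieved, relevant, k=5):
        hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
        precision = hits / k
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # If 2 of your top-5 results are relevant and the query has 4 relevant
    # documents overall: precision@5 = 2/5 = 0.4, recall@5 = 2/4 = 0.5.
    p, r = precision_recall_at_k(["d3", "d7", "d1", "d9", "d2"],
                                 {"d3", "d2", "d5", "d8"})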
Caveats:
+ Use the methods and data-structure knowledge you have gained so far, wisely, to optimize your results.
+ The default memory assigned to the Spark runtime may not be enough to process this data file, depending on how you write your algorithm. If your program fails with
java.lang.OutOfMemoryError: Java heap space
then you'll need to increase the memory assigned to the Spark runtime. In general, Spark's driver memory is set with the --driver-memory option. On the HPC we will be running Spark on top of Slurm, so Spark can only get as much memory as Slurm allocates to the job.
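For example, if you create the SparkSession yourself (the memory values below are placeholders, not required settings; keep them within your Slurm allocation):

    # Hypothetical example of requesting more memory when building the
    # SparkSession in your own script. If you launch through spark-submit
    # instead, pass --driver-memory 8g (and --executor-memory as needed) on
    # the command line, since driver memory cannot always be changed after
    # the driver JVM has started.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("final-exam")
             .config("spark.driver.memory", "8g")
             .config("spark.executor.memory", "8g")
             .getOrCreate())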
Data Description:
The document dataset consists of 3612 documents (docs.dat). You are provided with a set of training queries (train.queries) and the relevant document IDs for each training query (train.rel), and a set of test queries (test.queries). The data are provided as text in docs.dat, which should be processed appropriately.
docs.dat: Document set (Document ID <tab separator> Document text).
train.queries: Training query set (Query ID <tab separator> Query text).
test.queries: Test query set (Query ID <tab separator> Query text).
train.rel: Relevant documents for the training queries (Training Query ID <tab separator> Document ID).
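For illustration, the query and relevance files can be read the same way as docs.dat. This sketch assumes the SparkContext `sc` from the earlier sketch and naive tab splitting; the variable names are illustrative only.

    # (query_id, query_text)
    queries = (sc.textFile("train.queries")
                 .map(lambda line: tuple(line.split("\t", 1))))

    # (query_id, {doc_id, ...}) - relevant documents per training query
    relevance = (sc.textFile("train.rel")
                   .map(lambda line: tuple(line.split("\t")))
                   .groupByKey()
                   .mapValues(set))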
-------------------------
TO SUBMIT OUTPUT FILE: <YourSJSU_ID>.rel: Relevant documents for the test queries
(Test Query ID <tab separator> Document ID), five documents per query.
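For illustration, a minimal driver-side sketch of ranking documents by cosine similarity and writing this file follows. It assumes the `postings` RDD and `sc` from the earlier sketches; the simplified query weighting, the collect-to-driver approach, and all names are assumptions, not a prescribed solution.

    import math
    from collections import defaultdict

    index = dict(postings.collect())          # term -> [(doc_id, weight), ...]

    # Pre-compute document vector norms for the cosine denominator.
    doc_norm = defaultdict(float)
    for plist in index.values():
        for doc_id, w in plist:
            doc_norm[doc_id] += w * w
    doc_norm = {d: math.sqrt(s) for d, s in doc_norm.items()}

    def top5(query_text):
        # Dot product with unit query-term weights; dividing by the query
        # norm is omitted because it does not change the per-query ranking.
        scores = defaultdict(float)
        for term in query_text.lower().split():
            for doc_id, w in index.get(term, []):
                scores[doc_id] += w
        ranked = sorted(scores,
                        key=lambda d: scores[d] / max(doc_norm[d], 1e-9),
                        reverse=True)
        return ranked[:5]

    test_queries = (sc.textFile("test.queries")
                      .map(lambda line: tuple(line.split("\t", 1)))
                      .collect())

    with open("123456789.rel", "w") as out:   # use your own SJSU ID
        for qid, qtext in test_queries:
            for doc_id in top5(qtext):
                out.write(qid + "\t" + doc_id + "\n")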
Deliverables:
- Valid submissions (report, code and output) to Canvas. Zip all files in a .zip archive.
- Canvas submission of the report:
- The report should describe the steps you followed to develop your final solution. Be sure to include the following in the report (no more than 2 pages):
o Name, SJSU ID
o Data pre-processing steps and other approaches.
o A description of the algorithm(s) you used as well as any associated parameters.
- Failure to follow the above guidelines (missing files, etc.) will cost you points.
Rules:
- This is an individual assignment. Discussions with colleagues are not allowed, and any copying of prediction files and source code will result in an honor code violation.
- The inverted index should be implemented in Spark. Feel free to use the programming language of your choice for this assignment (Python, Scala, or Java).
- No late submissions will be allowed.
Grading:
- Grading for the assignment will be split between your implementation (70%) and your report (30%).
- Implementation:
o Preprocessing (10)
o Building of inverted index (15)
o Weighting document vectors (15)
o Similarity calculation (15)
o Output generation (15)
- Report:
o Data pre-processing steps and other approaches. (15)
o A description of the algorithm(s) you used as well as any associated parameters. (15)
Attachment Files: On Canvas