1. Homepage
  2. Programming
  3. COMPSCI 753: Algorithms for Massive Data Assignment 1: Locality-sensitive Hashing

COMPSCI 753: Algorithms for Massive Data Assignment 1: Locality-sensitive Hashing

Engage in a Conversation
COMPSCI 753Algorithms for Massive Data澳洲AucklandPythonSearch QualityF1-scoreMinhash

COMPSCI 753: Algorithms for Massive Data Assignment 1: Locality-sensitive Hashing (Worth 5 Pts in Total) CourseNana.COM


CourseNana.COM

Due date: 23:59 14 August 2022 CourseNana.COM

Learning Objectives: The goal of this assignment is to investigate Locality-sensitive Hashing (LSH) framework for near neighbor search on real-world datasets. We have learned how to construct hash tables using multiple hash functions in LSH family for near neighbor retrieval in sublinear time during weeks 2-3. By accomplishing this assignment, you will be familiar with the following concepts: CourseNana.COM

  • Hash Table Construction
  • Hash Table Lookup
  • Search Quality Evaluation

General Instruction: CourseNana.COM

This core component in this assignment is to construct a document retrieval system upon the LSH framework. This assignment consists of three parts. Please write a python program to complete the following components: CourseNana.COM

  • Part I: Construct LSH Hash Tables for All News Articles
  • Part II: Perform Nearest Neighbor Search for Query Dataset
  • Part III: Investigate the Impact of the hash size (k). Plots the Search Quality in F1-score.

Datasets: CourseNana.COM

Let’s consider two classes of BBC news articles: tech news and entertainment news. In bitvector_all.csv and bitvector_query.csv, each news article is a tab-separated line with three columns: <news_id\tword_features\tnews class>, where news_id is a unique string ID, word_features is a sequence of tab-separated binary values. Each entry in the word_features refers to the occurrence of a token. You can find their original new articles in text_all.csv and text_query.csv, accordingly in bbc.zip on Canvas. CourseNana.COM

Submission:
Please submit a single report (.pdf) and the source code with detailed comments.(.py or .ipynb or .html) on Canvas by 23:59, Sunday 14 August 2022. The answer file must contain your studentID, UPI and name. CourseNana.COM

Penalty Dates: CourseNana.COM

The assignment will not be accepted after the last penalty date unless there are special circumstances (e.g., sickness with medical certificate, family/personal emergencies). Penalties will be calculated as follows as a percentage of the mark for the assignment. CourseNana.COM

  • By 23:59, Sunday 14 August 2022 (No penalty)
  • By 23:59, Monday 15 August 2022 (25% penalty)
  • By 23:59, Tuesday 16 August 2022 (50% penalty)

Part I: Construct LSH Hash Tables for All News Articles [40 pts] CourseNana.COM

(a) Load bitvector_all.csv and bitvector_query.csv. Construct a feature vector for each news article in the dataset. Please report the number of articles, and the number of features (?) for these two sets of data. [5 pts] CourseNana.COM

(b) Construct a family of MinHash functions in the LSH family by taking a prime ?? and for 0 < ? < ?, 0 ≤ ? < ? with the number of tables (l=10) and a tunable choice of hash size (k). Please report the family of MinHash functions you have generated with l=10 and k=2. [15 pts] CourseNana.COM

(c) Construct LSH hash tables using your hash functions with the number of tables (l=10) and bucket size of your choice (m). Please report the collision distribution of the l hash tables with all documents hashed into m buckets using heatmap plot, where x-axis is m, y-axis is l=10, and the values at (m,l) refers to the number of colliding articles). [20 pts] CourseNana.COM

Part II: Nearest Neighbor Search [35 pts] CourseNana.COM

(a) Query the LSH tables and return the top-10 articles that have the highest Jaccard similarities as the answer. For each query document q in our queries dataset ?, firstly, find the set of articles Dq that collide with q in at least one hash table. Compute Jaccard similarity between q and each article in Dq . Please report the list of top-10 articles with highest Jaccard similarity in descending order for each query q (i.e., four lists in total). The article with the highest Jaccard similarity is ranked at 1. Each row of the list is of the form <news_id> <Jaccard_sim> <class_label> for one query q. [20 pts] CourseNana.COM

(b) Compute Jaccard similarity for query q and all articles in the dataset. Please report the list of top-10 articles with highest Jaccard similarity in descending order for each query q (i.e., four lists in total). [10 pts] CourseNana.COM

(c) Compare the query time in Part II(a) and Part II(b) per query in milliseconds and comment on their differences if any. [5 pts] CourseNana.COM

Part III: Search Quality Evaluation [25 pts] CourseNana.COM

(a) Investigate the impact of the hash size (k). Given l=10, for each value of hash size k compute the F1-score for each query q (?1? ) using the reported result from query q in Part II(a) CourseNana.COM

as search results and Part II(b) as ground-truth. Take the average of F1-score across all queries at k. Please report: CourseNana.COM

  1. the F1-score plot with a varying k=[2,4,8]. (Note: ?1 = 1 ?1 , |?| ?? ?

??
?1? = ?? + 1/2 (??+??) , where ?? (??/??) refers to the number of true positives (false CourseNana.COM

positives / false negatives) of the top-? similar articles in the ground-truth for the query q CourseNana.COM

(? is capped at 10) CourseNana.COM

  1. the average query time in milliseconds with a varying k=[2,4,8]. [20 pts]

(b) Explain what you have observed from Part III(a) and suggest how you would tune the number of hash size (k) in terms of higher F1-score and lesser query time, respectively? [5 pts] CourseNana.COM


CourseNana.COM

  CourseNana.COM

Get in Touch with Our Experts

WeChat (微信) WeChat (微信)
Whatsapp WhatsApp
COMPSCI 753代写,Algorithms for Massive Data代写,澳洲代写,Auckland代写,Python代写,Search Quality代写,F1-score代写,Minhash代写,COMPSCI 753代编,Algorithms for Massive Data代编,澳洲代编,Auckland代编,Python代编,Search Quality代编,F1-score代编,Minhash代编,COMPSCI 753代考,Algorithms for Massive Data代考,澳洲代考,Auckland代考,Python代考,Search Quality代考,F1-score代考,Minhash代考,COMPSCI 753help,Algorithms for Massive Datahelp,澳洲help,Aucklandhelp,Pythonhelp,Search Qualityhelp,F1-scorehelp,Minhashhelp,COMPSCI 753作业代写,Algorithms for Massive Data作业代写,澳洲作业代写,Auckland作业代写,Python作业代写,Search Quality作业代写,F1-score作业代写,Minhash作业代写,COMPSCI 753编程代写,Algorithms for Massive Data编程代写,澳洲编程代写,Auckland编程代写,Python编程代写,Search Quality编程代写,F1-score编程代写,Minhash编程代写,COMPSCI 753programming help,Algorithms for Massive Dataprogramming help,澳洲programming help,Aucklandprogramming help,Pythonprogramming help,Search Qualityprogramming help,F1-scoreprogramming help,Minhashprogramming help,COMPSCI 753assignment help,Algorithms for Massive Dataassignment help,澳洲assignment help,Aucklandassignment help,Pythonassignment help,Search Qualityassignment help,F1-scoreassignment help,Minhashassignment help,COMPSCI 753solution,Algorithms for Massive Datasolution,澳洲solution,Aucklandsolution,Pythonsolution,Search Qualitysolution,F1-scoresolution,Minhashsolution,