INFS7410 Project - Part 1
Preamble
This part of the project is worth 20% of the overall mark for INFS7410 (part 1 + part 2 = 40%). A detailed marking sheet for this assignment is provided alongside this notebook. The project is to be completed individually.
We recommend that you make an early start on this assignment and proceed in steps. There are a number of activities you may have already tackled, including setting up the pipeline, manipulating the queries, implementing some retrieval functions, and performing evaluation and analysis. Most of the assignment relies on knowledge and code you should already have encountered in the computer practicals; however, there are some hidden challenges here and there that may require some time to solve.
Aim
Project aim: The aim of this project is to implement a number of classical information retrieval methods, evaluate them and compare them in the context of a real use-case.
Project Part 1 aim
The aim of Part 1 is to:
- Set up your infrastructure to index the collection and evaluate queries.
- Implement common information retrieval baselines.
- Tune your retrieval implementations to improve their effectiveness.
- Implement rank fusion methods.
The Information Retrieval Task: Web Passage Ranking
In this project we will consider the problem of open-domain passage ranking in response to web queries. In this context, users pose queries to the search engine and expect answers in the form of a ranked list of passages (a maximum of 1000 passages to be retrieved).
The provided queries are real queries submitted to the Microsoft Bing search engine. The collection contains approximately 8.8 million passages, and the goal is to rank them based on their relevance to the queries.
What we provide you with:
Files from practical
- A collection of 8.8 million text passages extracted from web pages (`collection.tsv` — provided in Week 1).
Extra files for this project
- A query dev file that contains 30 queries for you to perform retrieval experiments (`data/dev_queries.tsv`).
- A query dev file that contains the same 30 queries (same query ids as the previous file), but with typos in the query text (`data/dev_typo_queries.tsv`).
- A qrel file that contains relevance judgements for you to tune your methods (`data/dev.qrels` — provided in Week 2).
- A leaderboard system for you to evaluate how well your system performs.
- A test query file that contains 60 queries for you to generate run files to submit to the leaderboard; some queries may contain typos (`data/test_queries.tsv`).
- This Jupyter notebook, in which you will include your implementation and report.
Put this notebook and the provided files under the same directory.
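The query files are TSV files (assumed here to be one `qid<TAB>query text` per line, as in the pracs) and the qrel file is assumed to follow the standard four-column TREC format (`qid iteration docid relevance`). A minimal loading sketch, under those assumptions and with illustrative helper names, is shown below; adapt it as you see fit.
def load_queries(path):
    """Load a TSV query file: each line is 'qid<TAB>query text' (assumed format)."""
    queries = {}
    with open(path, 'r') as f:
        for line in f:
            qid, text = line.rstrip('\n').split('\t')
            queries[qid] = text
    return queries

def load_qrels(path):
    """Load a TREC-style qrel file: 'qid iteration docid relevance' per line (assumed format)."""
    qrels = {}
    with open(path, 'r') as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

dev_queries = load_queries('data/dev_queries.tsv')
dev_typo_queries = load_queries('data/dev_typo_queries.tsv')
dev_qrels = load_qrels('data/dev.qrels')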
What you need to produce
You need to produce:
- Correct implementations of the methods required by this project specification.
- An explanation of the retrieval methods used, including the formulas that represent the models you implemented and the code that implements those formulas, an explanation of the evaluation settings you followed, and a discussion of the findings.
You are required to produce both of these within this Jupyter notebook.
Required methods to implement
In Part 1 of the project, you are required to implement the following retrieval methods. All implementations should be based on your own code.
- BM25: Create your own implementation; do not use the Pyserini API's implementation. See the videos in Week 3 for background information (a minimal scorer sketch is given after this list, for reference).
- Pseudo-relevance feedback using BM25 for query expansion: create your own implementation using the Pyserini API to extract index statistics. See Week 5 practical for background information.
- IDF-r query reduction: create your own implementation using the Pyserini API to extract index statistics. See Week 5 practical for background information.
- The rank fusion method Borda; you need to create your own implementation of this. See Week 4 practical for background information.
- The rank fusion method CombSUM; you need to create your own implementation of this. See Week 4 practical for background information.
- The rank fusion method CombMNZ; you need to create your own implementation of this. See Week 4 practical for background information.
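For reference only, a common BM25 formulation scores a document $d$ against a query $q$ as $\sum_{t \in q} \mathrm{idf}(t) \cdot \frac{tf_{t,d}\,(k_1+1)}{tf_{t,d} + k_1\,(1 - b + b\,|d|/\mathrm{avgdl})}$. The sketch below implements this variant with the `scorer(tf, df, doc_len)` signature expected by the `search` function provided later in this notebook; it assumes the cached `doc_len_dict` from the setup cells and an RSJ-style idf, both of which you should adapt to the exact formulation presented in the Week 3 videos.
import math

# Collection statistics, assumed available from the caching cell defined later in this notebook.
N = len(doc_len_dict)                       # number of passages
avg_dl = sum(doc_len_dict.values()) / N     # average passage length

def bm25_scorer(tf, df, doc_len, k1=0.9, b=0.4):
    # One common BM25 variant; adjust the idf and the defaults to match the Week 3 formulation.
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_dl))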
For your BM25, query expansion and query reduction implementations, you are also required to tune the parameters of these methods. You must perform a parameter search over at least 5 sensibly chosen parameter values, depending on the method (10 when the method has two parameters).
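As an illustration, the tuning can be organised as a simple grid search over the dev queries, selecting the setting with the highest mean nDCG@10. The sketch below uses the `bm25_scorer` sketch above and two hypothetical helpers, `run_all_queries` (run `search` with a given scorer over a query dict) and `evaluate_run` (mean nDCG@10 against `data/dev.qrels`), which you would implement yourself; the grid values are illustrative only.
from functools import partial

best_score, best_params = -1.0, None
for k1 in [0.6, 0.9, 1.2, 1.5, 1.8]:
    for b in [0.2, 0.4, 0.6, 0.75, 0.9]:
        scorer = partial(bm25_scorer, k1=k1, b=b)
        run = run_all_queries(dev_queries, scorer=scorer, k=1000)      # hypothetical helper
        ndcg10 = evaluate_run(run, 'data/dev.qrels', 'ndcg_cut_10')    # hypothetical helper
        if ndcg10 > best_score:
            best_score, best_params = ndcg10, (k1, b)
print("best (k1, b):", best_params, "with nDCG@10:", best_score)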
For the rank fusion methods, consider fusing the highest-performing tuned run from each of the BM25, query expansion and query reduction implementations.
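As a concrete illustration of the fusion step (not a prescribed implementation), the sketch below fuses the result lists returned by `search` with CombSUM after min-max normalising each run's scores; CombMNZ additionally multiplies each fused score by the number of runs containing the document, and Borda aggregates ranks instead of scores.
# Hedged CombSUM sketch: `runs` is a list of result lists for one query,
# each a list of (docid, score) tuples as returned by search().
def min_max_normalise(results):
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return {docid: 1.0 for docid, _ in results}
    return {docid: (score - lo) / (hi - lo) for docid, score in results}

def combsum(runs, k=1000):
    fused = {}
    for results in runs:
        for docid, score in min_max_normalise(results).items():
            fused[docid] = fused.get(docid, 0.0) + score
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]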
You should have already attempted many of these implementations as part of the computer prac exercises.
Required evaluation to perform
In this project, we consider two types of queries: one set contains typos (i.e. typographical mistakes, like writing `iformation` for `information`), and the other has the typos resolved. An important aspect of the evaluation in this project is to compare the retrieval behaviour of search methods on queries with and without typos.
In Part 1 of the project, you are required to perform the following evaluation:
- For all methods, tune the parameters using `data/dev_typo_queries.tsv` (this contains the queries with typos), `data/dev_queries.tsv` (this contains the same queries, but with the typos corrected), and `data/dev.qrels` (i.e., use this data to tune any parameter of a retrieval model, e.g. `b` and `k1` for BM25), and submit your runs on `data/test_queries.tsv`, using the parameter values you selected from the dev sets, to the leaderboard system.
- Report the results of every method on `data/dev_queries.tsv` and `data/dev_typo_queries.tsv` (only for the run from which you selected the tuned parameters) separately, in a table. Perform statistical significance analysis across the results of the methods and report it in the tables, i.e. compare the results obtained by method A on queries with typos against the results obtained by method B on the same set of queries.
- Produce a gain-loss plot that compares BM25 vs. pseudo-relevance feedback query expansion using BM25, and plots that compare BM25 vs. each rank fusion method, on the dev set.
- Comment on trends and differences observed when comparing your findings. Is there a method that consistently outperforms the others on `data/dev_queries.tsv` / `data/dev_typo_queries.tsv` and on `data/test_queries.tsv`?
- Provide insights into whether rank fusion works or not, e.g., with respect to the runs considered in the fusion process, the queries, etc.
- Provide insights about the results obtained on queries with typos, compared to those without typos: do the ranking methods still work, and why do they work or not work? In your discussion, you should consider the results of the statistical significance analysis performed between results obtained by the same method on queries with and without typos.
In terms of evaluation measures, evaluate the retrieval methods with respect to nDCG at 10 (`ndcg_cut_10`); you should use this measure as the target measure for tuning. Also compute reciprocal rank at 1000 (`recip_rank`), MAP (`map`) and Recall at 1000 (`recall_1000`).
Produce all gain-loss plots with respect to nDCG at 10.
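A gain-loss plot can be produced, for example, as a bar chart of the per-query difference in nDCG@10 between two runs. A minimal matplotlib sketch is given below; it assumes two dictionaries mapping query ids to nDCG@10 values (e.g. derived from the pytrec_eval output in the evaluation sketch that follows).
import matplotlib.pyplot as plt

def gain_loss_plot(ndcg_a, ndcg_b, label_a='BM25', label_b='BM25+PRF'):
    """Bar chart of per-query nDCG@10 differences (ndcg_b - ndcg_a).
    ndcg_a, ndcg_b: dicts mapping qid -> nDCG@10 for the two runs."""
    qids = sorted(ndcg_a.keys(), key=lambda q: ndcg_b.get(q, 0.0) - ndcg_a[q])
    diffs = [ndcg_b.get(q, 0.0) - ndcg_a[q] for q in qids]
    plt.figure(figsize=(10, 4))
    plt.bar(range(len(qids)), diffs)
    plt.axhline(0.0, color='black', linewidth=0.8)
    plt.xlabel('queries (sorted by difference)')
    plt.ylabel(f'nDCG@10 ({label_b} - {label_a})')
    plt.tight_layout()
    plt.show()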
For all statistical significance analyses, use the paired t-test; distinguish between p<0.05 and p<0.01.
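For the evaluation itself, one option (used in the pracs) is pytrec_eval together with scipy for the paired t-test. The sketch below assumes a run stored as a nested dict `{qid: {docid: score}}` and the dev qrels loaded as shown earlier; the `ndcg_cut` and `recall` measure families expand to the standard cutoffs, including `ndcg_cut_10` and `recall_1000`.
import pytrec_eval
from scipy import stats

def evaluate(run, qrels):
    """Per-query measure values for a run ({qid: {docid: score}}) against qrels ({qid: {docid: rel}})."""
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'ndcg_cut', 'recip_rank', 'map', 'recall'})
    return evaluator.evaluate(run)   # {qid: {measure: value}}

def paired_ttest(per_query_a, per_query_b, measure='ndcg_cut_10'):
    """Paired t-test on a per-query measure between two methods; compare the p-value against 0.05 and 0.01."""
    qids = sorted(set(per_query_a) & set(per_query_b))
    a = [per_query_a[q][measure] for q in qids]
    b = [per_query_b[q][measure] for q in qids]
    return stats.ttest_rel(a, b)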
How to submit
You will have to submit one file:
- A zip file containing this notebook (`.ipynb`) and this notebook as a PDF document. The code should be able to be executed by us. Remember to include all your discussion and analysis in this notebook and not as a separate file.
Leaderboard Challenge
As part of this project, we present the leaderboard challenge, where you can submit your best runs and compare them with those of others. You can submit multiple times throughout your project.
For models, you can use any retrieval, re-ranking or fusion methods (including those you implement for the pracs and for the assessment), and you can even combine multiple models together; you are not limited to the models we have covered in this class. You are allowed to be creative and come up with your own retrieval or re-ranking models.
Although multiple run files can be submitted, we recommend that you first run your models on the queries provided in the pracs, just to check that your model is valid. Then you can run on the test queries for your project and submit the run to the leaderboard.
Have fun!
Initialise packages and functions
In the following cell, you will need to decide which index to use, which stemming algorithm to apply, and whether or not to keep stopwords. You may want to try out different indexes; if you don't have one, follow the Week 1 prac to create one (remember to add `-storeDocvectors`).
stemming = None # None or 'porter' or anything else
stopwords = False # False or True
index = 'indexes/lucene-index-msmarco-passage-xxxx/'
Run the following cell to load and cache some useful packages and statistics that you will use later.
from pyserini.search import SimpleSearcher
from pyserini.analysis import Analyzer, get_lucene_analyzer
from pyserini.index import IndexReader
from tqdm import tqdm

# The cells below rely on an IndexReader and an Analyzer built from the settings
# chosen above; the analyzer must match how the index was built.
# NOTE: the exact get_lucene_analyzer arguments below are an assumption -- adapt
# them to the Pyserini version used in the pracs.
index_reader = IndexReader(index)
analyzer = Analyzer(get_lucene_analyzer(stemming=stemming is not None,
                                        stemmer=stemming if stemming else 'porter',
                                        stopwords=stopwords))

# Create a document frequency dictionary to speed up scoring later; this will take around 2 min.
df_dict = {}
for term in tqdm(index_reader.terms(), desc="loading idf dictionary:"):
    df_dict[term.term] = term.df

# Cache document lengths and docids for the collection; this will take around 2 min.
doc_len_dict = {}
doc_id_dict = {}
with open('collection/collection.tsv', 'r') as f:
    lines = f.readlines()
    for line in tqdm(lines, desc="loading doc_length dictionary:"):
        docid, text = line.split('\t')
        doc_len_dict[docid] = len(text.split())
        internal_id = index_reader.convert_collection_docid_to_internal_docid(docid)
        doc_id_dict[internal_id] = docid
Understand and run the following cell to define the search function. CourseNana.COM
NOTE: This search function is different from the search function in the Week 3 prac. When you implement methods yourself, make sure to use this search function, which we implemented by iterating over posting lists; do not use the search function from the Week 3 prac, as that one only re-ranks BM25 results.
def search(query: str, k: int = 1000, scorer=None):
    """
    Inputs:
        query (str): the query string to perform the search.
        k (int): the number of documents to be returned.
        scorer: your implemented scoring function, such as bm25.
    Output:
        results (list): the sorted result list, a list of tuples.
                        The first element in each tuple is the docid,
                        the second is the doc score.
    """
    assert scorer is not None
    print("-----------------------------------------------------")
    print("Current query:", query)
    # get the analyzed term list
    q_terms = analyzer.analyze(query)
    doc_scores = {}
    for term in q_terms:
        # get the posting list for the current term
        postings_list = index_reader.get_postings_list(term, analyzer=None)
        if postings_list is not None:
            # get the document frequency of the current term
            df = df_dict[term]
            # iterate the posting list
            for posting in tqdm(postings_list, desc=f"Iterate posting for term '{term}'"):
                # resolve the collection docid, term frequency and document length
                # for this posting, using the dictionaries cached above
                docid = doc_id_dict[posting.docid]
                tf = posting.tf
                doc_len = doc_len_dict[docid]
                # call the scoring function (you will implement these below)
                score = scorer(tf, df, doc_len)
                if docid in doc_scores:
                    doc_scores[docid] += score
                else:
                    doc_scores[docid] = score
    # sort the results by score and keep the top k
    results = [(docid, doc_score) for docid, doc_score in doc_scores.items()]
    results = sorted(results, key=lambda x: x[1], reverse=True)[:k]
    print("-----------------------------------------------------")
    return results
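For reference, a minimal usage sketch is given below. The scorer `my_scorer`, the query id `1`, the output path `runs/example.run` and the run tag `my_run` are all placeholders; the output follows the six-column TREC run format used in the pracs.
# Minimal usage sketch (placeholder scorer: term frequency only; replace with your own).
def my_scorer(tf, df, doc_len):
    return tf

results = search("what is information retrieval", k=1000, scorer=my_scorer)

# Write the ranking to a run file in the six-column TREC format used in the pracs.
with open('runs/example.run', 'w') as fout:
    for rank, (docid, score) in enumerate(results, start=1):
        fout.write(f"1 Q0 {docid} {rank} {score} my_run\n")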
After this line, feel free to edit this notebook however you like. You can use the following cells as a template to guide you in completing this project.
# Import all your python libraries and put setup code here.
Double-click to edit this markdown cell and describe the first method you are going to implement, e.g., BM25.
# Put your implementation of BM25 here, including parameter tuning.
When you have described and provided implementations for each method, include a table with statistical analysis here.
For convenience, you can use tools like https://www.tablesgenerator.com/markdown_tables to make this easier, or, if you are using pandas, you can convert dataframes to markdown: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_markdown.html
Then you can edit this cell to describe the gain-loss plots you will create below.
# Put your implementations for the gain-loss plots here.
Then you can edit this cell to provide some analysis of whether there is a method that consistently outperforms the others on the dev set and the test set.
Then you can edit this cell to provide insights into whether rank fusion works or not.