INFS7410 Project - Part 1
Preamble
This part of the project is worth 20% of the overall mark for INFS7410 (part 1 + part 2 = 40%). A detailed marking sheet for this assignment is provided alongside this notebook. The project is to be completed individually.
We recommend that you make an early start on this assignment and proceed in steps. There are a number of activities you may have already tackled, including setting up the pipeline, manipulating the queries, implementing some retrieval functions, and performing evaluation and analysis. Most of the assignment relies on knowledge and code you should already have encountered in the computer practicals; however, there are some hidden challenges here and there that may require some time to solve.
Aim
Project aim: The aim of this project is to implement a number of classical information retrieval methods, evaluate them and compare them in the context of a real use-case.
Project Part 1 aim
The aim of Part 1 is to:
- Set up your infrastructure to index the collection and evaluate queries.
- Implement common information retrieval baselines.
- Tune your retrieval implementations to improve their effectiveness.
- Implement rank fusion methods.
The Information Retrieval Task: Web Passage Ranking
In this project we will consider the problem of open-domain passage ranking in response to web queries. In this context, users pose queries to the search engine and expect answers in the form of a ranked list of passages (a maximum of 1000 passages to be retrieved).
The provided queries are real queries submitted to the Microsoft Bing search engine. The collection contains approximately 8.8 million passages, and the goal is to rank them based on their relevance to the queries.
What we provide you with:
Files from practical
- A collection of 8.8 million text passages extracted from web pages (`collection.tsv` — provided in Week 1).
Extra files for this project
- A query dev file that contains 30 queries for you to perform retrieval experiments (`data/dev_queries.tsv`).
- A query dev file that contains the same 30 queries (same query ids as the previous file), but with typos in the query text (`data/dev_typo_queries.tsv`).
- A qrel file that contains relevance judgements for you to tune your methods (`data/dev.qrels` — provided in Week 2).
- A leaderboard system for you to evaluate how well your system performs.
- A test query file that contains 60 queries for you to generate run files to submit to the leaderboard; some queries may contain typos (`data/test_queries.tsv`).
- This Jupyter notebook, in which you will include your implementation and report.
Put this notebook and the provided files under the same directory.
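The query files are TSV files (assumed here to be one `qid<TAB>query text` per line, as in the pracs) and the qrel file is assumed to follow the standard four-column TREC format (`qid iteration docid relevance`). A minimal loading sketch, under those assumptions and with illustrative helper names, is shown below; adapt it as you see fit.
def load_queries(path):
    """Load a TSV query file: each line is 'qid<TAB>query text' (assumed format)."""
    queries = {}
    with open(path, 'r') as f:
        for line in f:
            qid, text = line.rstrip('\n').split('\t')
            queries[qid] = text
    return queries

def load_qrels(path):
    """Load a TREC-style qrel file: 'qid iteration docid relevance' per line (assumed format)."""
    qrels = {}
    with open(path, 'r') as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

dev_queries = load_queries('data/dev_queries.tsv')
dev_typo_queries = load_queries('data/dev_typo_queries.tsv')
dev_qrels = load_qrels('data/dev.qrels')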
What you need to produce
You need to produce:
- Correct implementations of the methods required by this project specification.
- An explanation of the retrieval methods used, including the formulas that represent the models you implemented and the code that implements those formulas, an explanation of the evaluation settings you followed, and a discussion of the findings.
You are required to produce both of these within this Jupyter notebook.
Required methods to implement
In Part 1 of the project, you are required to implement the following retrieval methods. All implementations should be based on your own code.
- BM25: Create your own implementation; do not use the Pyserini API's implementation. See the videos in Week 3 for background information (a minimal scorer sketch is given after this list, for reference).
- Pseudo-relevance feedback using BM25 for query expansion: create your own implementation using the Pyserini API to extract index statistics. See Week 5 practical for background information.
- IDF-r query reduction: create your own implementation using the Pyserini API to extract index statistics. See Week 5 practical for background information.
- The rank fusion method Borda; you need to create your own implementation of this. See Week 4 practical for background information.
- The rank fusion method CombSUM; you need to create your own implementation of this. See Week 4 practical for background information.
- The rank fusion method CombMNZ; you need to create your own implementation of this. See Week 4 practical for background information.
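For reference only, a common BM25 formulation scores a document $d$ against a query $q$ as $\sum_{t \in q} \mathrm{idf}(t) \cdot \frac{tf_{t,d}\,(k_1+1)}{tf_{t,d} + k_1\,(1 - b + b\,|d|/\mathrm{avgdl})}$. The sketch below implements this variant with the `scorer(tf, df, doc_len)` signature expected by the `search` function provided later in this notebook; it assumes the cached `doc_len_dict` from the setup cells and an RSJ-style idf, both of which you should adapt to the exact formulation presented in the Week 3 videos.
import math

# Collection statistics, assumed available from the caching cell defined later in this notebook.
N = len(doc_len_dict)                       # number of passages
avg_dl = sum(doc_len_dict.values()) / N     # average passage length

def bm25_scorer(tf, df, doc_len, k1=0.9, b=0.4):
    # One common BM25 variant; adjust the idf and the defaults to match the Week 3 formulation.
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_dl))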
For your BM25, query expansion and query reduction implementations, you are also required to tune the parameters of these methods. You must perform a parameter search over at least 5 sensibly chosen parameter values, depending on the method (10 when the method has two parameters).
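As an illustration, the tuning can be organised as a simple grid search over the dev queries, selecting the setting with the highest mean nDCG@10. The sketch below uses the `bm25_scorer` sketch above and two hypothetical helpers, `run_all_queries` (run `search` with a given scorer over a query dict) and `evaluate_run` (mean nDCG@10 against `data/dev.qrels`), which you would implement yourself; the grid values are illustrative only.
from functools import partial

best_score, best_params = -1.0, None
for k1 in [0.6, 0.9, 1.2, 1.5, 1.8]:
    for b in [0.2, 0.4, 0.6, 0.75, 0.9]:
        scorer = partial(bm25_scorer, k1=k1, b=b)
        run = run_all_queries(dev_queries, scorer=scorer, k=1000)      # hypothetical helper
        ndcg10 = evaluate_run(run, 'data/dev.qrels', 'ndcg_cut_10')    # hypothetical helper
        if ndcg10 > best_score:
            best_score, best_params = ndcg10, (k1, b)
print("best (k1, b):", best_params, "with nDCG@10:", best_score)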
For the rank fusion methods, consider fusing the highest-performing tuned run from each of the BM25, query expansion and query reduction implementations.
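As a concrete illustration of the fusion step (not a prescribed implementation), the sketch below fuses the result lists returned by `search` with CombSUM after min-max normalising each run's scores; CombMNZ additionally multiplies each fused score by the number of runs containing the document, and Borda aggregates ranks instead of scores.
# Hedged CombSUM sketch: `runs` is a list of result lists for one query,
# each a list of (docid, score) tuples as returned by search().
def min_max_normalise(results):
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return {docid: 1.0 for docid, _ in results}
    return {docid: (score - lo) / (hi - lo) for docid, score in results}

def combsum(runs, k=1000):
    fused = {}
    for results in runs:
        for docid, score in min_max_normalise(results).items():
            fused[docid] = fused.get(docid, 0.0) + score
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]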
You should have already attempted many of these implementations as part of the computer prac exercises.
Required evaluation to perform
In this project, we consider two types of queries: one set contains typos (i.e. typographical mistakes, like writing `iformation` for `information`), and the other has the typos resolved. An important aspect of the evaluation in this project is to compare the retrieval behaviour of search methods on queries with and without typos.
In Part 1 of the project, you are required to perform the following evaluation:
- For all methods, tune the parameters using `data/dev_typo_queries.tsv` (this contains the queries with typos), `data/dev_queries.tsv` (this contains the same queries, but with the typos corrected), and `data/dev.qrels` (i.e., use this data to tune any parameter of a retrieval model, e.g. `b` and `k1` for BM25), and submit your runs on `data/test_queries.tsv`, using the parameter values you selected from the dev sets, to the leaderboard system.
- Report the results of every method on `data/dev_queries.tsv` and `data/dev_typo_queries.tsv` (only for the run from which you selected the tuned parameters) separately, in a table. Perform statistical significance analysis across the results of the methods and report it in the tables, i.e. compare the results obtained by method A on queries with typos against the results obtained by method B on the same set of queries.
- Produce a gain-loss plot that compares BM25 vs. pseudo-relevance feedback query expansion using BM25, and plots that compare BM25 vs. each rank fusion method, on the dev set.
- Comment on trends and differences observed when comparing your findings. Is there a method that consistently outperforms the others on `data/dev_queries.tsv` / `data/dev_typo_queries.tsv` and on `data/test_queries.tsv`?
- Provide insights into whether rank fusion works or not, e.g., with respect to the runs considered in the fusion process, the queries, etc.
- Provide insights about the results obtained on queries with typos, compared to those without typos: do the ranking methods still work, and why do they work or not work? In your discussion, you should consider the results of the statistical significance analysis performed between results obtained by the same method on queries with and without typos.
In terms of evaluation measures, evaluate the retrieval methods with respect to nDCG at 10 (`ndcg_cut_10`); you should use this measure as the target measure for tuning. Also compute reciprocal rank at 1000 (`recip_rank`), MAP (`map`) and Recall at 1000 (`recall_1000`).
Produce all gain-loss plots with respect to nDCG at 10.
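A gain-loss plot can be produced, for example, as a bar chart of the per-query difference in nDCG@10 between two runs. A minimal matplotlib sketch is given below; it assumes two dictionaries mapping query ids to nDCG@10 values (e.g. derived from the pytrec_eval output in the evaluation sketch that follows).
import matplotlib.pyplot as plt

def gain_loss_plot(ndcg_a, ndcg_b, label_a='BM25', label_b='BM25+PRF'):
    """Bar chart of per-query nDCG@10 differences (ndcg_b - ndcg_a).
    ndcg_a, ndcg_b: dicts mapping qid -> nDCG@10 for the two runs."""
    qids = sorted(ndcg_a.keys(), key=lambda q: ndcg_b.get(q, 0.0) - ndcg_a[q])
    diffs = [ndcg_b.get(q, 0.0) - ndcg_a[q] for q in qids]
    plt.figure(figsize=(10, 4))
    plt.bar(range(len(qids)), diffs)
    plt.axhline(0.0, color='black', linewidth=0.8)
    plt.xlabel('queries (sorted by difference)')
    plt.ylabel(f'nDCG@10 ({label_b} - {label_a})')
    plt.tight_layout()
    plt.show()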
For all statistical significance analyses, use the paired t-test; distinguish between p<0.05 and p<0.01.
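For the evaluation itself, one option (used in the pracs) is pytrec_eval together with scipy for the paired t-test. The sketch below assumes a run stored as a nested dict `{qid: {docid: score}}` and the dev qrels loaded as shown earlier; the `ndcg_cut` and `recall` measure families expand to the standard cutoffs, including `ndcg_cut_10` and `recall_1000`.
import pytrec_eval
from scipy import stats

def evaluate(run, qrels):
    """Per-query measure values for a run ({qid: {docid: score}}) against qrels ({qid: {docid: rel}})."""
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'ndcg_cut', 'recip_rank', 'map', 'recall'})
    return evaluator.evaluate(run)   # {qid: {measure: value}}

def paired_ttest(per_query_a, per_query_b, measure='ndcg_cut_10'):
    """Paired t-test on a per-query measure between two methods; compare the p-value against 0.05 and 0.01."""
    qids = sorted(set(per_query_a) & set(per_query_b))
    a = [per_query_a[q][measure] for q in qids]
    b = [per_query_b[q][measure] for q in qids]
    return stats.ttest_rel(a, b)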
How to submit
You will have to submit one file:
- A zip file containing this notebook (`.ipynb`) and this notebook as a PDF document. The code should be able to be executed by us. Remember to include all your discussion and analysis in this notebook and not as a separate file.
Leaderboard Challenge
As part of this project, we present the leaderboard challenge, where you can submit your best runs and compare them with those of others. You can submit multiple times throughout your project.
For models, you can use any retrieval, re-ranking or fusion methods (including those you implement for the pracs and for the assessment), and you can even combine multiple models together; you are not limited to the models we have covered in this class. You are allowed to be creative and come up with your own retrieval or re-ranking models.
Although multiple run files can be submitted, we recommend that you first run your models on the queries provided in the pracs, just to check that your model is valid. Then you can run on the test queries for your project and submit the run to the leaderboard.
Have fun!
Initialise packages and functions
In the following cell, you will need to decide which index to use, which stemming algorithm to apply, and whether or not to keep stopwords. You may want to try out different indexes; if you don't have one, follow the Week 1 prac to create one (remember to add `-storeDocvectors`).
stemming = None # None or 'porter' or anything else
stopwords = False # False or True
index = 'indexes/lucene-index-msmarco-passage-xxxx/'
Run the following cell to load and cache some useful packages and statistics that you will use later.
from pyserini.search import SimpleSearcher
from pyserini.analysis import Analyzer, get_lucene_analyzer
from pyserini.index import IndexReader
from tqdm import tqdm

# The cells below rely on an IndexReader and an Analyzer built from the settings
# chosen above; the analyzer must match how the index was built.
# NOTE: the exact get_lucene_analyzer arguments below are an assumption -- adapt
# them to the Pyserini version used in the pracs.
index_reader = IndexReader(index)
analyzer = Analyzer(get_lucene_analyzer(stemming=stemming is not None,
                                        stemmer=stemming if stemming else 'porter',
                                        stopwords=stopwords))

# Create a document frequency dictionary to speed up scoring later; this will take around 2 min.
df_dict = {}
for term in tqdm(index_reader.terms(), desc="loading idf dictionary:"):
    df_dict[term.term] = term.df

# Cache document lengths and docids for the collection; this will take around 2 min.
doc_len_dict = {}
doc_id_dict = {}
with open('collection/collection.tsv', 'r') as f:
    lines = f.readlines()
    for line in tqdm(lines, desc="loading doc_length dictionary:"):
        docid, text = line.split('\t')
        doc_len_dict[docid] = len(text.split())
        internal_id = index_reader.convert_collection_docid_to_internal_docid(docid)
        doc_id_dict[internal_id] = docid
Understand and run the following cell to define the search function. CourseNana.COM
NOTE: This search function is different from the search function in the Week 3 prac. When you implement methods yourself, make sure to use this search function, which we implemented by iterating over posting lists; do not use the search function from the Week 3 prac, as that one only re-ranks BM25 results.
def search(query: str, k: int = 1000, scorer=None):
    """
    Inputs:
        query (str): the query string to perform the search.
        k (int): the number of documents to be returned.
        scorer: your implemented scoring function, such as bm25.
    Output:
        results (list): the sorted result list, a list of tuples.
                        The first element in each tuple is the docid,
                        the second is the doc score.
    """
    assert scorer is not None
    print("-----------------------------------------------------")
    print("Current query:", query)
    # get the analyzed term list
    q_terms = analyzer.analyze(query)
    doc_scores = {}
    for term in q_terms:
        # get the posting list for the current term
        postings_list = index_reader.get_postings_list(term, analyzer=None)
        if postings_list is not None:
            # get the document frequency of the current term
            df = df_dict[term]
            # iterate the posting list
            for posting in tqdm(postings_list, desc=f"Iterate posting for term '{term}'"):
                # resolve the collection docid, term frequency and document length
                # for this posting, using the dictionaries cached above
                docid = doc_id_dict[posting.docid]
                tf = posting.tf
                doc_len = doc_len_dict[docid]
                # call the scoring function (you will implement these below)
                score = scorer(tf, df, doc_len)
                if docid in doc_scores:
                    doc_scores[docid] += score
                else:
                    doc_scores[docid] = score
    # sort the results by score and keep the top k
    results = [(docid, doc_score) for docid, doc_score in doc_scores.items()]
    results = sorted(results, key=lambda x: x[1], reverse=True)[:k]
    print("-----------------------------------------------------")
    return results
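For reference, a minimal usage sketch is given below. The scorer `my_scorer`, the query id `1`, the output path `runs/example.run` and the run tag `my_run` are all placeholders; the output follows the six-column TREC run format used in the pracs.
# Minimal usage sketch (placeholder scorer: term frequency only; replace with your own).
def my_scorer(tf, df, doc_len):
    return tf

results = search("what is information retrieval", k=1000, scorer=my_scorer)

# Write the ranking to a run file in the six-column TREC format used in the pracs.
with open('runs/example.run', 'w') as fout:
    for rank, (docid, score) in enumerate(results, start=1):
        fout.write(f"1 Q0 {docid} {rank} {score} my_run\n")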
After this line, feel free to edit this notebook however you like. You can use the following cells as a template to guide you in completing this project.
# Import all your python libraries and put setup code here.
Double-click to edit this markdown cell and describe the first method you are going to implement, e.g., BM25.
# Put your implementation of BM25 here, including parameter tuning.
When you have described and provided implementations for each method, include a table with statistical analysis here.
For convenience, you can use tools like https://www.tablesgenerator.com/markdown_tables to make this easier, or, if you are using pandas, you can convert dataframes to markdown: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_markdown.html
Then you can edit this cell to describe the gain-loss plots you will create below.
# Put your implementations for the gain-loss plots here.
Then you can edit this cell to provide some analysis of whether there is a method that consistently outperforms the others on the dev set and the test set.
Then you can edit this cell to provide insights into whether rank fusion works or not.