INFS7410 Information Retrieval and Web Search - Project Part 1 - Web Passage Ranking

Engage in a Conversation

INFS7410 Project - Part 1

Preamble

The due date for this assignment is 9 September 2022 16:00 Eastern Australia Standard Time. CourseNana.COM

This part of project is worth 20% of the overall mark for INFS7410 (part 1 + part 2 = 40%). A detailed marking sheet for this assignment is provided alongside this notebook. The project is to be completed individually. CourseNana.COM

We recommend that you make an early start on this assignment, and proceed by steps. There are a number of activities you may have already tackled, including setting up the pipeline, manipulating the queries, implement some retrieval functions, and performing evaluation and analysis. Most of the assignment relies on knowledge and code you should have already have experienced in the computer practicals, however there are some hidden challenges here and there that you may require some time to solve. CourseNana.COM

Aim

Project aim: The aim of this project is to implement a number of classical information retrieval methods, evaluate them and compare them in the context of a real use-case. CourseNana.COM

Project Part 1 aim

The aim of Part 1 is to: CourseNana.COM

Setup your infrastructure to index the collection and evaluate queries.
Implement common information retrieval baselines.
Tune your retrieval implementations to improve their effectiveness.
Implement rank fusion methods.

The Information Retrieval Task: Web Passage Ranking

In this project we will consider the problem of open-domain passage ranking in answer to web queries. In this context, users pose queries to the search engine and expect answers in the form of a ranked list of passages (maximum 1000 passages to be retrieved). CourseNana.COM

The provided queries are real queries submitted to the Microsoft Bing search engine. In the collection, there are approximately 8.8 million passages and the goal is to rank them based on their relevance to the queries. CourseNana.COM

What we provide you with:

Files from practical

A collection of 8.8 million text passages extracted from web pages (collection.tsv— provided in Week 1).
A query file that contains 43 queries for you to perform retrieval experiments (queries.tsv— provided in Week 2).
A qrel file that contains relevance judgements for you to tune your methods (qrels.txt— provided in Week 2).

Extra files for this project

A leaderboard system for you to evaluate how well your system performs.
A test query file that contains 54 queries for you to generate run files to submit to the leaderboard (test_queries.tsv).
This jupyter notebook, which you will include inside it your implementation and report.

Put this notebook and provided files under the same directory. CourseNana.COM

What you need to produce

You need to produce: CourseNana.COM

Correct implementations of the methods required by this project specifications.
An explanation of the retrieval methods used, including the formulas that represent the models you implemented and code that implements that formula, an explanation of the evaluation settings followed, and a discussion of the findings.

You are required to produce both of these within this jupyter notebook. CourseNana.COM

Required methods to implement

In Part 1 of the project you are required to implement the following retrieval methods. All implementations should be based on your own code. CourseNana.COM

BM25: Create your own implementation, do not use the Pynserini API's implementation. See the videos in Week 3 for background information.
Pseudo-relevance feedback using BM25 for query expansion: create your own implementation using the Pyserini API to extract index statistics. See Week 5 practical for background information.
IDF-r query reduction: create your own implementation using the Pyserini API to extract index statistics. See Week 5 practical for background information.
The rank fusion method Borda; you need to create your own implementation of this. See Week 4 practical for background information.
The rank fusion method CombSUM; you need to create your own implementation of this. See Week 4 practical for background information.
The rank fusion method CombMNZ; you need to create your own implementation of this. See Week 4 practical for background information.

For your BM25, query expansion and query reduction implementations, you are also required to tune the parameters of these methods. You must perform a parameter search over at least 5 sensibly chosen parameter values depending on the method (10 when the method has two parameters). CourseNana.COM

For the rank fusion methods, consider fusing the highest performing tuned run from each of the BM25, query expansion and query reduction implementations. CourseNana.COM

You should have already attempted many of these implementations above as part of the computer pracs exercises. CourseNana.COM

Required evaluation to perform

In Part 1 of the project you are required to perform the following evaluation: CourseNana.COM

For all methods, tune using queries.tsv and qrels.txt (i.e., use this data to tune any parameter of a retrieval model, e.g. $b$ and $k1$ for BM25, etc.) and submit your runs on the test_queries.tsv using the parameter values you selected from the queries.tsv to the learderboard system.
Report the results of every method on the queries.tsv (only for the run you selected the tuned paramters from) separately, into a table. Perform statistical significance analysis across the results of the methods and report them in the tables.
Produce a gain-loss plot that compares BM25 vs. Pseudo-relevance feedback query expansion using BM25; and plots that compare BM25 vs. each rank fusion method on the dev set.
Comment on trends and differences observed when comparing your findings. Is there a method that consistently outperform the others on the queries.tsv and the test_queries.tsv?
Provide insights of whether rank fusion works, or if it does not, e.g., with respect to runs to be considered in the fusion process, queries, etc.

In terms of evaluation measures, evaluate the retrieval methods with respect to nDCG at 10 (ndcg_cut_10). You should use this measure as the target measure for tuning. Also compute reciprocal rank at 1000 (recip_rank), MAP (map) and Recall at 1000 (recall_1000). CourseNana.COM

For all gain-loss plots, produce them with respect to nDCG at 10. CourseNana.COM

For all statistical significance analysis, use paired t-test; distinguish between p<0.05 and p<0.01. CourseNana.COM

How to submit

You will have to submit one file: CourseNana.COM

A zip file containing this notebook (.ipynb) and this notebook as a PDF document. The code should be able to be executed by us. Remeber to include all your discussion and analysis also in this notebook and not as a separate file.

It needs to be submitted via the relevant Turnitin link in the INFS7410 BlackBoard site, by 3 September 2021, 16:00 Eastern Australia Standard Time, unless you have been given an extension (according to UQ policy), before the due date of the assignment. CourseNana.COM

Leaderboard Challenge

As part of this project, we present you the leaderboard challenge, where you can submit your best runs and compare them with others. You can submit multiple times during the whole process of your project. CourseNana.COM

For models, you can use any retrieval or rerank or fusion methods (including those you implement for the pracs and for the assessment), even combine multiple models together, not limited to the models we have learned in this class. You are allowed to be creative and come up with your own retrieval or rerank models. CourseNana.COM

Although multiple run files can be submitted, we recommend you to run your models with the queries we provided in pracs first, only to check if your model is valid. Then you can run on the test queries for your project, and submit it to leaderboard. CourseNana.COM

The submission link is: https://infs7410.uqcloud.net/leaderboard/ CourseNana.COM

After opening the link, you will see the main page as below: CourseNana.COM

main CourseNana.COM

The submitted runs will appear here. You can see we have already submitted a sample run as pyserini_BM25. The name of the run is in the format of student name(runtag), so make sure you name your run correctly in the run file's 6th column. If you use the same run tag for all the runs you submit, the latter ones will overwrite the previous runs: you will only have one run at the end. So to submit different runs, make sure to use different run tags. CourseNana.COM

If you want to make a submission, type the password in the textbox, and hit enter. (The password is your student number) It will take you to your private submission page to submit your run, as below: CourseNana.COM

submit CourseNana.COM

In this page, choose the run file you wish to upload by clicking the Browse button, then click Submit to submit the run. CourseNana.COM

We highly encourage you to submit your best runs to the leaderboard, so you will know how your model performs compared to your classmates. CourseNana.COM

We will also give prizes to students who have the top 3 best runs. More specifically: CourseNana.COM

Leaderboard winners: Certificate of award to the students with the best 3 runs (winner, + 2 runner-up) in the leaderboard (test queries) by the time the assessment closes (end of semester). +2% final mark provided to the student whose run has the highest effectiveness (ndcg_cut_10), +1% final mark to the student whose run has the second highest effectiveness (ndcg_cut_10) in the leaderboard by the end of the semester. A/Prof. Guido can provide recommendation letters to students with top 3 runs. CourseNana.COM

This leaderboard will be running throughout the semester. CourseNana.COM

Have fun! CourseNana.COM

Initialise packages and functions

You will need to decide which index, stemming algorithm and keeping stopwords or not in the following cell. You may want to try out different indexes and if you don't have one, follow week1 prac to create one (remember to add -storeDocvectors). CourseNana.COM

In [ ]:

stemming = None    # None or 'poter' or anything else
stopwords = False  # False or True
index = 'indexes/lucene-index-msmarco-passage-xxxx/'

Run the following cell to load and cache some useful packages and statistics that you will use later. CourseNana.COM

In [ ]:

from pyserini.search import SimpleSearcher
from pyserini.analysis import Analyzer, get_lucene_analyzer
from pyserini.index import IndexReader
from tqdm import tqdm

lucene_analyzer = get_lucene_analyzer(stemming=stemming, stopwords=stopwords)
analyzer = Analyzer(lucene_analyzer)
searcher = SimpleSearcher(index)
searcher.set_analyzer(lucene_analyzer)
index_reader = IndexReader(index)


# Create document frequency dictionary to speed up scoring later, this will take around 2 min.
df_dict = {}
for term in tqdm(index_reader.terms(), desc="loading idf dictionary:"):
        df_dict[term.term] = term.df

# cache document length and docids for the collection, this will take around 2 mins.
doc_len_dict = {}
doc_id_dict = {}
with open ('collection/collection.tsv', 'r') as f:
    lines = f.readlines()
    for line in tqdm(lines, desc="loading doc_length dictionary:"):
        docid, text = line.split('\t')
        doc_len_dict[docid] = len(text.split())
        internal_id = index_reader.convert_collection_docid_to_internal_docid(docid)
        doc_id_dict[internal_id] = docid

Understand and run the following cell to define the search function. CourseNana.COM

NOTE: This search function is different from the search function in week3 prac. When you implement methods yourself, make sure to use this search function which we implemented with iterating posting lists, do not use the search function from Week 3 prac, as the Week 3 prac is only re-ranking BM25 results. CourseNana.COM

In [ ]:

def search(query: str, k: int=1000, scorer=None):
    """
    Inputs: 
        query (str): the query string to perform the search.
        k (int): the number of documents to be returned.
        scorer: your implemented scoring function, such as bm25.
    
    Output:
        results (list): the sorted result list, a list of tuples.
                        The first element in the tuples is the docid, 
                        the second is the doc score.
    """
    
    assert scorer is not None
    print("-----------------------------------------------------")
    print("Current query:", query)

    # get the analyzed term list
    q_terms = analyzer.analyze(query)
    doc_socres = {}
    for term in q_terms:
        # get the posting list for the current term
        postings_list = index_reader.get_postings_list(term, analyzer=None)
        if postings_list is not None:
            # get the document frequency of the current term
            df = df_dict[term]

            # iterate the posting list
            for posting in tqdm(postings_list, desc=f"Iterate posting for term '{term}'"):
                internal_id = posting.docid
                # convert pyserini internal docid to the actual docid
                docid = doc_id_dict[internal_id]
                tf = posting.tf
                doc_len = doc_len_dict[docid]

                # Call the scoring function (you will implement these below).
                score = scorer(tf, df, doc_len)
                if docid in doc_socres.keys():
                    doc_socres[docid] += score
                else:
                    doc_socres[docid] = score

    # Sort the results by the score.
    results = [(docid, doc_socre) for docid, doc_socre in doc_socres.items()]
    results = sorted(results, key=lambda x: x[1], reverse=True)[:k]
    return results
    print("-----------------------------------------------------")

After this line, feel free to edit this notebook however you like. You can use the following cells as a template to guide you in completing this project. CourseNana.COM

In [ ]:

# Import all your python libraries and put setup code here.

Double-click to edit this markdown cell and describe the first method you are going to implement, e.g., BM25 CourseNana.COM

In [ ]:

# Put your implementation of BM25 here, including parameter tuning.

When you have described and provided implementations for each method, include a table with statistical analysis here. CourseNana.COM

For convenience, you can use tools like this one to make it easier: https://www.tablesgenerator.com/markdown_tables, or if you are using pandas, you can convert dataframes to markdown https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_markdown.html CourseNana.COM

Then you can edit this cell to describe the gain-loss plots you will create below. CourseNana.COM

In [ ]:

# Put your implementations for the gain-loss plots here.

Then you can edit this cell to provide some analysis about if there is a method that consistently outperform the others on the dev set and the test set. CourseNana.COM

Then you can edit this cell to provide insights of whether rank fusion works, or if it does not. CourseNana.COM