
COMP6714 Information Retrieval and Web Search - Project: Westlaw alike queries

COMP6714 2022T3 Project: Westlaw alike queries

As presented in the lecture, Westlaw is a popular commercial information retrieval system. You can search for documents by Boolean Terms and Connector queries. For example:

STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

where STATUTE, ACTION, FEDERAL, TORT, CLAIM are the search terms and space, /S, /2, /3 are the connectors.

In this project, you are going to implement a retrieval system in Python3 called SimplyBoolean, which you have already encountered in the assignment. As a core requirement for this project, you must implement SimplyBoolean using a positional index.

SimplyBoolean is a retrieval system that supports Westlaw alike queries. It supports the following reduced set of connectors:

" ", space, +n, /n, +s, /s, &

as well as parentheses. Note that the connectors of your query will be processed in exactly the order above. Further details of these connectors can be found in the Quick Reference Guide of WestLaw available from WebCMS.
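As an illustration of the processing order above, the following is a minimal sketch of how a query string might be split into tokens and how each connector could be given a rank. The names split_query, connector_rank and OR_RANK are hypothetical, and the regular expression is only one possible reading of the query syntax. The space (OR) connector is implicit between two adjacent operands, so it never appears as a token of its own.

```python
import re

# A phrase in double quotes, a parenthesis or '&', a +n /n +s /s connector,
# or a plain search term. This pattern is an assumption, not part of the spec.
QUERY_TOKEN_RE = re.compile(r'"[^"]*"|[()&]|[+/](?:\d+|[sS])|[^\s()&"]+')

# Rank of the implicit space (OR) connector between two adjacent operands.
OR_RANK = 0


def split_query(query):
    """Split a query into phrases, terms, connectors and parentheses."""
    return QUERY_TOKEN_RE.findall(query)


def connector_rank(token):
    """Return the processing rank of a connector (smaller = earlier).

    The ranks follow the order: space, +n, /n, +s, /s, &.
    Returns None when the token is not an explicit connector.
    """
    token = token.lower()
    if token == "&":
        return 5
    match = re.fullmatch(r"([+/])(\d+|s)", token)
    if match is None:
        return None
    sign, argument = match.groups()
    if argument == "s":
        return 3 if sign == "+" else 4
    return 1 if sign == "+" else 2


print(split_query("STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM"))
```

A full parser would still need to handle parentheses and to insert the implicit OR between adjacent operands; this sketch only covers tokenising and ranking.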

Unlike Westlaw, SimplyBoolean supports only two forms of search term: a normal search term (i.e. a single word, without " ") and a phrase.

Term matching (including terms in a phrase) in SimplyBoolean follows the rules below:

Search in SimplyBoolean is case insensitive.
Full stops for abbreviations are ignored. e.g., U.S., US are the same.
Singular/Plural is ignored. e.g., cat, cats, cat's, cats' are all the same.
Tense is ignored. e.g., breaches, breach, breached, breaching are all the same.
A sentence can only end with a full stop, a question mark, or an exclamation mark. Apart from the above, all other punctuation should be treated as token dividers.
All (whole) numeric tokens such as years, decimals, integers are ignored. You should not index these tokens and hence should not consider them for proximity queries such as +n. E.g. you should not index '123' (wholly numeric) but should index 'abc123' (partially numeric).
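The rules above can be sketched as a small tokeniser that yields, for every indexed term, its position and its sentence number. This is a minimal sketch under several assumptions: the names tokenise, TOKEN_RE and ABBREVIATION_RE are hypothetical, the abbreviation test (single letters separated by full stops) is only a heuristic, and text such as "end.Next" with no space after the full stop would be merged into one token. Singular/plural and tense handling is deliberately left to a stem function supplied by the caller, which defaults to doing nothing.

```python
import re

# A word (optionally with internal full stops, e.g. U.S or 3.5), an optional
# possessive ending, then any sentence-ending punctuation that follows it.
TOKEN_RE = re.compile(r"([A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*)(?:'s|')?([.?!]*)")
# Heuristic: single letters separated by full stops, such as U.S
ABBREVIATION_RE = re.compile(r"(?:[A-Za-z]\.)+[A-Za-z]")


def tokenise(text, stem=lambda term: term):
    """Return a list of (term, position, sentence_no) triples.

    `stem` is a placeholder for whatever stemmer or lemmatiser is used to
    ignore singular/plural and tense; the default leaves terms unchanged.
    """
    tokens = []
    position = 0
    sentence = 0
    for match in TOKEN_RE.finditer(text):
        word, trailing = match.group(1), match.group(2)
        # Case-insensitive, and full stops inside the word are ignored.
        term = word.replace(".", "").lower()
        # Wholly numeric tokens are not indexed and take up no position.
        if not term.isdigit():
            tokens.append((stem(term), position, sentence))
            position += 1
        # The final full stop of 'U.S.' is treated as part of the abbreviation.
        if ABBREVIATION_RE.fullmatch(word) and trailing == ".":
            trailing = ""
        if trailing:
            sentence += 1
    return tokens


for triple in tokenise("The U.S. breached 2 contracts in 1999. Cats' toys, abc123!"):
    print(triple)
```

Storing the sentence number next to the position is one way to support +s and /s later without re-reading the documents.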

You are provided with approximately 1000 small documents (named with their document IDs), available in the folder ~cs6714/reuters/data on CSE machines. Your submitted project will be tested against a similar collection of up to 1000 documents (i.e., we may replace some of these documents to avoid any hard-coded solutions).

Your submission must include 2 main programs: index.py and search.py as described below.

The Indexer

$ python3 index.py [folder-of-documents] [folder-of-indexes]

where [folder-of-documents] is the path to the directory for the collection of documents to be indexed and [folder-of-indexes] is the path to the directory containing the index file(s) to be created. All the files in [folder-of-documents] should be opened as read-only, as you may not have the write permission for these files. If [folder-of-indexes] does not exist, create a new directory as specified. You may create multiple index files, although too many index files may slow down your performance. The total size of all your index files generated shall not exceed 20MB (which should be plenty for this project).
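The file handling described above might look like the following sketch. The function names read_documents and save_index, the single file name index.pkl, the use of pickle as the storage format, and the UTF-8 decoding with replacement of undecodable bytes are all assumptions made for illustration; the specification does not prescribe any of them.

```python
import os
import pickle

# Size limit for all index files, taken from the specification (20MB).
MAX_INDEX_BYTES = 20 * 1024 * 1024


def read_documents(folder):
    """Read every file in `folder` in read-only mode, keyed by file name."""
    documents = {}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            # Mode "r" never needs write permission on the file.
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                documents[name] = handle.read()
    return documents


def save_index(index, index_folder, filename="index.pkl"):
    """Write the index to `index_folder`, creating the folder if needed.

    Returns the size of the written file in bytes.
    """
    os.makedirs(index_folder, exist_ok=True)
    path = os.path.join(index_folder, filename)
    with open(path, "wb") as handle:
        pickle.dump(index, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return os.path.getsize(path)
```

Comparing the returned size with MAX_INDEX_BYTES is a cheap way to check the 20MB limit while developing.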

After the indexing is completed, it will output the total number of documents, the total number of tokens (after any preprocessing and filtering) to be indexed, and the total number of terms to be indexed. The following example illustrates the required input and output formats:

$ python3 index.py ~/Desktop/MyDataFolder ./MyTestIndex
Total number of documents: 672
Total number of tokens: 638321
Total number of terms: 13297

Note: the output of index.py ends with one newline ('\n') character.
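One possible in-memory layout for the positional index, together with the three statistics reported above, is sketched below. The names build_index and report are hypothetical, and the layout (term, then document ID, then a list of (position, sentence) pairs) is an assumption rather than a required design. The input is assumed to be already tokenised, filtered and normalised.

```python
from collections import defaultdict


def build_index(docs):
    """Build a positional index from pre-tokenised documents.

    `docs` maps a document ID to a list of (term, position, sentence) triples.
    Returns (index, total_tokens), where index maps
    term -> {doc_id: [(position, sentence), ...]}.
    """
    index = defaultdict(dict)
    total_tokens = 0
    for doc_id, tokens in docs.items():
        for term, position, sentence in tokens:
            index[term].setdefault(doc_id, []).append((position, sentence))
            total_tokens += 1
    return index, total_tokens


def report(num_docs, total_tokens, index):
    """Format the three summary lines; the number of terms is the vocabulary size."""
    return (
        f"Total number of documents: {num_docs}\n"
        f"Total number of tokens: {total_tokens}\n"
        f"Total number of terms: {len(index)}"
    )


docs = {
    "1": [("tort", 0, 0), ("claim", 1, 0)],
    "2": [("tort", 0, 0)],
}
index, total_tokens = build_index(docs)
# print() appends the single trailing newline.
print(report(len(docs), total_tokens, index))
```

Here the token count is the number of indexed occurrences and the term count is the number of distinct indexed terms.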

Searching

$ python3 search.py [folder-of-indexes]

where [folder-of-indexes] is the path to the directory containing the index file(s) that are generated by the indexer. After the above command is executed, it will accept a search query from the standard input and output the result to the standard output as a sequence of document names (the same as their document IDs), one per line and sorted in ascending order by their numeric values (e.g., 72 will be output before 125). It will then continue to accept search queries from the standard input and output the results to the standard output until the end of input (i.e., a Ctrl-D). The following example illustrates the required input and output formats:

$ python3 search.py ~/Proj/MyTestIndex
company inc & revenue
9
17
33
185
share +5 investor & US
3
67
271
365
499
625
$
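The read-evaluate-print behaviour shown in the example above could be structured as in this sketch. The name run_queries is hypothetical, and evaluate stands for whatever function turns a query string into a collection of matching document IDs; it is passed in as a parameter so that the loop itself stays independent of the query engine.

```python
import sys


def run_queries(evaluate, stream=None, out=None):
    """Read one query per line until end of input and print the matches.

    `evaluate` is a placeholder: it takes a query string and returns an
    iterable of document IDs (strings of digits).
    """
    stream = sys.stdin if stream is None else stream
    out = sys.stdout if out is None else out
    for line in stream:          # iteration stops at end of input (Ctrl-D)
        query = line.strip()
        if not query:
            continue
        # Sort by numeric value so that 72 is printed before 125.
        for doc_id in sorted(evaluate(query), key=int):
            print(doc_id, file=out)
```

Sorting with key=int matters because document names are strings: a plain string sort would place 125 before 72.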

Chaining Mixed Connectors

Example:

a b /s c

Explanation: As per the WestLaw guide, following the precedence rules for this example, OR (the space) has higher priority. Because OR is a boolean connector, we can re-write the query into an equivalent form below:

(a /s c) (b /s c)
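The equivalence above can be checked on a toy example. In this sketch, the result of every operand is assumed to be a dictionary mapping a document ID to a list of (position, sentence) postings; the function names union and same_sentence and the sample postings are hypothetical. The nested comparisons favour clarity over speed.

```python
def union(left, right):
    """OR (space): merge the postings of both operands, document by document."""
    merged = {}
    for postings in (left, right):
        for doc_id, hits in postings.items():
            merged[doc_id] = sorted(set(merged.get(doc_id, [])) | set(hits))
    return merged


def same_sentence(left, right):
    """/s: keep postings that share a sentence with a posting of the other operand."""
    result = {}
    for doc_id in left.keys() & right.keys():
        left_sentences = {sentence for _, sentence in left[doc_id]}
        right_sentences = {sentence for _, sentence in right[doc_id]}
        shared = left_sentences & right_sentences
        hits = [hit for hit in left[doc_id] + right[doc_id] if hit[1] in shared]
        if hits:
            result[doc_id] = sorted(set(hits))
    return result


# Toy postings: doc_id -> [(position, sentence), ...]
a = {"1": [(0, 0)]}
b = {"2": [(3, 1)]}
c = {"1": [(2, 0)], "2": [(9, 4)]}

direct = same_sentence(union(a, b), c)                       # a b /s c
rewritten = union(same_sentence(a, c), same_sentence(b, c))  # (a /s c) (b /s c)
print(sorted(direct, key=int), sorted(rewritten, key=int))
```

In this example both forms select only document 1, because in document 2 the terms b and c fall in different sentences.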

Chaining Non-boolean Connectors

Example:

a +n b /s c

Explanation: The connector precedence here lies with '+n' first, then '/s'. This query can be understood as the equivalent of doing (a +n b) first; then, only among the resulting documents (and, more importantly, their postings), we will output the documents for which (a /s c) (for the same posting 'a') or (b /s c) (for the same posting 'b') is true. To further explain in English, the query wants documents where either:

there is 'a' which precedes 'b' by at most 'n' terms, and that same occurrence of 'a' is in a sentence with 'c'
there is 'a' which precedes 'b' by at most 'n' terms, and that same occurrence of 'b' is in a sentence with 'c'

Another example:

a +n (b /s c)

Explanation: With the presence of parentheses, this query will have (b /s c) processed first. From the resulting document postings, we will output the documents for which (a +n b) (for the same posting 'b') or (a +n c) (for the same posting 'c') is true. To further explain in English, the query wants documents where either:

there is 'b' in a sentence with 'c', and that same occurrence of 'b' occurs at most 'n' terms after 'a'
there is 'b' in a sentence with 'c', and that same occurrence of 'c' occurs at most 'n' terms after 'a'
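The difference between a +n b /s c and a +n (b /s c) can be made concrete with a sketch in which every connector returns the postings that took part in a match, so that the next connector works on those same occurrences. The function names precedes_within and same_sentence, the postings layout (document ID mapped to a list of (position, sentence) pairs) and the sample data are assumptions for illustration. The nested loops are written for clarity; a merge over sorted postings would be faster.

```python
def precedes_within(left, right, n):
    """+n: keep pairs where a left posting precedes a right posting by at most n terms."""
    result = {}
    for doc_id in left.keys() & right.keys():
        kept = set()
        for left_hit in left[doc_id]:
            for right_hit in right[doc_id]:
                if 0 < right_hit[0] - left_hit[0] <= n:
                    kept.update((left_hit, right_hit))
        if kept:
            result[doc_id] = sorted(kept)
    return result


def same_sentence(left, right):
    """/s: keep postings that share a sentence with a posting of the other operand."""
    result = {}
    for doc_id in left.keys() & right.keys():
        shared = {s for _, s in left[doc_id]} & {s for _, s in right[doc_id]}
        hits = [hit for hit in left[doc_id] + right[doc_id] if hit[1] in shared]
        if hits:
            result[doc_id] = sorted(set(hits))
    return result


# Toy postings: doc_id -> [(position, sentence), ...]
a = {"1": [(0, 0)], "2": [(0, 0), (10, 3)], "3": [(4, 1)], "4": [(0, 0)]}
b = {"1": [(1, 0)], "2": [(1, 0)], "3": [(5, 1)], "4": [(2, 1)]}
c = {"1": [(2, 0)], "2": [(11, 3)], "3": [(6, 1)], "4": [(1, 0)]}

# a +2 b /s c: '+2' first, then '/s' on the postings that survived.
chained = same_sentence(precedes_within(a, b, 2), c)
# a +2 (b /s c): the parenthesised part first, then '+2' on its postings.
grouped = precedes_within(a, same_sentence(b, c), 2)

print(sorted(chained, key=int))
print(sorted(grouped, key=int))
```

In this data, document 2 is rejected by both queries: the only 'a' sharing a sentence with 'c' is the one at position 10, and that occurrence is not followed by 'b'. Document 4 matches only the chained form, because 'a' precedes 'b' and shares a sentence with 'c', while 'b' and 'c' never share a sentence.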

Marking
This assignment is worth 30 points.
Your submission will be tested and marked on CSE linux machines using Python3. Therefore, please make sure you have tested your solution on these machines using Python3 before you submit. You will not receive any marks if your program does not work on CSE linux machines and only works in other environments such as your own laptop.

Full marks will be awarded to submissions that follow this specification and pass all the test cases.

Although we do not measure the runtime speed, your indexing program will be terminated if it does not end after one minute, and you will receive zero marks for the project (since we cannot get the index generated successfully for further testing); and your search program will be terminated if it does not end after 10 seconds per search query, and you will receive zero marks for that search query.

There will be test cases for each connector in addition to the test cases for mixing several connectors. Therefore, if you are unable to implement all the required connectors, try your best to implement as many as you can.

Submission
The penalty for late submission of assignments will be 5% (of the worth of the assignment) subtracted from the raw mark per day of being late. In other words, earned marks will be lost. No assignments will be accepted later than 5 days after the deadline.

Use the give command below to submit the assignment:

give cs6714 proj *.py

Use classrun to check that you have submitted all the files required for index.py and search.py to run properly:

6714 classrun -check proj
