INFS2044 Assignment 2 Case Study
In this assignment you will be developing a system for extracting information from text files. The system will process text files, compute statistics about each file, and produce output in several different formats. Statistics computed by the system include the length of the file in words and the most frequent words in each file and their frequency.
Use Cases
The system supports a single use case:
UC1 Compute Summary Statistics:
The user specifies the files to be processed using a command line application, specifies the number of most frequent words to be identified, and specifies one or more output file names. The application reads each file, computes the summary statistics, and writes the result to the given output file(s) in the appropriate output format(s).
Summary statistics to be generated in this use case:
· Number of words in each file
· Most frequent N words in each file
The output format for each output file is determined by its file extension. Extensions and output formats to be supported in this use case:
· .txt: Plain Text format.
· .csv: Comma-separated format.
The specification of these file formats is given in section “File Formats” later in this document.
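As a small illustration, selecting the output format from the file extension could look like the sketch below. The function name output_format_for and the format labels "text" and "csv" are hypothetical and not part of this specification; the comparison is case-sensitive, which is also an assumption.

```python
from pathlib import Path

# Hypothetical mapping from supported extension to a format label.
SUPPORTED_FORMATS = {".txt": "text", ".csv": "csv"}


def output_format_for(output_name: str) -> str:
    """Return the format label that matches the extension of output_name."""
    extension = Path(output_name).suffix
    if extension not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported output extension: {extension!r}")
    return SUPPORTED_FORMATS[extension]
```

Keeping the mapping in one place means that a new output format would only need one new entry, which is one possible way to prepare for the future variations listed below.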
Future variations:
· Additional output formats could be introduced.
· Additional summary statistics could be introduced.
· Filters for text processing could be introduced (such as different ways of identifying words in a file, ignoring selected words, etc.).
· The functions may eventually be offered via a REST API in addition to the console application.
These variations are not in scope for your implementation in this assignment, but your design must be able to accommodate these extensions.
Example Command
The following command would find the top 10 most frequent words in each of the files a.txt, b.txt, and c.txt, and output the results in two formats: CSV format in out1.csv and text format in out2.txt (type the command all on one line):
$ python word_statistics_app.py --number=10 --output=out1.csv --output=out2.txt a.txt b.txt c.txt
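One possible way to parse the options in this command is the argparse module from the Python standard library. The sketch below is illustrative only: the function name parse_arguments is hypothetical, and treating --number and --output as required options is an assumption that this document does not state.

```python
import argparse


def parse_arguments(argv: list[str]) -> argparse.Namespace:
    """Parse the command-line options shown in the example command."""
    parser = argparse.ArgumentParser(prog="word_statistics_app.py")
    # Number of most frequent words to report for each file.
    parser.add_argument("--number", type=int, required=True)
    # action="append" collects every --output occurrence into one list.
    parser.add_argument("--output", action="append", required=True)
    # One or more input files given as positional arguments.
    parser.add_argument("files", nargs="+")
    return parser.parse_args(argv)


options = parse_arguments(
    ["--number=10", "--output=out1.csv", "--output=out2.txt",
     "a.txt", "b.txt", "c.txt"]
)
print(options.number, options.output, options.files)
```

In the Console App, the argument list would normally come from sys.argv[1:] rather than from a literal list.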
If the number of frequent words specified on the command line exceeds the total number of unique words in a file, then output the actual number of unique words and their frequencies.
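A minimal sketch of this behaviour, using collections.Counter, is shown below. The function name most_frequent_words is hypothetical. The sketch applies the ordering rule given in the "File Formats" section, compares words exactly as they appear (case-sensitive, which is an assumption), and assumes that number is not negative.

```python
from collections import Counter


def most_frequent_words(words: list[str], number: int) -> list[tuple[str, int]]:
    """Return up to `number` (word, frequency) pairs, most frequent first."""
    counts = Counter(words)
    # Descending frequency first, then ascending alphabetical order for ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Slicing beyond the end of the list returns all unique words.
    return ranked[:number]


words = ["the"] * 10 + ["cat"] * 8 + ["a"] * 8 + ["apple"] * 4
print(most_frequent_words(words, 10))
# → [('the', 10), ('a', 8), ('cat', 8), ('apple', 4)]
```

Although 10 words are requested, only the four unique words are returned, because the slice cannot go past the end of the ranked list.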
File Formats
Text format (extension .txt):
Each line shows a short text containing the name of the file, the total length of the file in words, followed by the most frequent words and their frequencies in the file (in order of descending frequency; if there are multiple words with identical frequency, show them in ascending alphabetical order).
Example:
Suppose that file a.txt has 47 words in total (some of which may be occurrences of the same word), and that the most frequent words in that file are “the” (frequency 10), “cat” (frequency 8), “a” (frequency 8), and “apple” (frequency 4).
The corresponding line in the output file would be (all on one line):
File a.txt contains 47 words. Frequent words are: the (10), a (8), cat (8), apple (4).
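The example line above could be produced by a function along the following lines. This is a sketch only: the function name format_text_line and its parameters are hypothetical, and it expects the frequent words to be sorted already.

```python
def format_text_line(
    file_name: str, word_count: int, frequent_words: list[tuple[str, int]]
) -> str:
    """Build one line of the text format from already-sorted statistics."""
    listed = ", ".join(f"{word} ({count})" for word, count in frequent_words)
    return (
        f"File {file_name} contains {word_count} words. "
        f"Frequent words are: {listed}."
    )


line = format_text_line(
    "a.txt", 47, [("the", 10), ("a", 8), ("cat", 8), ("apple", 4)]
)
print(line)
```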
CSV format (extension .csv):
The information shown is the same as for the text format, except that the file name and statistics are delimited by commas.
For the above example, the row in the file would be:
a.txt,47,the,10,a,8,cat,8,apple,4.
You can assume that the file name does not contain commas or quotation marks.
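A CSV row of this shape could be written with the csv module from the Python standard library, as sketched below. The function name format_csv_row is hypothetical. The sketch treats the full stop at the end of the example row as sentence punctuation and not as part of the row; this is an assumption that you should confirm.

```python
import csv
import io


def format_csv_row(
    file_name: str, word_count: int, frequent_words: list[tuple[str, int]]
) -> str:
    """Build one CSV row: file name, word count, then word/frequency pairs."""
    fields: list[object] = [file_name, word_count]
    for word, count in frequent_words:
        fields.extend([word, count])
    buffer = io.StringIO()
    # lineterminator="\n" avoids the default "\r\n" row ending of csv.writer.
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue()


row = format_csv_row(
    "a.txt", 47, [("the", 10), ("a", 8), ("cat", 8), ("apple", 4)]
)
print(row, end="")
```

Because file names contain neither commas nor quotation marks, the file name field itself never needs quoting. Whether words that contain such characters should be quoted is not stated in this document.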
Input File Format
All input files are plain text files. Each file may contain one or more lines of text. Words are delimited by one or more whitespace characters (that is, space, tab, or newline characters).
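In Python, calling str.split() without arguments splits on runs of whitespace, which matches this rule closely. The sketch below uses a hypothetical function name tokenise. Note that str.split() also treats other whitespace characters (for example, carriage return and form feed) as delimiters, which goes slightly beyond the three characters named above.

```python
def tokenise(text: str) -> list[str]:
    """Split text into words delimited by one or more whitespace characters."""
    # With no separator argument, split() collapses consecutive whitespace
    # and ignores leading and trailing whitespace.
    return text.split()


print(tokenise("the  cat\tsat\non the mat\n"))
# → ['the', 'cat', 'sat', 'on', 'the', 'mat']
```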
Decomposition
You must use the following component decomposition as the basis for your implementation design:
The responsibilities of the elements are as follows:
Elements | Responsibilities |
Console App | Interacts with the user (acquires user options) |
Word Stats Manager | Orchestrates the use case process (reading, tokenising, summarising, formatting, outputting) |
Summarising Engine | Computes the summary statistics |
Tokenising Engine | Splits the input text into tokens (words) |
Formatting Engine | Generates output from summaries |
File Access | Interacts with the file system to read & write files |
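To illustrate how these responsibilities might be separated, the sketch below wires up one class per element, except the Console App. It is not a reference solution: all class, method, and attribute names are hypothetical, only the text format is handled, and InMemoryFileAccess is a stand-in that keeps file contents in a dictionary so that the example runs without touching the file system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Summary:
    file_name: str
    word_count: int
    frequent_words: list[tuple[str, int]]


class TokenisingEngine:
    def tokenise(self, text: str) -> list[str]:
        return text.split()


class SummarisingEngine:
    def summarise(self, file_name: str, words: list[str], number: int) -> Summary:
        # Descending frequency, then ascending alphabetical order for ties.
        ranked = sorted(Counter(words).items(), key=lambda item: (-item[1], item[0]))
        return Summary(file_name, len(words), ranked[:number])


class FormattingEngine:
    def format(self, summary: Summary) -> str:
        listed = ", ".join(f"{w} ({c})" for w, c in summary.frequent_words)
        return (f"File {summary.file_name} contains {summary.word_count} words. "
                f"Frequent words are: {listed}.\n")


class InMemoryFileAccess:
    """Stand-in for File Access that stores file contents in a dictionary."""

    def __init__(self, files: dict[str, str]) -> None:
        self.files = dict(files)

    def read(self, name: str) -> str:
        return self.files[name]

    def write(self, name: str, content: str) -> None:
        self.files[name] = content


class WordStatsManager:
    """Orchestrates reading, tokenising, summarising, formatting, outputting."""

    def __init__(self, file_access, tokeniser, summariser, formatter) -> None:
        self.file_access = file_access
        self.tokeniser = tokeniser
        self.summariser = summariser
        self.formatter = formatter

    def compute_statistics(self, input_names: list[str], number: int,
                           output_name: str) -> None:
        lines = []
        for name in input_names:
            words = self.tokeniser.tokenise(self.file_access.read(name))
            summary = self.summariser.summarise(name, words, number)
            lines.append(self.formatter.format(summary))
        self.file_access.write(output_name, "".join(lines))


file_access = InMemoryFileAccess({"a.txt": "b a b"})
manager = WordStatsManager(file_access, TokenisingEngine(),
                           SummarisingEngine(), FormattingEngine())
manager.compute_statistics(["a.txt"], 5, "out.txt")
print(file_access.files["out.txt"], end="")
```

Because the manager receives its collaborators through the constructor, each engine can be replaced or tested on its own, which is one possible way to respect the boundaries of the decomposition.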
Scope
Your implementation must respect the boundaries defined by the decomposition and include classes for each of the elements in this decomposition.
The implementation must:
• run on Python 3.10,
• correctly implement the functions described in this document,
• function correctly with any given plain text file (you can assume that the entire content of this file fits into main memory), and
• include a comprehensive unit test suite using pytest.
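As an indication of the style of test that pytest collects, the sketch below shows two test functions for a hypothetical helper count_words. pytest discovers functions whose names start with test_ and reports any failing assert statement; the helper and the test names are illustrative only.

```python
def count_words(text: str) -> int:
    """Hypothetical helper: number of whitespace-delimited words in text."""
    return len(text.split())


def test_count_words_ignores_repeated_whitespace():
    assert count_words("the  cat\tsat\n") == 3


def test_count_words_of_empty_text_is_zero():
    assert count_words("") == 0
```

A comprehensive suite would cover each element of the decomposition separately, including edge cases such as empty files and ties in word frequency.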
Focus your attention on the quality of your code.