FIT5196-S1-2024 assessment 1
This is a group assessment and worth 35% of your total mark for FIT5196.
Text documents, such as crawled web data, are usually composed of topically coherent segments. Within each segment, the word usage shows a more consistent lexical distribution than across the dataset as a whole. A linear partition of texts into topic segments can be used for text analysis tasks such as passage retrieval in information retrieval (IR), document summarisation, recommender systems, and learning-to-rank methods.
Task 1: Parsing Raw Text Files (17/35)
This assessment touches on the very first step of analysing textual data, i.e., extracting data from semi-structured text files.
Allowed libraries: re, json, pandas
Your group is provided with a unique dataset. Please use the data file matching your group number, i.e., Group<group_number>.txt, in the Google Drive folder (student_data). Note: using the wrong input dataset will result in ZERO marks for ‘Output’ in the marking rubric.
Your dataset contains a subset of records on trademark assignments. Each trademark assignment is recorded with a set of attributes, e.g., reel no, frame no, assignors, assignees, etc. Please check the sample input file (sample_input_task1.txt) for all the available attributes.
Input Files | Output Files (submission)
Group<group_number>.txt | task1_<group_number>.json, task1_<group_number>.ipynb, task1_<group_number>.py

Your task is to extract the data and transform it into the JSON format with the following elements:
● rf-id: a unique numerical ID for a trademark assignment entry. It is a combination of reel-no and frame-no.
● last-update-date: a date identifying when the assignment record was last modified (output format YYYY-MM-DD in digits).
● conveyance-text: contains a textual description of the interest conveyed or transaction recorded.
● correspondent-party: the name of the person or organisation to whom correspondence related to the assignment is addressed. You do not need to reformat the name for this entity.
● assignors-info: a root element with one or more assignors, each containing the fields:
  ○ party-name: identifies the party (person or organisation name) conveying an interest or transaction. If it is a person and has a title, remove the title.
  ○ date-acknowledged: date on which the supporting legal documentation was acknowledged (output format YYYY-MM-DD in digits).
  ○ execution-date: date on which the supporting legal documentation was executed (output format YYYY-MM-DD in digits).
  ○ country: the party’s country location. Put “USA” as the value if the country name is a variant of USA, and “UK” if the country name is a variant of the UK. When there is no explicit country name, you can infer it from the state; when there is no explicit state, you can assume the nationality is the party’s country location. When the country is indecisive or you get a value of NOT PROVIDED, place “NA” as the value (see the sketch after this list).
  ○ legal-entity-text: a textual description describing the party’s legal entity status.
● assignees-info: a root element with one or more assignees, each containing the fields:
  ○ party-name: identifies the party (person or organisation name) receiving an interest or transaction. If it is a person and has a title, remove the title.
  ○ country: the party’s country location, following the same USA/UK/NA rules as the assignor country field above.
  ○ legal-entity-text: a textual description describing the party’s legal entity status.
● property-count: number of properties for a trademark assignment entry.
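The country normalisation described above can be implemented as a small lookup-based helper. The sketch below is illustrative only: the sets of USA/UK variants and US state names are assumptions and must be extended to whatever actually appears in your group's data.

```python
# Illustrative (incomplete) variant sets -- extend these based on your own data.
USA_VARIANTS = {"USA", "U.S.A.", "UNITED STATES", "UNITED STATES OF AMERICA", "US"}
UK_VARIANTS = {"UK", "U.K.", "UNITED KINGDOM", "GREAT BRITAIN", "ENGLAND", "SCOTLAND", "WALES"}
US_STATES = {"CALIFORNIA", "NEW YORK", "TEXAS"}  # assumption: extend to all states seen in the data

def normalise_country(country, state=None, nationality=None):
    """Apply the USA/UK/NA rules described above (hypothetical helper)."""
    value = (country or "").strip().upper()
    if value in USA_VARIANTS:
        return "USA"
    if value in UK_VARIANTS:
        return "UK"
    if not value or value == "NOT PROVIDED":
        # No explicit country: fall back to state, then nationality.
        if state and state.strip().upper() in US_STATES:
            return "USA"
        if nationality and nationality.strip():
            return nationality.strip()
        return "NA"
    return value
```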
Note:
- If any field is empty, put ‘NA’ as the value.
- All the key names are case-sensitive in the output JSON file. You can refer to the sample sample_output_task1.json for the correct JSON structure.
- The output, methodology, and documentation will be marked separately for this task.
Task 1 Guidelines
To complete the above task, please follow the steps below:
Step 0: Study the sample files
● Open and check your input txt file and find patterns for the different data elements.
● Use online tools such as a JSON viewer to better understand the structure of the sample JSON output (sample_output_task1.json).
Step 1: Txt file parsing
● Use a Python library to parse the .txt file.
● Use regular expressions (re) to extract the required attributes and their values listed above (a minimal sketch is given below).
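A minimal sketch of this step, assuming the records use XML-like tags; the tag names shown here are hypothetical, so adapt the patterns to the actual structure of your Group<group_number>.txt file:

```python
import re

# Read the raw text file (assuming UTF-8 encoding).
with open("Group<group_number>.txt", encoding="utf-8") as f:
    raw = f.read()

# Hypothetical tag names -- check your own file for the real attribute markers.
reel_numbers = re.findall(r"<reel-no>(.*?)</reel-no>", raw)
frame_numbers = re.findall(r"<frame-no>(.*?)</frame-no>", raw)
```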
Step 2: Further process the extracted text from Step 1
● Convert escaped XML special characters back to their original form (e.g., &amp; to &).
● Save the data into a proper data structure, e.g., a dataframe or a dictionary (see the sketch below).
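A minimal sketch of this conversion, assuming only the common XML entities need to be handled (extend the mapping if your file contains others); the record values shown are hypothetical:

```python
# Map of common XML escape sequences back to their original characters.
# "&amp;" is handled last to avoid double-unescaping sequences such as "&amp;lt;".
XML_ESCAPES = {
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
    "&apos;": "'",
    "&amp;": "&",
}

def unescape_xml(text):
    for escaped, original in XML_ESCAPES.items():
        text = text.replace(escaped, original)
    return text

# Example: each record could be stored as a dictionary before writing the JSON output.
record = {"rf-id": "12345-0001", "conveyance-text": unescape_xml("MERGER &amp; CHANGE OF NAME")}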
Step 3: JSON file output
● Use a Python library (e.g., json) to convert your data from Step 2 into the required JSON format; make sure you check the spelling, upper/lower case, key names, and key hierarchy of your JSON data (see the sketch below).
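A minimal output sketch using the json library; the records structure and values below are hypothetical, so follow sample_output_task1.json for the real key hierarchy:

```python
import json

# Hypothetical structure and values -- follow sample_output_task1.json for the real hierarchy.
records = {
    "12345-0001": {
        "last-update-date": "2023-01-15",
        "conveyance-text": "ASSIGNS THE ENTIRE INTEREST",
    }
}

with open("task1_<group_number>.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=4, ensure_ascii=False)
```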
Submission Requirements
You need to submit 3 files:
● A task1_<group_number>.json file that contains the correct trademark assignment information with all the elements listed above.
● A Python notebook named task1_<group_number>.ipynb that contains a well-documented report demonstrating your solution to Task 1. You need to clearly present the methodology, that is, the entire step-by-step process of your solution, with appropriate comments and explanations. You can follow the suggested steps in the guideline above. Please keep this notebook easy to read, as you will lose marks if we cannot understand it. (Make sure you PRINT OUT your cell output.)
● A task1_<group_number>.py file. This file will be used for the plagiarism check. (In Google Colab, make sure you clear your cell output before exporting.)
Requirements on the Python notebook (report)
● Methodology - 25%
○ You need to demonstrate your solution using correct regular expressions. Showing the results from each step helps to demonstrate your solution and makes it easier to understand.
○ You should present your solution in a proper way, including all required steps. Skipping any step will incur a grade penalty.
○ You need to select and use the appropriate Python functions for input, processing, and output.
○ Your solution should be efficient, without redundant operations or unnecessary reading and writing of the data.
● Report organisation and writing - 25%
○ The report should be organised in a proper structure to present your solutions to Task 1, with clear and meaningful titles for sections, subsections, and sub-subsections if needed.
○ Each step in your solution should be clearly described. For example, you can explain the idea behind your solution, any specific settings, and the reason for using a particular function.
○ An explanation of your results, including all intermediate steps, is required. This helps the marking team understand your solution and give partial marks if the final results are not fully correct.
○ All your code needs proper (but not excessive) commenting.
○ You can refer to the notebook templates provided as a guideline for a properly formatted notebook report.
Task 2: Text Pre-Processing (16/35)
This task touches on the next step of analysing textual data: converting the extracted text data into a numerical representation so that it can be used for a downstream modelling task. In this task, you are required to write Python code to pre-process a set of youtube comments (in an Excel file) and convert them into numerical representations. Numerical representations are the standard format for text data fed into NLP systems such as recommender systems, information-retrieval algorithms, and machine translation. The most basic step in natural language processing (NLP) tasks is to convert words into numbers so that machines can understand and decode patterns within a language. This step, though iterative, plays a significant role in deciding the features for your machine learning model/algorithm.
Allowed libraries: ALL
Input Files | Output Files (submission)
Group<group_number>.xlsx | <group_number>_channel_list.csv, <group_number>_vocab.txt, <group_number>_countvec.txt, task2_<group_number>.ipynb, task2_<group_number>.py
Your group is provided with a unique dataset containing youtube comments (see sample_input_task2.xlsx). Please use the data file with your group_number, i.e., <group_number>.xlsx, in the Google Drive folder (student_data). The Excel file contains multiple worksheets of youtube comment data. Each worksheet table has two columns:
● id: unique comment identifier.
● Snippet: a JSON array that contains information about one top-level comment for a particular youtube video, such as the comment text, and the channel and video the comment is under.
You are asked to extract the ‘textOriginal’ field of all top-level comments for all the youtube videos that we provide to you. Then pre-process the comment text and generate a vocabulary list and a numerical representation for the corresponding text, which will be used in model training by your colleagues. The required output files are listed below:
● <group_number>_channel_list.csv contains the unique channel ids along with the counts of top-level comments (all languages, and English only).
● <group_number>_vocab.txt comprises unique stemmed tokens sorted alphabetically, presented in the format token_index:token, as outlined in Guideline step 4.
● <group_number>_countvec.txt includes numerical representations of all tokens, organised by channel_id and token index, following the format channel_id, token_index:frequency, as outlined in Guideline step 5.
Carefully examine the sample files (here) for detailed information about the output structure.
VERY IMPORTANT NOTE: The sample outputs are only there to help you understand the structure of the required output; the correctness of their content for task 2 is not guaranteed. So please do not try to reverse engineer the outputs, as doing so will not produce the correct content.
Task 2 Guideline
To complete the above task, please follow the steps below:
Step 1: Data import
● Each Excel file contains multiple worksheets, and the data are positioned differently in each worksheet.
● You are required to combine all the data together and remove any duplicates before performing the next step (see the sketch below).
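A minimal import sketch using pandas. The header-detection logic is an assumption, since the exact layout of each worksheet varies; adjust it to your own file:

```python
import pandas as pd

# Read every worksheet at once; sheet_name=None returns a dict of DataFrames.
sheets = pd.read_excel("Group<group_number>.xlsx", sheet_name=None, header=None)

frames = []
for name, df in sheets.items():
    # Assumption: locate the header row by searching for the 'id' column label,
    # since the data are positioned differently in each worksheet.
    header_row = df.index[df.eq("id").any(axis=1)][0]
    df.columns = df.iloc[header_row]
    frames.append(df.iloc[header_row + 1:])

combined = pd.concat(frames, ignore_index=True).dropna(how="all").drop_duplicates()
```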
Step 2: Text extraction and cleaning
● You are required to extract the ‘textOriginal’ field of all top-level comments.
● Since the comment data contain emojis, you will need to remove the emojis and normalise the text to lower case for further analysis.
  ○ To remove emojis, make sure your text data is in utf-8 format.
  ○ The list of emojis to remove is in emoji.txt.
● You only need to extract the vocab and countvec lists for English comments from channels that have at least 15 English comments, by using the langdetect library with DetectorFactory.seed = 0. Note that you are deciding the language at the comment level, not the sentence level. (A sketch of the cleaning and language detection follows this list.)
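A minimal sketch of the emoji removal and language detection, assuming emoji.txt holds one emoji per line; adapt the details to the actual structure of your data:

```python
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect deterministic, as required

# Load the emoji list (assumption: one emoji per line, utf-8 encoded).
with open("emoji.txt", encoding="utf-8") as f:
    emojis = set(f.read().split())

def clean_comment(text):
    """Remove listed emojis and lower-case the comment text."""
    return "".join(ch for ch in text if ch not in emojis).lower()

def is_english(text):
    """Decide the language at the comment level, not the sentence level."""
    try:
        return detect(text) == "en"
    except Exception:  # langdetect raises an error on empty/undetectable text
        return False
```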
Step 3: Generate csv file
● Generate a csv file that contains the unique channel ids along with the counts of top-level comments (all languages, and English only); a sketch follows this list.
● The column names are: channel_id, all_comment_count, eng_comment_count.
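A minimal sketch of the channel list, assuming a pandas DataFrame of comments with a channel_id column and a boolean is_english flag produced in Step 2 (both names are hypothetical):

```python
import pandas as pd

# Hypothetical input: one row per top-level comment.
comments = pd.DataFrame({
    "channel_id": ["UC_a", "UC_a", "UC_b"],
    "is_english": [True, False, True],
})

channel_list = (
    comments.groupby("channel_id")
    .agg(all_comment_count=("is_english", "size"),
         eng_comment_count=("is_english", "sum"))
    .reset_index()
)
channel_list.to_csv("<group_number>_channel_list.csv", index=False)
```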
Step 4: Generate the unigram and bigram lists and output as vocab.txt
● The following steps must be performed (not necessarily in this order) to complete the assessment. Please note that the order of preprocessing matters and will result in different vocabularies and hence different count vectors. It is part of the assessment to figure out the order of preprocessing that makes the most sense, as we learned in the unit. You are encouraged to ask questions and discuss them with the teaching team if in doubt.
  ○ The word tokenization must use the following regular expression: "[a-zA-Z]+".
  ○ The context-independent and context-dependent stopwords must be removed from the vocabulary.
    - For context-independent stopwords, the provided stop words list (i.e., stopwords_en.txt) must be used.
    - For context-dependent stopwords, you must set the threshold to words that appear in more than 99% of the channel_ids that have at least 15 English comments.
  ○ Tokens should be stemmed using the Porter stemmer.
  ○ Rare tokens must be removed from the vocab, with the threshold set to words that appear in less than 1% of the channel_ids that have at least 15 English comments.
  ○ Tokens with a length less than 3 should be removed from the vocab.
  ○ The first 200 meaningful bigrams (i.e., collocations) must be included in the vocab using the PMI measure; make sure the collocations are collocated within the same comment.
  ○ Build the vocabulary containing both unigrams and bigrams.
● Combine the unigrams and bigrams, sort the list alphabetically in ascending order, and output it as vocab.txt. (An illustrative sketch of the relevant library calls follows this list.)
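The sketch below only illustrates the nltk calls involved (regex tokenization, Porter stemming, and PMI-based collocation finding); it is not the required preprocessing order, which you must work out yourself. The comments list is a hypothetical stand-in for your cleaned English comments:

```python
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokenizer = RegexpTokenizer(r"[a-zA-Z]+")
stemmer = PorterStemmer()

# Hypothetical cleaned English comments (one string per comment).
comments = ["this is a sample comment", "another sample comment about videos"]
tokenised_comments = [tokenizer.tokenize(c) for c in comments]

# Stop-word removal (stopwords_en.txt is the provided context-independent list).
with open("stopwords_en.txt") as f:
    stopwords = set(f.read().split())
filtered = [[t for t in toks if t not in stopwords] for toks in tokenised_comments]

# Porter stemming of unigrams.
stemmed = [[stemmer.stem(t) for t in toks] for toks in filtered]

# First 200 bigrams by PMI; from_documents keeps collocations within one comment.
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_documents(filtered)
top_bigrams = finder.nbest(bigram_measures.pmi, 200)
```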
Step 5: Generate the sparse numerical representation and output as countvec.txt
● Re-tokenize your text based on the bigram list generated in step 4 if necessary.
● Generate the sparse representation by using the CountVectorizer() function OR by directly counting the frequencies with FreqDist().
● Map the generated tokens to the vocab from step 4 if needed.
● Output the sparse numerical representation into a txt file with the following format (see the sketch below):
channel_id1,token1_index:token1_frequency, token2_index:token2_frequency, token3_index:token3_frequency, ...
channel_id2,token2_index:token2_frequency, token5_index:token5_frequency, token7_index:token7_frequency, ...
channel_id3,token6_index:token6_frequency, token9_index:token9_frequency, token12_index:token12_frequency, ...
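A minimal sketch of this step using FreqDist, assuming channel_tokens maps each channel_id to its final list of tokens (unigrams plus bigrams) and vocab maps each token to its index from step 4; both variables are hypothetical:

```python
from nltk.probability import FreqDist

# Hypothetical inputs built in the previous steps.
vocab = {"comment": 0, "sample": 1, "sample_comment": 2}
channel_tokens = {"UC_a": ["sample", "comment", "sample_comment", "sample"]}

with open("<group_number>_countvec.txt", "w", encoding="utf-8") as out:
    for channel_id, tokens in channel_tokens.items():
        # Count only tokens that are in the vocabulary; zero counts are never written.
        freq = FreqDist(t for t in tokens if t in vocab)
        pairs = ",".join(f"{vocab[t]}:{count}"
                         for t, count in sorted(freq.items(), key=lambda x: vocab[x[0]]))
        out.write(f"{channel_id},{pairs}\n")
```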
Submission Requirements
You need to submit 5 files:
● A <group_number>_channel_list.csv file that contains the top-level comment counts (all languages, and English only) for each channel id.
● A <group_number>_vocab.txt file that contains the unigram and bigram tokens in the format token:token_index. Words in the vocabulary must be sorted in alphabetical order.
● A <group_number>_countvec.txt file, in which each line contains the sparse representation of one channel in the following format:
channel_id, token1_index:token1_wordcount, token2_index:token2_wordcount, ...
Please note: tokens with a zero word count should NOT be included in the sparse representation.
● A task2_<group_number>.ipynb file that contains your report explaining the code and the methodology. (Make sure you PRINT OUT your cell outputs.)
● A task2_<group_number>.py file for plagiarism checks. (Make sure you clear your cell outputs.)
Requirements on the Python notebook (report)
● Methodology - 25%
○ You need to demonstrate your solution using correct regular expressions.
○ You should present your solution in a proper way, including all required steps.
○ You need to select and use the appropriate Python functions for input, processing, and output.
○ Your solution should be efficient, without redundant operations or unnecessary reading and writing of the data.
● Report organisation and writing - 25%
○ The report should be organised in a proper structure to present your solutions to Task 2, with clear and meaningful titles for sections, subsections, and sub-subsections if needed.
○ Each step in your solution should be clearly described. For example, you can explain the idea behind your solution, any specific settings, and the reason for using a particular function.
○ An explanation of your results, including all intermediate steps, is required. This helps the marking team understand your solution and give partial marks if the final results are not fully correct.
○ All your code needs proper (but not excessive) commenting.
○ You can refer to the notebook templates provided as a guideline for a properly formatted notebook report.
Task 3: Development History (2/35)
For this task, your group is required to provide a comprehensive development history of your assignment, showcasing incremental progress over at least three different time points. The purpose of this task is to demonstrate your ability to manage and document the evolution of your project, including changes made, challenges faced, and collaborative efforts with your group mates.
Submission Requirements
Here are the key components you need to include in your submission(example):
1. Development Timeline: Provide a detailed timeline highlighting at least three significant time points in the development of your assignment. Each time point should be accompanied by a description of the changes made and the rationale behind those changes.
2. Version Screenshots: Include screenshots or snapshots of different versions of your assignment at each time point. This should clearly illustrate the incremental development and any modifications made to the project.
3. Collaborative Effort (if you are doing the assignment with another student): Document the collaborative effort with your group mates. This can be a description of the contributions made by each team member, or screenshots showing the collaborative effort with your group mates.
Note: We only require brief descriptions here; they don’t have to be long, as long as they reflect the key components listed above. We recommend you use Google Colab to complete your assignment, as Google Colab notebooks provide a comprehensive version history.
Output Files (submission)
<group_number>_development_history_task1.pdf, <group_number>_development_history_task2.pdf
Instructions - Sharing the ipynb link
1. Click on the Share button in the top right corner.
2. Under the General access section, make sure the permissions are set to ‘Monash University’ and ‘Editor’, then click on `Copy link`.
3. Create a markdown cell at the end of your assignment and paste the share link / create a hyperlink object.
4. Double check the link to make sure it is working.