
FIT5196 Data wrangling - assessment 1: Extracting data from semi-structured text files


FIT5196-S1-2024 assessment 1

This is a group assessment and is worth 35% of your total mark for FIT5196.

Text documents, such as crawled web data, are usually composed of topically coherent segments; within each segment, one would expect word usage to show a more consistent lexical distribution than across the dataset as a whole. A linear partition of texts into topic segments can be used for text analysis tasks, such as passage retrieval in IR (information retrieval), document summarization, recommender systems, and learning-to-rank methods.

Task 1: Parsing Raw Text Files (17/35)

This assessment touches the very first step of analysing textual data, i.e., extracting data from semi-structured text files.

Allowed libraries: re, json, pandas

Your group is provided with a unique dataset containing YouTube comments. Please use the data file with your group_number, i.e. Group<group_number>.txt in the Google drive folder (student_data). Note: Using a wrong input dataset will result in ZERO marks for ‘Output’ in the marking rubric.

Your dataset contains a subset of records on trademark assignments (please find your group file on the Google drive, i.e., Group<group_number>.txt). The trademark assignments are recorded with a set of attributes, e.g., reel no, frame no, assignors, assignees etc. Please check the sample input file (sample_input_task1.txt) for all the available attributes.

Input Files: Group<group_number>.txt

Output Files (submission): task1_<group_number>.json, task1_<group_number>.ipynb, task1_<group_number>.py

Your task is to extract the data and transform it into the JSON format with the following elements:

● rf-id: a unique numerical ID for a trademark assignment entry. It is a combination of reel-no and frame-no.

● last-update-date: A date that identifies when the assignment record was last modified (output format YYYY-MM-DD in digits).

● conveyance-text: Contains a textual description of the interest conveyed or transaction recorded.

● correspondent-party: The name of a person or organisation to whom correspondence related to the assignment is addressed. You don’t need to reformat the name for this entity.

● assignors-info: a root element with one or more assignors, containing the fields:

  ○ party-name: Identifies the party (person or organisation name) conveying an interest or transaction. If it is a person and has a title, remove the title.

  ○ date-acknowledged: Date on which the supporting legal documentation was acknowledged (output format YYYY-MM-DD in digits).

  ○ execution-date: Date on which the supporting legal documentation was executed (output format YYYY-MM-DD in digits).

  ○ country: The party’s country location. Put “USA” as the value if the country name is a variant of USA, and “UK” if the country name is a variant of the UK. When there is no explicit country name, you can infer the country from the state; when there is no explicit state, you can assume that the nationality is the party’s country location. If the country cannot be determined, or the value is NOT PROVIDED, place “NA” as the value.

  ○ legal-entity-text: A textual description of the party’s legal entity status.

● assignees-info: a root element with one or more assignees, containing the fields:

  ○ party-name: Identifies the party (person or organisation name) receiving an interest or transaction. If it is a person and has a title, remove the title.

  ○ country: The party’s country location. Put “USA” as the value if the country name is a variant of USA, and “UK” if the country name is a variant of the UK. When there is no explicit country name, you can infer the country from the state; when there is no explicit state, you can assume that the nationality is the party’s country location. If the country cannot be determined, or the value is NOT PROVIDED, place “NA” as the value.

  ○ legal-entity-text: A textual description of the party’s legal entity status.

● property-count: number of properties for a trademark assignment entry.

  1. If any field is empty, put ‘NA’ as the value.

  2. All the tag names are case-sensitive in the output JSON file. You can refer to the sample file sample_output_task1.json for the correct JSON file structure.

  3. The output, methodology, and documentation will be marked separately for this task.
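
The country and title rules listed above can be sketched as two small helpers. This is a minimal sketch rather than the expected solution: the variant sets, the state lookup, the title pattern and the function names are all hypothetical placeholders that would need to be built from your own data, and it assumes that a title appears as a prefix of the name.

```python
import re

# Hypothetical variant lists: extend them after inspecting your own data.
USA_VARIANTS = {"usa", "us", "u.s.", "u.s.a.", "united states",
                "united states of america"}
UK_VARIANTS = {"uk", "u.k.", "united kingdom", "great britain"}
# Hypothetical state lookup (two entries only, for illustration).
STATE_TO_COUNTRY = {"california": "USA", "new york": "USA"}
# Assumed title prefixes; the real data may use others.
TITLE_RE = re.compile(r"^(?:mr|mrs|ms|dr|prof)\.?\s+", re.IGNORECASE)


def _clean(value):
    """Trim a raw value and treat None or NOT PROVIDED as empty."""
    value = (value or "").strip()
    return "" if value.upper() == "NOT PROVIDED" else value


def _canonical(name):
    """Map known USA/UK variants to their canonical label."""
    key = name.lower()
    if key in USA_VARIANTS:
        return "USA"
    if key in UK_VARIANTS:
        return "UK"
    return name


def normalise_country(country=None, state=None, nationality=None):
    """Apply the fallback chain: country, then state, then nationality."""
    country = _clean(country)
    if country:
        return _canonical(country)
    state = _clean(state)
    if state.lower() in STATE_TO_COUNTRY:
        return STATE_TO_COUNTRY[state.lower()]
    nationality = _clean(nationality)
    if nationality:
        return _canonical(nationality)
    return "NA"


def strip_title(name):
    """Remove a leading title such as 'Dr.' from a person's name."""
    return TITLE_RE.sub("", name.strip())
```

For instance, `normalise_country(None, state="California")` falls through to the state lookup, while a value of NOT PROVIDED with no state or nationality ends up as "NA".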

Task 1 Guidelines

To complete the above task, please follow the steps below:

Step 0: Study the sample files

● Open and check your input txt file and find patterns for different data elements.

● Use online web applications such as xmlviewer to better understand the structure of the sample output.

Step 1: Txt file parsing

● Use a Python library to parse the .txt file.

● Use regex to extract the required attributes and their values as listed above.

Step 2: Further process the extracted text from Step 1

● Use a Python library to transfer your data from Step 2 into proper JSON format (make sure you check the spelling, upper/lower case, key names and name hierarchy of your JSON data).
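
The steps above can be sketched end to end with only `re` and `json`. The raw record below is entirely hypothetical: the tag-like layout, the `YYYYMMDD` date format, and the choice to build rf-id by concatenating reel-no and frame-no are assumptions for illustration, and the real patterns must be read off sample_input_task1.txt and sample_output_task1.json.

```python
import json
import re

# Hypothetical raw record; the real layout must be taken from the sample input.
raw = """
<assignment>
<reel-no>1234</reel-no><frame-no>0567</frame-no>
<last-update-date>20140122</last-update-date>
<conveyance-text>ASSIGNS THE ENTIRE INTEREST</conveyance-text>
<correspondent-party></correspondent-party>
</assignment>
"""


def field(tag, text):
    """Return the text inside <tag>...</tag>, or 'NA' if missing or empty."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    value = match.group(1).strip() if match else ""
    return value or "NA"


def iso_date(value):
    """Convert an assumed YYYYMMDD string to YYYY-MM-DD, else 'NA'."""
    match = re.fullmatch(r"(\d{4})(\d{2})(\d{2})", value)
    return "-".join(match.groups()) if match else "NA"


records = []
for block in re.findall(r"<assignment>(.*?)</assignment>", raw, flags=re.DOTALL):
    records.append({
        # Assumption: rf-id is reel-no followed by frame-no.
        "rf-id": field("reel-no", block) + field("frame-no", block),
        "last-update-date": iso_date(field("last-update-date", block)),
        "conveyance-text": field("conveyance-text", block),
        "correspondent-party": field("correspondent-party", block),
    })

# Serialise with the exact key names required by the specification.
output = json.dumps(records, indent=2)
```

Keeping one small helper per concern (field lookup, date formatting) makes it easier to show intermediate results in the notebook.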

Submission Requirements

You need to submit 3 files:

● A task1_<group_number>.json file that contains the correct review information with all the elements listed above.

● A Python notebook named task1_<group_number>.ipynb that contains a well-documented report demonstrating your solution to Task 1. You need to clearly present the methodology, that is, the entire step-by-step process of your solution with appropriate comments and explanations. You can follow the suggested steps in the guideline above. Please keep this notebook easy to read, as you will lose marks if we cannot understand it. (make sure you PRINT OUT your cell output)

● A task1_<group_number>.py file. This file will be used for plagiarism check. (make sure you clear your cell output before exporting)


Requirements on the Python notebook (report)

● Methodology - 25%

  ○ You need to demonstrate your solution using correct regular expressions. Results from each step could help to demonstrate your solution better and make it easier to understand.

  ○ You should present your solution in a proper way, including all required steps. Skipping any step will incur a grade penalty.

  ○ You need to select and use the appropriate Python functions for input, process and output.

  ○ Your solution should be efficient, without redundant operations or unnecessary reading and writing of data.

● Report organisation and writing - 25%

  ○ The report should be organised in a proper structure to present your solutions to Task 1, with clear and meaningful titles for sections and subsections (or sub-subsections if needed).

  ○ Each step in your solution should be clearly described. For example, you can explain your idea of the solution, any specific settings, and the reason for using a particular function, etc.

  ○ Explanation of your results, including all intermediate steps, is required. This can help the marking team to understand your solution and give partial marks if the final results are not fully correct.

  ○ All your code needs proper (but not excessive) commenting.

  ○ You can refer to the notebook templates provided as a guideline for a properly formatted notebook report.

Task 2: Text Pre-Processing (16/35)

This task touches on the next step of analysing textual data: converting the extracted text data into a numerical representation so that it can be used for a downstream modelling task. In this task, you are required to write Python code to pre-process a set of YouTube comments (in an Excel file) and convert them into numerical representations. The numerical representation is the standard format of text data when it is fed into NLP systems such as recommender systems, information-retrieval algorithms and machine translation. The most basic step for natural language processing (NLP) tasks is to convert words into numbers for machines to understand and decode patterns within a language. This step, though iterative, plays a significant role in deciding features for your machine learning model/algorithm.

Allowed libraries: ALL

Input Files: Group<group_number>.xlsx

Output Files (submission): <group_number>_channel_list.csv, <group_number>_vocab.txt, <group_number>_countvec.txt, task2_<group_number>.ipynb, task2_<group_number>.py

Your group is provided with a unique dataset containing YouTube comments (see sample_input_task2.xlsx). Please use the data file with your group_number, i.e. <group_number>.xlsx in the Google drive folder (student_data). The Excel file contains worksheets with many YouTube comments. The Excel tables have two columns:

● id: Unique comment identifier

● Snippet: a JSON array that contains information about one top level comment for a particular YouTube video, such as the comment text, the channel and video the comment is under.

You are asked to extract the ‘textOriginal’ fields in all top level comments for all YouTube video lists that we provide to you. Then pre-process the extracted text and generate a vocabulary list and numerical representation for the corresponding text, which will be used in the model training by your colleagues. The information regarding output files is listed below:

● <group_number>_channel_list.csv contains unique channel ids along with the counts of top level comments (all languages, and English only).

● <group_number>_vocab.txt comprises unique stemmed tokens sorted alphabetically, presented in the format of token_index:token, as outlined in Guideline step 4.

● <group_number>_countvec.txt includes numerical representations of all tokens, organised by channel_id and token index, following the format channel_id, token_index:frequency, as outlined in Guideline step 5.

Carefully examine the sample files for detailed information about the output structure.

VERY IMPORTANT NOTE: The sample outputs are just for you to understand the structure of the required output, and the correctness of their content in Task 2 is not guaranteed. So please do not try to reverse engineer the outputs, as doing so will fail to generate the correct content.

Task 2 Guideline

To complete the above task, please follow the steps below:

Step 1: Data import

● Each Excel file contains multiple worksheets. Data are positioned differently in each worksheet.

● You are required to combine all data together and remove any duplicates before performing the next step.
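
One way to cope with tables that start at a different row and column in each worksheet is to search each sheet for the header row and read the data relative to it. The sketch below assumes the worksheets have already been loaded as lists of row tuples, with `None` for empty cells; the sample sheets, the lower-cased column names and the function names are hypothetical.

```python
# Hypothetical worksheets, already loaded as lists of row tuples.
# None marks an empty cell; the table starts at a different offset in each.
sheet_a = [
    (None, None, None),
    (None, "id", "snippet"),
    (None, "c1", '{"k": 1}'),
    (None, "c2", '{"k": 2}'),
]
sheet_b = [
    ("id", "snippet"),
    ("c2", '{"k": 2}'),
    ("c3", '{"k": 3}'),
]


def extract_table(rows, columns=("id", "snippet")):
    """Find the header row, then return the rows below it as dicts."""
    for r, row in enumerate(rows):
        cells = [str(c).strip().lower() if c is not None else "" for c in row]
        if all(col in cells for col in columns):
            idx = {col: cells.index(col) for col in columns}
            return [
                {col: data[i] for col, i in idx.items()}
                for data in rows[r + 1:]
                if any(data[i] is not None for i in idx.values())
            ]
    return []


def combine(sheets):
    """Concatenate all worksheets and drop exact duplicate records."""
    seen, combined = set(), []
    for rows in sheets:
        for rec in extract_table(rows):
            key = (rec["id"], rec["snippet"])
            if key not in seen:
                seen.add(key)
                combined.append(rec)
    return combined


records = combine([sheet_a, sheet_b])
```

With pandas, a similar result can be reached by reading every sheet with `pandas.read_excel(path, sheet_name=None)`, dropping the empty rows and columns, concatenating the frames and calling `drop_duplicates()`.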

Step 2: Text extraction and cleaning

● You are required to extract the ‘textOriginal’ fields in all top level comments.

● Since the comment data contain emojis, you will need to remove the emojis and normalise the text into lower case for further analysis.

  ○ To remove emojis, make sure your text data is in utf-8 format.

  ○ The list of emojis to remove is in emoji.txt.

● You are only required to extract the vocab and countvec lists for English comments from channels that have at least 15 English comments, using the langdetect library with DetectorFactory.seed = 0. Note that you are deciding the language at the comment level, not the sentence level.
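
The extraction and cleaning bullets above can be sketched as follows. The nesting of the snippet (`topLevelComment` → `snippet` → `textOriginal`) is an assumption that has to be checked against the real data, and `emoji_list` is a stand-in for the contents of emoji.txt, whose exact layout is not described here. Language detection is left out so that the sketch runs with the standard library only.

```python
import json

# Hypothetical snippet; the nesting is an assumption, not the confirmed layout.
snippet = json.dumps({
    "channelId": "UC_demo",
    "videoId": "vid_demo",
    "topLevelComment": {
        "snippet": {"textOriginal": "Great Video \U0001F600 Thanks!"}
    },
})

# Stand-in for the emojis listed in emoji.txt.
emoji_list = ["\U0001F600", "\U0001F44D"]


def extract_text(snippet_json):
    """Pull the original comment text out of one snippet string."""
    data = json.loads(snippet_json)
    return data["topLevelComment"]["snippet"]["textOriginal"]


def clean_text(text, emojis):
    """Remove every listed emoji, then lower-case the text."""
    for emoji in emojis:
        text = text.replace(emoji, "")
    return text.lower()


cleaned = clean_text(extract_text(snippet), emoji_list)
```

The cleaned string would then be passed to `langdetect.detect` once per comment, after setting `DetectorFactory.seed = 0`, to decide whether the comment counts as English.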

Step 3: Generate csv file

● Generate a csv file that contains unique channel ids along with the counts of top level comments (all languages, and English).

● The column names are: channel_id, all_comment_count, eng_comment_count.
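
A minimal sketch of the counting behind this csv file, assuming each top level comment has already been reduced to a (channel id, detected language) pair. The sample pairs are hypothetical, and the output is written to an in-memory buffer instead of <group_number>_channel_list.csv so the example stays self-contained.

```python
import csv
import io
from collections import Counter

# Hypothetical (channel_id, detected_language) pairs, one per comment.
comments = [("chA", "en"), ("chA", "fr"), ("chA", "en"), ("chB", "de")]

all_counts = Counter(channel for channel, _ in comments)
eng_counts = Counter(channel for channel, lang in comments if lang == "en")

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["channel_id", "all_comment_count", "eng_comment_count"])
for channel in sorted(all_counts):
    # A Counter returns 0 for channels without any English comment.
    writer.writerow([channel, all_counts[channel], eng_counts[channel]])

csv_text = buffer.getvalue()
```

The same `eng_counts` mapping can be reused later to select the channels with at least 15 English comments.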

Step 4: Generate the unigram and bigram lists and output as vocab.txt

● The following steps must be performed (not necessarily in the same order) to complete the assessment. Please note that the order of preprocessing matters and will result in different vocabulary and hence different count vectors. It is part of the assessment to figure out the correct order of preprocessing which makes the most sense as we learned in the unit. You are encouraged to ask questions and discuss them with the teaching team if in doubt.

  1. The word tokenization must use the following regular expression, "[a-zA-Z]+"

  2. The context-independent and context-dependent stopwords must be removed from the vocabulary.

  3. Tokens should be stemmed using the Porter stemmer.

  4. Rare tokens must be removed from the vocab (with the threshold set to words that appear in less than 1% of the channel_ids that have at least 15 English comments).

  5. Tokens with a length less than 3 should be removed from the vocab.

  6. The first 200 meaningful bigrams (i.e., collocations) must be included in the vocab using the PMI measure; make sure the collocations are collocated within the same comment.

  7. Calculate the vocabulary containing both unigrams and bigrams.

● Combine the unigrams and bigrams, sort the list alphabetically in ascending order and output as vocab.txt
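
Two of the items above, the required tokenizer and PMI-ranked bigrams that stay within one comment, can be illustrated with the standard library. This is a sketch under assumptions: the sample comments are hypothetical, PMI is computed from its plain definition, log2(p(x, y) / (p(x) p(y))), which may differ slightly from a library implementation, and stopword removal, stemming and the rare-token filter are deliberately left out because their order is part of the assessment.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-zA-Z]+")  # the regular expression required above

# Hypothetical cleaned comments.
comments = [
    "machine learning is fun",
    "machine learning needs data",
    "data is everywhere",
]

tokenised = [TOKEN_RE.findall(comment.lower()) for comment in comments]

unigram_counts = Counter(tok for toks in tokenised for tok in toks)
# Pair adjacent tokens inside one comment only, so that no bigram
# spans the boundary between two comments.
bigram_counts = Counter(pair for toks in tokenised for pair in zip(toks, toks[1:]))

n_unigrams = sum(unigram_counts.values())
n_bigrams = sum(bigram_counts.values())


def pmi(bigram):
    """Pointwise mutual information of a bigram, in bits."""
    first, second = bigram
    p_xy = bigram_counts[bigram] / n_bigrams
    p_x = unigram_counts[first] / n_unigrams
    p_y = unigram_counts[second] / n_unigrams
    return math.log2(p_xy / (p_x * p_y))


# Keep the highest-scoring bigrams (2 here; the task asks for 200).
top_bigrams = sorted(bigram_counts, key=lambda b: (-pmi(b), b))[:2]
```

Ties in PMI are broken alphabetically here only to make the ranking deterministic; that tie-break is a choice made for this sketch.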

Step 5: Generate the sparse numerical representation and output as countvec.txt

  1. Re-tokenize your text based on the bigram list generated in step 4 if necessary.

  2. Generate the sparse representation by using CountVectorizer OR directly count the frequency using FreqDist().

  3. Map the generated tokens to the vocab from step 4 if needed.

  4. Output the sparse numerical representation into a txt file with the following format:

     channel_id,token1_index:token1_frequency, token2_index:token2_frequency, ...
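
The counting and mapping in this step can be sketched with `collections.Counter`, used here as a standard-library stand-in for FreqDist. The vocabulary, the channel token lists and the underscore used to join bigram tokens are hypothetical, and the separator between entries should be matched against the sample output file.

```python
from collections import Counter

# Hypothetical vocabulary (token -> index) and per-channel token lists.
vocab = {"data": 0, "learn": 1, "machin_learn": 2}
channel_tokens = {
    "chA": ["data", "learn", "data", "unknown"],
    "chB": ["machin_learn"],
}


def sparse_line(channel_id, tokens, vocab):
    """Build one output line; tokens outside the vocab are ignored."""
    counts = Counter(tok for tok in tokens if tok in vocab)
    pairs = sorted((vocab[tok], n) for tok, n in counts.items())
    return ",".join([channel_id] + [f"{index}:{n}" for index, n in pairs])


lines = [
    sparse_line(channel, tokens, vocab)
    for channel, tokens in sorted(channel_tokens.items())
]
```

Because only tokens that actually occur are counted, entries with a zero count never appear in a line.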

Submission Requirements

You need to submit 5 files:

  1. A <group_number>_channel_list.csv file that contains information about the top level comment count (all languages, and English) for each channel id.

  2. A <group_number>_vocab.txt file that contains the unigram and bigram tokens in the following format: token:token_index. Words in the vocabulary must be sorted in alphabetical order.

  3. A <group_number>_countvec.txt file, in which each line contains the sparse representation of one channel in the following format:

     channel_id, token1_index:token1_wordcount, token2_index:token2_wordcount, ...

     For example:

     channel_id1,token1_index:token1_frequency, token2_index:token2_frequency, token3_index:token3_frequency, ...
     channel_id2,token2_index:token2_frequency, token5_index:token5_frequency, token7_index:token7_frequency, ...
     channel_id3,token6_index:token6_frequency, token9_index:token9_frequency, token12_index:token12_frequency, ...

     Please note: the tokens with zero word count should NOT be included in the sparse representation.

  4. A task2_<group_number>.ipynb file that contains your report explaining the code and the methodology. (make sure you PRINT OUT your cell outputs)

  5. A task2_<group_number>.py file for plagiarism checks. (make sure you clear your cell outputs)

Requirements on the Python notebook (report)

● Methodology - 25%

● Report organisation and writing - 25%

  ○ The report should be organised in a proper structure to present your solutions to Task 2, with clear and meaningful titles for sections and subsections (or sub-subsections if needed).

  ○ Each step in your solution should be clearly described. For example, you can explain your idea of the solution, any specific settings, and the reason for using a particular function, etc.

  ○ Explanation of your results, including all intermediate steps, is required. This can help the marking team to understand your solution and give partial marks if the final results are not fully correct.

  ○ All your code needs proper (but not excessive) commenting.

  ○ You can refer to the notebook templates provided as a guideline for a properly formatted notebook report.

Task 3: Development History (2/35)

For this task, your group is required to provide a comprehensive development history of your assignment, showcasing incremental progress over at least three different time points. The purpose of this task is to demonstrate your ability to manage and document the evolution of your project, including changes made, challenges faced, and collaborative efforts with your group mates.

Submission Requirements

Here are the key components you need to include in your submission (example):

  1. Development Timeline: Provide a detailed timeline highlighting at least three significant time points in the development of your assignment. Each time point should be accompanied by a description of the changes made and the rationale behind those changes.

  2. Version Screenshots: Include screenshots or snapshots of different versions of your assignment at each time point. This should clearly illustrate the incremental development and any modifications made to the project.

  3. Collaborative Effort (if you are doing the assignment with another student): Document the collaborative effort with your group mates. This can be a description of the contributions made by each team member or screenshots of proof showcasing the collaborative effort with your group mates.

Note: We only require brief descriptions here; they don’t have to be long, as long as they reflect the key components we listed above. We recommend that you use Google Colab to complete your assignment, as Google Colab notebooks provide a comprehensive version history.

Output Files (submission): <group_number>_development_history_task1.pdf, <group_number>_development_history_task2.pdf

Instructions - Sharing ipynb link

  1. Click on the Share button in the top right corner.

  2. Under the General access section, make sure the permission is set to ‘Monash University’ and ‘Editor’, then click on `Copy link`.

  3. Create a markdown cell at the end of your assignment. Paste the share link or create a hyperlink object.

  4. Double check the link to make sure it is working.
