Assignment 3: Deduplication

Beatrix Jones 9/19/2022

Consider the file “articles.csv”, which contains information gathered for a systematic review. A systematic review aims to capture all articles about a given topic by running a specific search in several databases and then pooling the results. As most article databases (e.g. Scopus, Web of Science) overlap significantly, an important step is to “deduplicate” the pooled results. Because the goal is to capture all articles, it is important that articles that are not duplicates are not removed. This is complicated by the fact that records for the same article frequently differ in formatting (capitalization, punctuation, abbreviations, etc.). The people conducting the review have agreed on the following criteria: two articles will be considered identical if they have the same title (disregarding differences in spacing, capitalization, and punctuation), the same year of publication, and the same first 6 characters of the first author’s name (again after stripping capitalization, spacing, and punctuation).
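
The matching criteria above can be sketched as a single comparison key. This is only an illustration under stated assumptions, not the required solution: the function names `normalise` and `dedup_key` and the example records are hypothetical, and the column layout of articles.csv is not assumed. Stripping everything outside `a-z0-9` also drops accented letters, which may or may not be what the reviewers intend.

```r
# Hypothetical helper: lower-case the text, then drop every character
# that is not an ASCII letter or digit (this removes spaces and punctuation).
normalise <- function(x) {
  gsub("[^a-z0-9]", "", tolower(x))
}

# Hypothetical helper: combine normalised title, year, and the first
# 6 characters of the normalised first-author name into one key.
dedup_key <- function(title, year, first_author) {
  paste(normalise(title), year, substr(normalise(first_author), 1, 6), sep = "|")
}

# Two made-up records that differ only in formatting.
a <- dedup_key("Deep Learning: A Review", 2020, "O'Connor, J.")
b <- dedup_key("deep  learning - a review.", 2020, "OConnor J")

# A made-up record with a different publication year.
d <- dedup_key("Deep Learning: A Review", 2021, "O'Connor, J.")

identical(a, b)  # TRUE: same key despite formatting differences
identical(a, d)  # FALSE: the year differs
```

Two records are treated as duplicates exactly when their keys are identical, so a different year or a different first author keeps the records apart.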

We will not do the full deduplication in this assignment. There is a subproblem that must be solved first (although the context above is important). A subset of articles each have a line in articles.csv but have very little information filled in. In particular, they have no title, which makes the deduplication strategy tricky. However, much of the information we might want for our strategy is embedded in the “url” field. Your task is to find the articles with no title and repopulate any missing fields needed for the deduplication strategy using the url information. Note that because the result will be used for deduplication, retaining things like punctuation is not important.
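
As a starting point for the first step only (finding the records with no title), a minimal sketch follows. It assumes the title column is named `title` and that a missing title may appear as `NA`, an empty string, or whitespace only; the toy data frame and its URLs are invented for illustration and do not reflect the real contents or URL format of articles.csv.

```r
# Toy stand-in for articles.csv; column names and values are assumptions.
articles <- data.frame(
  title = c("A real title", "", NA, "   "),
  url = c("https://example.org/a", "https://example.org/b",
          "https://example.org/c", "https://example.org/d"),
  stringsAsFactors = FALSE
)

# A title counts as missing if it is NA or empty after trimming whitespace.
no_title <- is.na(articles$title) | trimws(articles$title) == ""

# Row numbers of the records that need repopulating from the url field.
missing_rows <- which(no_title)
missing_rows

# The url values that would be parsed in the next step.
articles$url[no_title]
```

How the url values are then split into title, year, and author depends on the structure of the real URLs, which has to be inspected in the data itself.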

You will hand in either a pdf file containing everything, or an HTML file and the .Rmd file used to produce it. You should also hand in “articlesNew.csv”, which contains just the improved versions of the records you have altered. It should have the same number of columns (and the same headings) as articles.csv.

1. (20 marks) Write code to accomplish the task using the basic text functions we have learned about. You should explain and demonstrate the code; you will lose marks if you do not explain AND demonstrate. The code must work and be understandable, but does not need to have undergone “refactoring.”

2. (10 marks) Assuming you will need to perform this task for other files in the future, make a list of things about your code that you would address during a “refactoring” process. You do not need to make these changes, just describe the sort of things you would do. Significantly poor R code that is not mentioned in this section will lose marks.

3. (10 marks) How does the intended deduplication process inform the choices you have made? Explain. (Another way to think about this question: would you do anything differently if you were extracting this information for a different purpose, e.g. to appear in a bibliography?)
