Assignment 3: Deduplication
Beatrix Jones 9/19/2022
Consider the file “articles.csv”, which contains information gathered for the purpose of conducting a systematic review. A systematic review aims to capture all articles about a given topic by conducting a specific search in several databases and then pooling the results. As most article databases (e.g. Scopus, Web of Science) have significant overlap, an important step is to “deduplicate” the pooled results. Because the goal is to capture all articles, it is important that articles that are not duplicates are not removed. This is complicated by the fact that there are frequently formatting differences between the records (capitalization, punctuation, abbreviations, etc.). The following criteria have been agreed on by the people conducting the review: two articles will be considered identical if they have the same title (disregarding differences in spacing, capitalization, and punctuation), the same year of publication, and the same first 6 characters of the first author’s name (again after stripping capitalization, spacing, and punctuation).
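To make the agreed criteria concrete, the following is a minimal sketch of how a comparison key could be built in base R. The function names `normalize_text` and `dedup_key` are hypothetical and not part of the assignment, and the sketch assumes ASCII text (accented letters would be dropped by this pattern and would need separate handling).

```r
# Normalize a string for comparison: lower-case it, then drop every
# character that is not a letter or digit (spaces and punctuation included).
# Assumption: ASCII text only.
normalize_text <- function(x) {
  gsub("[^a-z0-9]", "", tolower(x))
}

# Build a comparison key from the three agreed criteria: normalized title,
# year, and the first six characters of the normalized first-author name.
dedup_key <- function(title, year, first_author) {
  paste(normalize_text(title),
        year,
        substr(normalize_text(first_author), 1, 6),
        sep = "|")
}

# Two records that differ only in spacing, capitalization, and punctuation
a <- dedup_key("Deep Learning: A Review", 2019, "O'Connor, J.")
b <- dedup_key("deep  learning - a review", 2019, "OConnor J")

# Same title and author, different year
d <- dedup_key("Deep Learning: A Review", 2020, "O'Connor, J.")

identical(a, b)  # the two records share a key
identical(a, d)  # a different year gives a different key
```

Under these assumptions, two records count as duplicates exactly when their keys are equal, so a record with a missing title cannot be compared until the title has been filled in.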
We will not do the full deduplication in this assignment. There is a subproblem that must be solved first, although the context above remains important. A subset of articles each have a line in articles.csv but have very little information filled in. In particular, they have no title, which makes the deduplication strategy hard to apply. However, much of the information we might want for our strategy is embedded in the “url” field. Your task is to find the articles with no title and repopulate any missing fields needed for the deduplication strategy using the url information. Note that because the result will be used for deduplication, retaining things like punctuation is not important.
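As an illustration of the general shape of this subproblem, here is a rough sketch on a toy data frame. Everything specific in it is an assumption: the column names `title`, `year`, and `author`, the `key=value` URL layout, and the keys `atitle`, `date`, and `aulast` are invented for the example, and `url_field` is a hypothetical helper. The urls in articles.csv may be structured quite differently, so inspect them before adapting anything here.

```r
# Toy stand-in for articles.csv. The column names and the URL layout
# are assumptions made for illustration only.
articles <- data.frame(
  title  = c("A Complete Record", NA, ""),
  year   = c(2018, NA, NA),
  author = c("Smith", NA, NA),
  url    = c("https://example.org/a",
             "https://example.org/find?atitle=Sleep%20and%20Diet&date=2020&aulast=Nguyen",
             "https://example.org/find?atitle=Gut%20Health&date=2017&aulast=Lee"),
  stringsAsFactors = FALSE
)

# Rows whose title is missing (NA) or blank
no_title <- is.na(articles$title) | trimws(articles$title) == ""

# Hypothetical helper: return the decoded value of one "key=value" pair
# in each URL, or NA when the key is absent.
url_field <- function(url, key) {
  pattern <- paste0("[?&]", key, "=([^&]*)")
  hits <- regmatches(url, regexec(pattern, url))
  unname(vapply(hits, function(hit) {
    if (length(hit) == 2) utils::URLdecode(hit[2]) else NA_character_
  }, character(1)))
}

# Repopulate the fields the deduplication criteria need
articles$title[no_title]  <- url_field(articles$url[no_title], "atitle")
articles$year[no_title]   <- as.numeric(url_field(articles$url[no_title], "date"))
articles$author[no_title] <- url_field(articles$url[no_title], "aulast")

# Only the altered records, with the original columns and headings
articles_new <- articles[no_title, ]
```

Subsetting with `no_title` at the end keeps only the altered records while preserving the original columns and headings.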
You will hand in either a pdf file containing everything, or an HTML file and the .Rmd file used to produce it. You should also hand in “articlesNew.csv” which has just the improved version of the records you have altered. It should have the same number of columns (and same headings) as articles.csv.
1. (20 marks) Write code to accomplish the task using the basic text functions we have learned about. You should explain and demonstrate the code; you will lose marks if you do not explain AND demonstrate. The code must work and be understandable, but does not need to have undergone “refactoring.”
2. (10 marks) Assuming you will need to perform this task for other files in the future, make a list of things about your code that you would address during a “refactoring” process. You do not need to make these changes, just describe the sort of things you would do. Significantly poor R code that is not mentioned in this section will lose marks.
3. (10 marks) How does the intended deduplication process inform the choices you have made? Explain. (Another way to think of this question: would you do anything differently if you were extracting this information for a different purpose, e.g. to appear in a bibliography?)