STAT 7008 - Assignment 3
Note: A3 is worth 20% of the overall assessment. The 100 points in A3 will be rescaled to 20% of the final score.
Web Scraping
1. (25 points) Crawl information from https://www.sciencedirect.com
- (1) (13 points) Crawl key information about all articles published in 2022 from the website https://www.sciencedirect.com/journal/journal-of-econometrics/issues, including year, volume, article content, title, authors and pages. Crawl only volumes 226 to 230.
- (2) (6 points) Remove “\xa0” from volume_name and store the crawled data in a pandas DataFrame.
- (3) (6 points) Filter out the records whose author value is null, then find the 10 authors who published the most articles.
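As a rough illustration of parts (2) and (3), the sketch below cleans and aggregates a small hand-made DataFrame. The sample rows and the column names other than volume_name (title, authors) are assumptions made for illustration, not the real crawl output, and storing authors as a list per article is only one possible layout.

```python
import pandas as pd

# Hypothetical rows standing in for crawled records; real values come from the crawl.
records = [
    {"volume_name": "Volume\xa0226, Issue\xa01", "title": "Paper A",
     "authors": ["Author X", "Author Y"]},
    {"volume_name": "Volume\xa0226, Issue\xa02", "title": "Paper B",
     "authors": ["Author X"]},
    {"volume_name": "Volume\xa0227, Issue\xa01", "title": "Editorial Board",
     "authors": None},
]

df = pd.DataFrame(records)

# Replace each non-breaking space with an ordinary space.
# Pass "" as the replacement instead to delete the character outright.
df["volume_name"] = df["volume_name"].str.replace("\xa0", " ", regex=False)

# Drop records that have no author (assumed to be stored as None/NaN).
with_authors = df.dropna(subset=["authors"])

# One row per (article, author) pair, then count articles per author.
top_authors = with_authors.explode("authors")["authors"].value_counts().head(10)

print(df["volume_name"].tolist())
print(top_authors)
```

If the crawl stores authors as a single comma-separated string instead of a list, the column would need to be split (for example with str.split) before explode is applied.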
Hint:
- Click the button for the target item
- Pass the HTML to BeautifulSoup and extract all links
- Use requests to get the article content, title, authors and pages for each block
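The second hint can be sketched as follows, under stated assumptions. The sample HTML, the "/vol/" pattern used to recognise issue links, and the helper name extract_issue_links are all hypothetical; the real page markup should be inspected in the browser before relying on any of them.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.sciencedirect.com"

# Hypothetical HTML standing in for the page source obtained after clicking
# the target button; the real markup may differ.
sample_html = """
<div>
  <a href="/journal/journal-of-econometrics/vol/226/issue/1">Volume 226, Issue 1</a>
  <a href="/journal/journal-of-econometrics/vol/230/issue/2">Volume 230, Issue 2</a>
  <a href="/about">About</a>
</div>
"""


def extract_issue_links(html, base_url=BASE_URL):
    """Return absolute URLs of anchors whose href looks like an issue link."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Assumption: issue pages contain "/vol/" in their path.
        if "/vol/" in href:
            links.append(urljoin(base_url, href))
    return links


issue_links = extract_issue_links(sample_html)
for link in issue_links:
    print(link)
```

Each returned URL could then be fetched with requests.get(url, timeout=...) and the response text parsed the same way to pull out the title, authors and pages. The site may reject requests that lack browser-like headers, so that step is left out of the sketch.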