STAT 7008 - Assignment 3
Note: A3 is 20% of the overall assessment. The 100 points in A3 will be rescaled to 20% in the final score.
Web Scraping
1. (25 points) Crawl information from
- (1) (13 points) Crawl the key information for all articles published in 2022 on the website: year, volume, article content, title, authors and pages. Restrict the crawl to volumes 226 to 230.
- (2) (6 points) Remove “\xa0” from volume_name and store the crawled data in a pandas DataFrame.
- (3) (6 points) Filter out records whose author is null, then find the 10 authors who published the most articles.
- Click the button for the target item
- Pass the html to BeautifulSoup and get all links
- Use requests to get article content, title, authors and pages for each block
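The link-extraction hint and the pandas steps in (2) and (3) can be sketched on toy data as follows. This is a minimal illustration, not a solution for the actual website: the HTML snippet, the column names (volume_name, title, author), the one-row-per-author layout and the helper names extract_links, clean_volume_names and top_authors are all assumptions made for the example. No network request is made; in practice the HTML would come from the page source or from requests.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical listing-page HTML; the real site's markup is not given here.
SAMPLE_HTML = """
<div class="issue">
  <a href="/volume/226/article-1">Article one</a>
  <a href="/volume/226/article-2">Article two</a>
  <a>No href here</a>
</div>
"""


def extract_links(html):
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def clean_volume_names(df):
    """Return a copy with non-breaking spaces removed from volume_name."""
    out = df.copy()
    out["volume_name"] = out["volume_name"].str.replace("\xa0", "", regex=False)
    return out


def top_authors(df, n=10):
    """Drop rows with a missing author, then count articles per author."""
    return df.dropna(subset=["author"])["author"].value_counts().head(n)


# Made-up records, assuming one row per (article, author) pair.
records = [
    {"volume_name": "Volume 226\xa0", "title": "Paper A", "author": "Chan"},
    {"volume_name": "Volume 226\xa0", "title": "Paper B", "author": "Chan"},
    {"volume_name": "Volume 227\xa0", "title": "Paper C", "author": "Lee"},
    {"volume_name": "Volume 227\xa0", "title": "Paper D", "author": None},
]

links = extract_links(SAMPLE_HTML)
articles = clean_volume_names(pd.DataFrame(records))
ranking = top_authors(articles)

print(links)
print(ranking)
```

If the scraped data stores several authors in one cell (for example as a list), the author column would first need to be split into one row per author, for instance with DataFrame.explode, before counting.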