STAT 7008 - Assignment 3: Web Scraping

Note: A3 is 20% of the overall assessment. The 100 points in A3 will be rescaled to 20% in the final score.

Web Scraping

1. (25 points) Crawl information from https://www.sciencedirect.com

  (1) (13 points) Crawl key information about all articles published in 2022 from https://www.sciencedirect.com/journal/journal-of-econometrics/issues, including year, volume, article content, title, authors and pages. Crawl volumes 226 to 230 only.
  (2) (6 points) Remove “\xa0” from volume_name and store the crawled data in a pandas DataFrame.
  (3) (6 points) Filter out the records whose author is null, then find the top 10 authors who published the most articles.
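
As a rough illustration of parts (2) and (3), the sketch below cleans and summarises a few made-up records. Only the column name volume_name comes from the task; the records themselves, the other column names, and the choice to store authors as a list per article are assumptions for the example, not the required layout. The “\xa0” (non-breaking space) is replaced with an ordinary space here; passing an empty string instead would drop it outright.

```python
import pandas as pd

# Hypothetical records standing in for the crawled output (not real data).
records = [
    {"volume_name": "Volume\xa0226, Issue\xa01", "title": "Paper A",
     "authors": ["Author X", "Author Y"]},
    {"volume_name": "Volume\xa0226, Issue\xa02", "title": "Paper B",
     "authors": ["Author X"]},
    {"volume_name": "Volume\xa0227, Issue\xa01", "title": "Editorial Board",
     "authors": None},  # some entries have no author at all
]

df = pd.DataFrame(records)

# Part (2): replace the non-breaking space in volume_name.
df["volume_name"] = df["volume_name"].str.replace("\xa0", " ", regex=False)

# Part (3): drop rows without authors, then count articles per author.
with_authors = df.dropna(subset=["authors"])
top_authors = (
    with_authors.explode("authors")["authors"]  # one row per (article, author)
    .value_counts()                             # sorted by count, descending
    .head(10)
)
print(top_authors)
```

Exploding the list column before counting means an article with several authors is counted once for each of them, which is one reasonable reading of "published the most articles".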

Hint:

  1. Click the button of the target item
  2. Pass the HTML to BeautifulSoup and get all links
  3. Use requests to get the article content, title, authors and pages for each block
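
Step 2 of the hint might look roughly like the sketch below, which parses an HTML string and collects absolute links. The sample markup, the helper name extract_issue_links, and the "/vol/" filter are all hypothetical; the real page structure on sciencedirect.com has to be inspected in the browser and may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.sciencedirect.com"

# Hypothetical HTML fragment; the real page markup is an unknown here.
sample_html = """
<ul>
  <li><a href="/journal/journal-of-econometrics/vol/226/issue/1">Volume 226, Issue 1</a></li>
  <li><a href="/journal/journal-of-econometrics/vol/226/issue/2">Volume 226, Issue 2</a></li>
  <li><a href="/journal/journal-of-econometrics/about">About</a></li>
</ul>
"""


def extract_issue_links(html, base_url=BASE_URL):
    """Return absolute URLs of anchors whose href looks like an issue page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Assumed pattern: issue pages contain "/vol/" in their path.
        if "/vol/" in href:
            links.append(urljoin(base_url, href))
    return links


issue_links = extract_issue_links(sample_html)
print(issue_links)
```

Each returned URL could then be fetched with requests.get and parsed the same way to pull out the title, authors and pages for every article block.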

