CS109A Introduction to Data Science - Homework 1: Web Scraping, Data Parsing, and EDA

Instructions

  • To submit your assignment, follow the instructions given in Canvas.
  • Exercise responsible scraping. Web servers can become slow or unresponsive if they receive too many requests from the same source in a short amount of time. Use a delay of 2 seconds between requests in your code. 
  • Web scraping requests can take several minutes. This is another reason why you should not wait until the last minute to do this homework.
  • Plots should be legible and interpretable without having to refer to the code that generated them.
  • When asked to interpret a visualization, do not simply describe it (e.g., "the curve has a steep slope up"), but instead explain what you think the plot means.
  • The use of 'hard-coded' values to try and pass tests rather than solving problems programmatically will not receive credit.
  • The use of extremely inefficient or error-prone code (e.g., copy-pasting nearly identical commands rather than looping) may result in only partial credit.
  • Enable scrolling output on cells with very long output.
  • Feel free to add additional code or markdown cells.
  • Ensure your code runs top to bottom without error and passes all tests by restarting the kernel and running all cells. This is how the notebook will be evaluated (this can take a few minutes).
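
The 2-second delay asked for above can be wrapped in a small helper. The following is only a sketch: the name fetch_politely and its fetch and sleep parameters are hypothetical, introduced here so the pause logic can be shown without generating any network traffic.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_politely(urls: Iterable[str],
                   fetch: Callable[[str], object],
                   delay: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> Dict[str, object]:
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        results[url] = fetch(url)
    return results


# Demonstration with a stand-in fetcher and a recording "sleep",
# so nothing is downloaded and nothing actually waits.
pauses = []
demo = fetch_politely(["page-a", "page-b", "page-c"],
                      fetch=lambda url: url.upper(),
                      sleep=pauses.append)
print(demo)    # {'page-a': 'PAGE-A', 'page-b': 'PAGE-B', 'page-c': 'PAGE-C'}
print(pauses)  # [2.0, 2.0]
```

In the notebook, fetch could be requests.get and sleep could be left at its default, so that each star page is requested at least 2 seconds after the previous one.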
Overview

In this assignment you'll practice scraping, parsing, and analyzing HTML data pulled from the web.

Specifically, you'll extract information about each person on IMDb's list of "Top 100 Stars for 2021" (https://www.imdb.com/list/ls577894422/), perform some EDA, ask some questions of the data, and interpret your findings.

For example, we might like to know:

  • What is the relationship between when a performer started their career and their total number of acting credits?
  • How many stars started as child actors?
  • How does the distribution of ages compare across genders?
  • Who is the most prolific actress or actor in IMDb's list of the Top 100 Stars for 2021?

These questions and more are addressed in detail below.

Part 1 - Scraping and Parsing

Q1 - Scrape Top Stars List

Scrape the HTML from the webpage of the "Top 100 Stars for 2021" (https://www.imdb.com/list/ls577894422/) into a requests object and name it my_page.

Points: 2.5

In [ ]:
my_page = ...
In [ ]:
grader.check("q1")
Q2 - Making BeautifulSoup

Create a Beautiful Soup object named star_soup from the HTML content in my_page.

Points: 2.5

In [ ]:
star_soup = ...
In [ ]:
# check your code - you should see familiar HTML
print(star_soup.prettify()[:1000])
In [ ]:
grader.check("q2")
Q3 - Parse Stars

Write a function called parse_stars that accepts star_soup as its input and returns a list of dictionaries to be saved in a variable called star_list (see the function definition below for details).

IMDb star pages do not have a 'sex' or 'gender' field. Some roles are gender neutral (e.g., "writer") and relying on actor/actress distinctions will also give results inconsistent with the more detailed data available on the site. You should infer gender based on the frequency of the personal pronouns used in each star's truncated bio that appears on the main "Top 100 Stars for 2021" page.

You may find a data structure like this useful:

pronouns = {'woman': ['she','her'],
            'man': ['he', 'him', 'his'],
            'non-binary': ['they', 'them', 'their']}

Simply count the occurrences of the different pronouns in the bio and make the classification based on the grouping with the majority count.

Hint: Throughout this assignment you will likely find it useful to create small 'helper' functions which are then used by your larger functions like parse_stars.
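
A helper of the kind the hint describes might look like the sketch below. The function name infer_gender and the sample bio are made up for illustration; the tokenizing regular expression and the tie-breaking behavior (the first group in the dictionary wins when counts are equal, including when no pronoun appears at all) are assumptions, not part of the assignment.

```python
import re
from collections import Counter

pronouns = {'woman': ['she', 'her'],
            'man': ['he', 'him', 'his'],
            'non-binary': ['they', 'them', 'their']}


def infer_gender(bio: str) -> str:
    """Return the pronoun group with the highest count in `bio`."""
    # Split into lowercase words so 'he' is not matched inside 'she' or 'the'.
    words = Counter(re.findall(r"[a-z]+", bio.lower()))
    totals = {group: sum(words[p] for p in plist)
              for group, plist in pronouns.items()}
    return max(totals, key=totals.get)


bio = "She made her debut in 2011. Her sister said they were proud of her."
print(infer_gender(bio))  # woman
```

Counting whole words rather than substrings matters here: a plain substring count of "he" would also match every "she", "the", and "they" in the bio.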

Function
--------
parse_stars

Input
------
star_soup: the soup object with the scraped page

Returns
-------
a list of dictionaries; each dictionary corresponds to a star profile and has the following data:

    name: (str) the name of the star
    role: (str) role in film designated on top 100 page (e.g., 'actress', 'writer')
    gender: (str) 'man', 'woman', or 'non-binary' based on pronoun counts in top 100 page bio
    url: (str) the url of the link under star's name that leads to a page with more details
    page: (bs4.BeautifulSoup) BS object acquired by scraping and parsing the above 'url' page

Example:
--------
{'name': 'Elizabeth Olsen',
 'role': 'actress',
 'gender': 'woman',
 'url': 'https://www.imdb.com/name/nm0647634',
 'page': <!DOCTYPE html>
 <html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
 <meta charset="utf-8"/>
 <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
 <script>
 ...
}

Points: 25

In [ ]:
def parse_stars(star_soup) -> list:
    ...

star_list = ...
In [ ]:
# check your code
# this list is large because of the HTML stored in the `page` field
# to get a better picture, print only the first element
star_list[0]
In [ ]:
grader.check("q3")
Q4 - Create Star Table

Write a function called create_star_table, which takes star_list as an input and returns a new list of dictionaries, star_table, which includes more extensive information about each star extracted from their page.

See the function definition below for more details.

Note: The years of some credits are ranges (e.g., 2001-2002). You should use only the starting year.

Hint: Carefully note the ordering, case, and data type of the values in each dictionary.
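
One way to handle year ranges is to take the first four-digit number in the text. This is a sketch under the assumption that the year text has already been extracted from the page as a string; the helper name starting_year is hypothetical.

```python
import re
from typing import Optional


def starting_year(year_text: str) -> Optional[int]:
    """Return the first four-digit year found in `year_text`, or None."""
    match = re.search(r"\d{4}", year_text)
    return int(match.group()) if match else None


print(starting_year("2001-2002"))  # 2001
print(starting_year(" 1994 "))     # 1994
print(starting_year("2019–2021"))  # 2019
print(starting_year(""))           # None
```

Because the pattern ignores whatever separates the two years, it works for both a hyphen and an en dash, and it returns an int as the required data type.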

Function
--------
create_star_table

Input
------
star_list (list of dictionaries)

Returns
-------

a list of dictionaries; each dictionary corresponds to a star profile and has the following data:

    name: (str) the name of the star
    role: (str) 'actor', 'actress', 'writer', etc. (note the case)
    gender: (str) 'woman', 'man', or 'non-binary' (based on pronouns in bio)  
    year_born : (int) year star was born (some pages do not have a full date so we'll just use year)
    first_credit: (str) title of their first credit in their capacity designated by 'role'
    year_first_credit: (int) the year they made their first movie or TV show
    num_credits: (int) number of movies or TV shows they have made over their career in their capacity designated by 'role'

--------
Example:

{'name': 'Elizabeth Olsen',
  'role': 'actress',
  'gender': 'woman',
  'year_born': 1989,
  'first_credit': 'How the West Was Fun',
  'year_first_credit': 1994,
  'num_credits': 27}

Points: 25

In [ ]:
def create_star_table(star_list: list) -> list:
    ...
In [ ]:
star_table = ...
In [ ]:
# check your code
star_table
In [ ]:
grader.check("q4")

Saving and Restoring Our List of Dictionaries

It's good practice to save your data structure to disk once you've done all of your scraping. This way you can often avoid repeating all the HTTP requests, which can be slow (and taxing on servers!).

We had to wait until this stage to save our data structure because the bs4.BeautifulSoup objects in our original star_list (the page values) cannot be easily serialized.
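
The serialization problem can be seen with a stand-in object. In this sketch the FakeSoup class is a hypothetical placeholder for a bs4.BeautifulSoup value; the json module rejects any object it does not know how to encode, and dropping that value leaves a dictionary it can handle.

```python
import json


class FakeSoup:
    """Hypothetical stand-in for a bs4.BeautifulSoup object."""


star = {'name': 'Elizabeth Olsen', 'role': 'actress', 'page': FakeSoup()}

try:
    json.dumps(star)
except TypeError as err:
    print("cannot serialize:", err)

# Without the 'page' value only plain strings remain, which json can encode.
serializable = {key: value for key, value in star.items() if key != 'page'}
restored = json.loads(json.dumps(serializable))
print(restored == serializable)  # True
```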

The code provided below will save the data structure to a JSON file named starinfo.json in the data subdirectory.

In [ ]:
import json

# write the list of dictionaries to disk (the data/ directory must already exist)
with open("data/starinfo.json", "w") as f:
    json.dump(star_table, f)

To confirm this worked as intended, open the JSON file and load its contents into a variable for viewing.

In [ ]:
with open("data/starinfo.json", "r") as f:
    star_table = json.load(f)
    
# output should be the same
star_table

This method of saving and restoring data structures will likely be useful to you in the future!

Part 2 - Pandas and EDA

Q5 - Creating a DataFrame

For the sake of consistency, we've provided our own JSON file, data/starinfo_2021_staff.json, which you should use for the rest of the notebook. Load the contents of this JSON file and use it to create a Pandas DataFrame called frame.

Hint: Remember, the data structure in the JSON file is a list of dictionaries.

Points: 2.5

In [ ]:
frame = ...
In [ ]:
# Take a peek
frame.head(20)
In [ ]:
grader.check("q5")
Q6 - Creating a New Column

Add a new column to your dataframe with the approximate age of each star at the time of their first credit. Name this new column age_at_first_credit.

NOTE: We say approximate age because we've only collected the year of birth as several star pages do not include a full birth date. The approximate age of a star in a given year should be the difference between that year and the star's birth year.
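
The year-difference definition in the note can be illustrated on a toy table. This is a sketch only: the records, the variable name toy, and the column name approx_age are made up for illustration and are not the staff data or the required column name.

```python
import pandas as pd

# Toy records in the same shape as the star table (values are invented).
records = [
    {'name': 'Star A', 'year_born': 1990, 'year_first_credit': 1998},
    {'name': 'Star B', 'year_born': 1975, 'year_first_credit': 2003},
]

toy = pd.DataFrame(records)  # one row per dictionary, one column per key
# Column arithmetic is element-wise, so no explicit loop is needed.
toy['approx_age'] = toy['year_first_credit'] - toy['year_born']
print(toy)
```

Since only years are compared, the result can be off by one relative to the true age, which is why the note calls it approximate.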

Points: 2.5

In [ ]:
 
In [ ]:
# You should visually inspect some of your results
frame.head()
Q7 - Subsetting and Sorting

In this section you'll subset and sort the DataFrame to answer a pair of questions:

Q7.1 - Child Stars

Which stars received their first credit before the age of 11?

Store the resulting dataframe as child_stars, sorted by age_at_first_credit in ascending order.
Store the number of such "child stars" in num_child_stars.

Points: 2.5

In [ ]:
child_stars = ...
num_child_stars = ...

print("{} stars received their first credit before the age of 11.".format(num_child_stars))
display(child_stars)
In [ ]:
grader.check("q7.1")
Q7.2 - Late Bloomers

Which stars received their first credit at 26 years old or older?

Store the resulting dataframe as late_bloomers, sorted by age_at_first_credit in descending order.
Store the number of such "late bloomers" in num_late_bloomers.

Points: 2.5

In [ ]:
late_bloomers = ...
num_late_bloomers = ...

print("{} stars received their first credit at 26 or older.".format(num_late_bloomers))
display(late_bloomers)
In [ ]:
grader.check("q7.2")
Q8 - Visualization 

In this section you'll use your Python visualization skills to further explore the data:

Q8.1 - Exploring Trends

Create 2 scatter plots: one showing the relationship between age at first movie and number of credits, the other between year born and number of credits.

What can you say about these relationships? Are there any apparent outliers? Please limit your written responses to 4 sentences or fewer.

Points: 5

Type your answer here, replacing this text.

In [ ]:
# your code here
Q8.2 - Age Distributions

Let's look at the distribution of movie and TV performers' ages by gender.

Create two plots, each plot consisting of two overlaid histograms comparing the distribution of men's current ages to women's current ages.

In the first plot, the distributions should be normalized to show the proportion of each gender at each age.

The second plot should show the counts of each gender at each age.

Interpret the resulting plots. (4 sentences or fewer)

NOTE 1: Again, we are dealing with approximate ages as defined above.

NOTE 2: You should exclude those whose role is not 'actor' or 'actress' from your analysis.
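
The difference between the count view and the normalized view can be sketched numerically before any plotting. The ages, group labels, and bin edges below are invented for illustration; the sketch only shows that dividing each group's bin counts by that group's total puts groups of different sizes on the same scale.

```python
import numpy as np

# Made-up ages for two groups of different sizes.
group_a = np.array([25, 27, 31, 34, 38, 41, 45, 52])
group_b = np.array([22, 29, 33, 36])

bins = np.arange(20, 61, 10)  # bin edges 20, 30, 40, 50, 60

results = {}
for label, ages in [('a', group_a), ('b', group_b)]:
    counts, _ = np.histogram(ages, bins=bins)
    proportions = counts / counts.sum()  # each group's bars now sum to 1
    results[label] = (counts, proportions)
    print(label, counts, proportions)
```

In the count view group a dominates simply because it is larger; in the proportion view the two shapes can be compared directly. When plotting, the same shared bin edges should be passed to both histograms so the overlaid bars line up.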

Points: 10

Type your answer here, replacing this text.

