CS109A Introduction to Data Science
Homework 1: Web Scraping, Data Parsing, and EDA
Instructions
- To submit your assignment follow the instructions given in Canvas.
- Exercise responsible scraping. Web servers can become slow or unresponsive if they receive too many requests from the same source in a short amount of time. Use a delay of 2 seconds between requests in your code.
- Web scraping requests can take several minutes. This is another reason why you should not wait until the last minute to do this homework.
- Plots should be legible and interpretable without having to refer to the code that generated them.
- When asked to interpret a visualization, do not simply describe it (e.g., "the curve has a steep slope up"), but instead explain what you think the plot means.
- The use of 'hard-coded' values to try and pass tests rather than solving problems programmatically will not receive credit.
- The use of extremely inefficient or error-prone code (e.g., copy-pasting nearly identical commands rather than looping) may result in only partial credit.
- Enable scrolling output on cells with very long output.
- Feel free to add additional code or markdown cells.
- Ensure your code runs top to bottom without error and passes all tests by restarting the kernel and running all cells. This is how the notebook will be evaluated (this can take a few minutes).
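As a rough illustration of the 2-second delay rule above, the sketch below pauses between consecutive requests. The names fetch_all, fetch, and delay_seconds are hypothetical, not part of the assignment; the fetch argument is any callable that retrieves one URL, which keeps the sketch independent of a particular HTTP library.

```python
import time
from typing import Callable, Iterable, List


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], str],
              delay_seconds: float = 2.0) -> List[str]:
    """Call `fetch` on each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request after the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With the requests library, fetch could be something like lambda url: requests.get(url).text, though how you structure your own scraping code is up to you.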
Overview
In this assignment you'll practice scraping, parsing, and analyzing HTML data pulled from the web.
Specifically, you'll extract information about each person on IMDb's list of "Top 100 Stars for 2021" (https://www.imdb.com/list/ls577894422/), perform some EDA, ask some questions of the data, and interpret your findings.
For example, we might like to know:
- What is the relationship between when a performer started their career and their total number of acting credits?
- How many stars started as child actors?
- How does the distribution of ages compare across genders?
- Who is the most prolific actress or actor in IMDb's list of the Top 100 Stars for 2021?
These questions and more are addressed in detail below.
Part 1 - Scraping and Parsing
Q1 - Scrape Top Stars List
Scrape the HTML from the webpage of the "Top 100 Stars for 2021" (https://www.imdb.com/list/ls577894422/) into a requests object and name it my_page.
Points: 2.5
In [ ]:my_page = ...
In [ ]:grader.check("q1")
Q2 - Making BeautifulSoup
Create a Beautiful Soup object named star_soup from the HTML content in my_page.
Points: 2.5
In [ ]:star_soup = ...
In [ ]:# check your code - you should see familiar HTML
print(star_soup.prettify()[:1000])
In [ ]:grader.check("q2")
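For readers new to Beautiful Soup, here is a minimal sketch of parsing a small HTML string and pulling out link text and targets. The markup, class names, and URLs are made up for illustration and are not IMDb's actual page structure, which you will need to inspect yourself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
html = """
<html><body>
  <div class="item"><h3><a href="/name/example-1">Example Person</a></h3></div>
  <div class="item"><h3><a href="/name/example-2">Another Person</a></h3></div>
</body></html>
"""

# "html.parser" is the parser bundled with the Python standard library.
soup = BeautifulSoup(html, "html.parser")

# Collect (link text, href) pairs for every anchor tag in the document.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]
print(links)
```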
Q3 - Parse Stars
Write a function called parse_stars that accepts star_soup as its input and returns a list of dictionaries to be saved in a variable called star_list (see the function definition below for details).
IMDb star pages do not have a 'sex' or 'gender' field. Some roles are gender neutral (e.g., "writer"), and relying on actor/actress distinctions will also give results inconsistent with the more detailed data available on the site. You should infer gender based on the frequency of the personal pronouns used in each star's truncated bio that appears on the main "Top 100 Stars for 2021" page.
You may find a data structure like this useful:
pronouns = {'woman': ['she', 'her'],
            'man': ['he', 'him', 'his'],
            'non-binary': ['they', 'them', 'their']}
Simply count the occurrences of the different pronouns in the bio and classify each star according to the group with the highest count.
Hint: Throughout this assignment you will likely find it useful to create small 'helper' functions which are then used by your larger functions like parse_stars.
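One possible helper of the kind the hint describes is a pronoun counter. The sketch below is only one way to do it: the function name infer_gender is hypothetical, and the tie-breaking behavior is an assumption, since the assignment does not say what to do when counts are equal.

```python
import re
from collections import Counter

pronouns = {'woman': ['she', 'her'],
            'man': ['he', 'him', 'his'],
            'non-binary': ['they', 'them', 'their']}


def infer_gender(bio: str) -> str:
    """Return the pronoun group with the highest count in `bio`."""
    # Lowercase and split into whole words so 'he' does not match inside 'the'.
    words = Counter(re.findall(r"[a-z]+", bio.lower()))
    totals = {group: sum(words[p] for p in plist)
              for group, plist in pronouns.items()}
    # Ties (including all-zero counts) resolve to the first group in dict
    # order. This tie-break is an assumption, not part of the assignment.
    return max(totals, key=totals.get)
```

Matching whole words matters here: a naive substring count would find 'he' inside 'the', 'she', and 'their' and skew the totals.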
Function
--------
parse_stars
Input
------
star_soup: the soup object with the scraped page
Returns
-------
a list of dictionaries; each dictionary corresponds to a star profile and has the following data:
    name: (str) the name of the star
    role: (str) role in film designated on top 100 page (e.g., 'actress', 'writer')
    gender: (str) 'man', 'woman', or 'non-binary' based on pronoun counts in top 100 page bio
    url: (str) the url of the link under star's name that leads to a page with more details
    page: (bs4.BeautifulSoup) BS object acquired by scraping and parsing the above 'url' page
Example:
--------
{'name': 'Elizabeth Olsen',
'role': 'actress',
'gender': 'woman',
'url': 'https://www.imdb.com/name/nm0647634',
'page': <!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
...
}
Points: 25
In [ ]:def parse_stars(star_soup) -> list:
    ...
star_list = ...
In [ ]:# check your code
# this list is large because of the html code in the `page` field
# to get a better picture, print only the first element
star_list[0]
In [ ]:grader.check("q3")
Q4 - Create Star Table
Write a function called create_star_table, which takes star_list as an input and returns a new list of dictionaries, star_table, which includes more extensive information about each star extracted from their page.
See the function definition below for more details.
Note: The years of some credits are ranges (e.g., 2001-2002). You should use only the starting year.
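To make the note about year ranges concrete, the sketch below takes the first four-digit run in a credit's year string. The function name starting_year is hypothetical, and the exact text format on the scraped pages is an assumption you should verify against the real HTML.

```python
import re


def starting_year(year_text: str) -> int:
    """Return the first four-digit year found in a credit's year string."""
    match = re.search(r"\d{4}", year_text)
    if match is None:
        raise ValueError(f"no four-digit year in {year_text!r}")
    return int(match.group())
```

Searching for the first four digits handles a single year, a hyphenated range, and a range written with an en dash in the same way.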
Hint: Carefully note the ordering, case, and data type of the values in each dictionary.
Function
--------
create_star_table
Input
------
star_list (list of dictionaries)
Returns
-------
a list of dictionaries; each dictionary corresponds to a star profile and has the following data:
    name: (str) the name of the star
    role: (str) 'actor', 'actress', 'writer', etc. (note the case)
    gender: (str) 'woman', 'man', or 'non-binary' (based on pronouns in bio)
    year_born: (int) year star was born (some pages do not have a full date, so we'll just use the year)
    first_credit: (str) title of their first credit in their capacity designated by 'role'
    year_first_credit: (int) the year they made their first movie or TV show
    num_credits: (int) number of movies or TV shows they have made over their career in their capacity designated by 'role'
Example:
--------
{'name': 'Elizabeth Olsen',
'role': 'actress',
'gender': 'woman',
'year_born': 1989,
'first_credit': 'How the West Was Fun',
'year_first_credit': 1994,
'num_credits': 27}
Points: 25
In [ ]:def create_star_table(star_list: list) -> list:
    ...
In [ ]:star_table = ...
In [ ]:# check your code
star_table
In [ ]:grader.check("q4")
Saving and Restoring Our List of Dictionaries
It's good practice to save your data structure to disk once you've done all of your scraping. This way you can often avoid repeating all the HTTP requests, which can be slow (and taxing on servers!).
We had to wait until this stage to save our data structure because the bs4.BeautifulSoup objects in our original star_list (the page values) cannot be easily serialized.
The code provided below will save the data structure to a JSON file named starinfo.json in the data subdirectory.
In [ ]:# your code here
import json
with open("data/starinfo.json", "w") as f:
    json.dump(star_table, f)
To confirm this worked as intended, open the JSON file and load its contents into a variable for viewing.
In [ ]:with open("data/starinfo.json", "r") as f:
    star_table = json.load(f)
# output should be the same
star_table
This method of saving and restoring data structures will likely be useful to you in the future!
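The sketch below illustrates both points made in this section, using made-up data in a temporary directory: a list of plain dictionaries survives a JSON round trip unchanged, while an arbitrary Python object is rejected by the json module. The class NotSerializable is a hypothetical stand-in for an object such as a parsed page.

```python
import json
import os
import tempfile

records = [{"name": "Example Person", "year_born": 1990}]  # hypothetical data

# Write the records to a temporary file and read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")
    with open(path, "w") as f:
        json.dump(records, f)
    with open(path, "r") as f:
        restored = json.load(f)

round_trip_ok = restored == records


class NotSerializable:
    """Stand-in for an object json cannot encode (such as a parsed page)."""


# json only knows how to encode basic types, so this raises TypeError.
try:
    json.dumps([{"page": NotSerializable()}])
    raised = False
except TypeError:
    raised = True

print(round_trip_ok, raised)
```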
Part 2 - Pandas and EDA
Q5 - Creating a DataFrame
For the sake of consistency, we've provided our own JSON file, data/starinfo_2021_staff.json, which you should use for the rest of the notebook. Load the contents of this JSON file and use it to create a Pandas DataFrame called frame.
Hint: Remember, the data structure in the JSON file is a list of dictionaries.
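As a small illustration of why the hint matters, a list of dictionaries can be passed directly to the pandas DataFrame constructor, with each dictionary becoming a row and each key a column. The records below are made up and only loosely mirror the fields in the staff file.

```python
import pandas as pd

# Hypothetical records, for illustration only.
records = [
    {"name": "Example A", "year_born": 1980, "num_credits": 40},
    {"name": "Example B", "year_born": 1995, "num_credits": 12},
]

# Each dictionary becomes one row; the keys become the column names.
example_frame = pd.DataFrame(records)
print(example_frame.shape)
```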
Points: 2.5
In [ ]:frame = ...
In [ ]:# Take a peek
frame.head(20)
In [ ]:grader.check("q5")
Q6 - Creating a New Column
Add a new column to your dataframe with the approximate age of each star at the time of their first credit. Name this new column age_at_first_credit.
NOTE: We say approximate age because we've only collected the year of birth as several star pages do not include a full birth date. The approximate age of a star in a given year should be the difference between that year and the star's birth year.
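To see why the year difference is only approximate, the sketch below compares it with an exact age computed from full dates. The dates and function names are hypothetical; the point is that the two can differ by one year when the birthday has not yet occurred in the year in question.

```python
from datetime import date


def exact_age(born: date, on: date) -> int:
    """Age in whole years on the date `on`."""
    had_birthday = (on.month, on.day) >= (born.month, born.day)
    return on.year - born.year - (0 if had_birthday else 1)


def approximate_age(year_born: int, year: int) -> int:
    """Year difference, as defined in the note above."""
    return year - year_born


born = date(1980, 12, 15)       # hypothetical birth date
credit_day = date(1990, 3, 1)   # hypothetical date of a first credit

print(exact_age(born, credit_day))                    # 9
print(approximate_age(born.year, credit_day.year))    # 10
```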
Points: 2.5
In [ ]:
In [ ]:# You should visually inspect some of your results
frame.head()
Q7 - Subsetting and Sorting
In this section you'll subset and sort the DataFrame to answer a pair of questions:
Q7.1 - Child Stars
Which stars received their first credit before the age of 11?
Store the resulting dataframe as child_stars sorted by age_at_first_credit in ascending order.
Store the number of such "child stars" in num_child_stars.
Points: 2.5
In [ ]:child_stars = ...
num_child_stars = ...
print ("{} stars received their first credit before the age of 11.".format(num_child_stars))
display(child_stars)
In [ ]:grader.check("q7.1")
Q7.2 - Late Bloomers
Which stars received their first credit at 26 years old or older?
Store the resulting dataframe as late_bloomers sorted by age_at_first_credit in descending order.
Store the number of such "late bloomers" in num_late_bloomers.
Points: 2.5
In [ ]:late_bloomers = ...
num_late_bloomers = ...
print ("{} stars received their first credit at 26 or older.".format(num_late_bloomers))
display(late_bloomers)
In [ ]:grader.check("q7.2")
Q8 - Visualization
In this section you'll use your Python visualization skills to further explore the data:
Q8.1 - Exploring Trends
Create 2 scatter plots: one showing the relationship between age at first credit and number of credits, the other between year born and number of credits.
What can you say about these relationships? Are there any apparent outliers? Please limit your written responses to 4 sentences or fewer.
Points: 5
Type your answer here, replacing this text.
In [ ]:# your code here
Q8.2 - Age Distributions
Let's look at the distribution of movie and TV performers' ages by gender.
Create two plots, each consisting of two overlaid histograms comparing the distribution of men's current ages to women's current ages.
In the first plot, the distributions should be normalized to show the proportion of each gender at each age.
The second plot should show the counts of each gender at each age.
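The difference between the two plots can be sketched without any plotting at all. In the made-up data below the groups have different sizes, so raw counts are not directly comparable, while proportions (each count divided by the group's total) sum to 1 within each group. The data and the function name counts_and_proportions are hypothetical.

```python
from collections import Counter

# Hypothetical ages; the groups are deliberately different sizes.
ages = {"man": [30, 30, 45, 45, 45, 60], "woman": [30, 45]}


def counts_and_proportions(values):
    """Return (counts per age, proportion of the group at each age)."""
    counts = Counter(values)
    total = len(values)
    proportions = {age: n / total for age, n in counts.items()}
    return dict(counts), proportions


man_counts, man_props = counts_and_proportions(ages["man"])
woman_counts, woman_props = counts_and_proportions(ages["woman"])

print(man_counts[45], woman_counts[45])   # 3 1
print(man_props[45], woman_props[45])     # 0.5 0.5
```

If you plot with matplotlib, one way to get proportions is to pass weights equal to 1/len(values) for each observation to hist; note that density=True instead scales the bars so that their total area is 1, which is not the same thing unless the bins have width 1.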
Interpret the resulting plots. (4 sentences or fewer)
NOTE 1: Again, we are dealing with approximate ages as defined above.
NOTE 2: You should exclude those whose role is not 'actor' or 'actress' from your analysis.
Points: 10
Type your answer here, replacing this text.