HDAG Interview Take Home Assignment - Data Analytics

HDAG Interview Take Home Assignment September 2022

Instructions

This take-home assignment is meant to evaluate your background and fit for a role within HDAG. Your first-round interview and deliverable will be evaluated holistically, so please feel free to begin your deliverable before your first-round interview. The assignment lets candidates demonstrate proficiency in Python, data analysis, and presentation skills. It is open internet, and you are encouraged to use online resources; however, you may not consult anyone else while doing the assignment.

Please note: You are absolutely not expected to have experience in every area or be able to answer every question, especially with regard to the optional sections.

Some of the questions may seem open-ended or ambiguous. This is a close match for the actual problems we work on with our case teams; we use data to extract meaningful and actionable insights. Do your best to come up with a realistic answer!

Duration

This assessment is expected to take around 120-180 minutes in total (depending on which open-ended section you choose to complete).

Submission

You can download your notebook by clicking File > Download. Submit your notebook with final answers and slides via this Airtable link: https://airtable.com/shrLgp96L4gtiaIgb.

## 1. Python Functions

1.1 Write a Python function to square a number. Then use it to print the squares of the first 10 positive integers.

In [ ]:

def square(n):
  ### TODO: Your code for the function
  return ""


## TODO: Your code for printing the first 10 integer squares
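One possible way to approach 1.1 is sketched below. It is only an illustration, and it assumes "the first 10 positive integers" means 1 through 10 inclusive.

```python
def square(n):
    """Return n multiplied by itself."""
    return n * n


# Assumption: the first 10 positive integers are 1 through 10 inclusive.
for i in range(1, 11):
    print(square(i))
```

Multiplying directly (rather than using `n ** 2`) is a stylistic choice; both give the same result for integers and floats.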

1.2 Use a for loop to iterate over a list of words, filtering out words over a certain character length max_length.

In [ ]:

# example list of words
list_of_words = ["take", "home", "data", "analytics", "science", "programming"]

## desired output if max_length = 5
filtered_5 = ["take", "home", "data"]
## desired output if max_length = 9
filtered_9 = ["take", "home", "data", "analytics", "science"]


## TODO: Your code here
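A minimal sketch of the filtering loop follows. The helper name `filter_words` is hypothetical, and the sketch assumes that "over a certain length" means strictly longer than `max_length`, so words of exactly `max_length` characters are kept. That reading is consistent with the desired outputs listed for `max_length = 5` and `max_length = 9`.

```python
def filter_words(words, max_length):
    """Keep the words whose length does not exceed max_length."""
    kept = []
    for word in words:
        # Assumption: a word of exactly max_length characters is kept.
        if len(word) <= max_length:
            kept.append(word)
    return kept


list_of_words = ["take", "home", "data", "analytics", "science", "programming"]
print(filter_words(list_of_words, 5))
print(filter_words(list_of_words, 9))
```

A list comprehension would be shorter, but the prompt asks for an explicit for loop.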

1.3 Write code to generate 1000 samples from a normal distribution and plot a histogram of the distribution. Feel free to use any external libraries (e.g., numpy, matplotlib).

Hint: Click the links to the packages above!
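The sampling and binning behind a histogram can be sketched with the standard library alone. This is a dependency-free illustration, not the expected answer: the mean of 0, standard deviation of 1, seed, bin count, and both helper names are assumptions, since the prompt does not fix them. It prints a text histogram in place of a drawn plot.

```python
import random


def sample_normal(n_samples=1000, mean=0.0, std=1.0, seed=0):
    """Draw n_samples values from a normal distribution (assumed mean and std)."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n_samples)]


def bin_counts(samples, n_bins=20):
    """Split the sample range into equal-width bins and count values per bin."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        # The maximum value would land one past the end, so clamp it into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return lo, width, counts


samples = sample_normal()
lo, width, counts = bin_counts(samples)
for i, count in enumerate(counts):
    left_edge = lo + i * width
    # One '#' per 5 samples keeps the bars readable in a terminal.
    print(f"{left_edge:6.2f} | {'#' * (count // 5)}")
```

With numpy and matplotlib available, the same samples could instead be generated with numpy's random generator and drawn with matplotlib's `hist` function, which performs the binning internally.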

In [ ]:

## TODO: Your code here

1.4 Write a function that takes in a string of words separated by spaces and returns a dictionary with the top n most common words and their frequencies. If word frequencies are tied, return any of the tied words. Assume that case doesn't matter (lowercase and uppercase versions of a word are considered the same word).

In [ ]:

def top_n(string, n):
    # TODO: Your code here
    return ""

In [ ]:

# Example test case 
n = 3
posting = """
Herbal sauna uses the healing properties of herbs in combination with distilled water. 
The water evaporates and distributes the effect of the herbs throughout the room. 
A visit to the herbal sauna can cause real miracles, especially for colds. 
"""

output = {
    'the' :6, 
    'herbal':2, 
    'sauna':2
}
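A word-frequency function along these lines could be sketched with `collections.Counter`. The sketch lowercases the text so that case is ignored, and it also strips surrounding punctuation from each word. The punctuation handling is an assumption the prompt does not state; it is what makes "water." and "water" count as the same word.

```python
from collections import Counter

# Assumption: these characters are stripped from both ends of every word.
PUNCTUATION = ".,;:!?\"'()"


def top_n(string, n):
    """Return a dict of the n most common words and their counts, ignoring case."""
    words = []
    for raw in string.lower().split():
        word = raw.strip(PUNCTUATION)
        if word:  # skip tokens that were punctuation only
            words.append(word)
    return dict(Counter(words).most_common(n))


posting = """
Herbal sauna uses the healing properties of herbs in combination with distilled water.
The water evaporates and distributes the effect of the herbs throughout the room.
A visit to the herbal sauna can cause real miracles, especially for colds.
"""

print(top_n(posting, 3))
```

On the example posting, "the" appears 6 times, and several words (including "herbal", "sauna", "herbs", and "of") are tied at 2, so any two of the tied words are an acceptable result.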

## 2. Domain Areas

Choose **one** of the following questions to respond to. Your answer should be 1-2 paragraphs long, but please answer the question as completely as possible.

2.1 Econometrics: How would you go about quantifying the social impact of a particular government program or policy? Explain how you would answer either of the following prompts:

How effective are lottery prizes at promoting vaccination? Do they justify the expense of the prize money?
A city has just added a public pre-k education program where none existed before. Explain how you could quantify the benefit.

For your chosen research question, what data would you look at, how would you analyze it, and how might you present your results?

your answer here

2.2 Machine learning: Imagine you are tasked with creating a model to predict whether or not a user will click on a marketing email. What data would you want to have access to and how would you go about training a model?

your answer here

2.3 Natural Language Processing: Given a random paragraph of text from a textbook, explain how you would approach building a model to determine which type of textbook it came from. Possible textbook types include biology, legal, and mathematics.

your answer here

2.4 Consulting: Imagine you have access to a food chain's point-of-sale data (e.g., time of transaction, items bought, customer ID, etc.). How would you help the chain improve efficiency and attract more customers? What additional data would you want to access?

your answer here

## 3. Data Analysis

This section lets you demonstrate your ability to analyze a real-world dataset. You will work with one of three datasets: US international air travel (3.1), COVID-19 vaccine tweets labeled with sentiment (3.2), or adult income (3.3). Each dataset is described below. Pick one of three options: either 3.1 (i.e., both 3.1.1 and 3.1.2), 3.2, or 3.3!

Please note that 3.2 and 3.3 are advanced options; you are encouraged but not expected to attempt them.

In [ ]:

# RUN THE FOLLOWING TO CLONE THE DATA REPO
! git clone https://github.com/harvardanalytics/fall22comp.git

Cloning into 'fall22comp'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 10 (delta 3), reused 4 (delta 1), pack-reused 0
Unpacking objects: 100% (10/10), done.

3.1.1. Visualization

This question will allow you to demonstrate your ability to analyze a real-world dataset. You will work with a US international air travel dataset, described below.

The data comes from the U.S. International Air Passenger and Freight Statistics Report. As part of the T-100 program, USDOT receives traffic reports of US and international airlines operating to and from US airports. There are two datasets available:

Departures: Data on all flights between US gateways and non-US gateways, irrespective of origin and destination.

Each observation provides information on a specific airline for a pair of airports, one in the US and the other outside. Three main columns record the number of flights: Scheduled, Charter, and Total.

Passengers: Data on the total number of passengers for each month and year between a pair of airports, as serviced by a particular airline.

U.S. International Air Passenger and Freight data are confidential for a period of 6 months, after which they can be released. As a result, quarterly reports and the year-to-date/calendar-year raw data files available here will always lag by two quarters.

Run the code below to load the dataset. Explore the data to gain an understanding of the variables that exist. Then, create a barplot ranking the top 10 busiest airports. Think about how you would define "busiest". Make sure to label the plot axes and give it an appropriate title.

In [ ]:

# Run this cell to load the data
import pandas as pd
departure_data = pd.read_csv("fall22comp/International_Report_Departures.csv")
passengers_data = pd.read_csv("fall22comp/International_Report_Passengers.csv")

In [ ]:

#TO DO: Create top 10 busiest airport barplot.
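One way to define "busiest" is total departures per US airport. The aggregation behind that ranking is sketched below on a few invented rows using only the standard library. The airport column name `usg_apt` is an assumption about the CSV (only the `Total` column is named in the description above), the flight counts are made up, and `busiest_airports` is a hypothetical helper.

```python
from collections import Counter

# Invented rows for illustration only; "usg_apt" is an assumed column name.
rows = [
    {"usg_apt": "JFK", "Total": 120},
    {"usg_apt": "LAX", "Total": 90},
    {"usg_apt": "JFK", "Total": 80},
    {"usg_apt": "MIA", "Total": 50},
]


def busiest_airports(rows, airport_col="usg_apt", count_col="Total", top=10):
    """Sum the flight counts per airport and return the largest totals first."""
    totals = Counter()
    for row in rows:
        totals[row[airport_col]] += int(row[count_col])
    return totals.most_common(top)


for airport, flights in busiest_airports(rows, top=3):
    print(airport, flights)
```

With pandas, the equivalent aggregation would be a `groupby` on the airport column followed by a sum and `nlargest(10)`, and the resulting series can be passed to a bar plot. Other definitions of "busiest" (for example, total passengers) would swap in the passengers dataset and a different count column.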

3.1.2. Open-Ended Data Analysis

This question will allow you to show your analysis skills.

Here are some ideas you could potentially explore:

Time-series analysis
Map of busiest airports
Flight paths
Correlation between departures and passengers
Distribution of larger and smaller airlines
Regional analysis
Incorporate outside datasets (population, weather, etc.)

Finally, prepare 1-2 slides presenting the final results of your analysis. You will be asked to explain your process and reasoning in the next interview round. Feel free to make use of this slide template: https://docs.google.com/presentation/d/1X-veqz2bfLep25h45kMQ2uVzkTwBAi8_icZe7Bu2JdE/edit?usp=sharing

In [ ]:

## TO DO: Your code for analysis!

3.2. NLP

This question will allow you to show your NLP skills. Please perform the following two tasks.

1) Write a classifier to identify sentiment based on the "label" column! 1 indicates negative, 2 indicates neutral, and 3 indicates positive sentiment.

2) Prepare a short presentation (1-2 slides) explaining the approach you used (your pre-processing, feature extraction, and classification methods) and why you chose them.

In [ ]:

# Run this cell to load the data
import pandas as pd
df = pd.read_csv("fall22comp/covid-19_vaccine_tweets_with_sentiment.csv", encoding="unicode_escape")

In [ ]:

#TODO: your classifier + analysis code goes here!
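The pre-processing and feature-extraction stages could start from something like the following standard-library sketch. It lowercases a tweet, drops URLs and @-mentions, and counts the remaining words as bag-of-words features. The regular expressions, the helper name `tweet_features`, and the sample tweet are all assumptions made for illustration; they are not taken from the dataset.

```python
import re
from collections import Counter

# Assumption: URLs and @-mentions carry no sentiment and can be removed.
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
# Assumption: a token is a run of lowercase letters or apostrophes.
TOKEN = re.compile(r"[a-z']+")


def tweet_features(text):
    """Return bag-of-words counts for one tweet after basic cleaning."""
    cleaned = URL_OR_MENTION.sub(" ", text.lower())
    return Counter(TOKEN.findall(cleaned))


# Invented example tweet, not from the dataset.
example = "Got my vaccine today! Thanks @nurse https://t.co/abc vaccine"
print(tweet_features(example))
```

Counts like these could then feed a classifier of your choice. Whether to also remove stop words, keep hashtags, or use TF-IDF weighting is a design decision worth explaining on the slides.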

3.3. Machine Learning

This question tests the following skills: binary prediction/classification, feature engineering, and visualization. This is a prediction task to determine whether a person makes over $50K a year.

An individual’s annual income results from various factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, etc. The Adult dataset provided here contains 14 attributes and one target field, income, which is divided into two classes: <=50K and >50K.

Load in the data and import any libraries you might need (e.g. numpy, pandas, sklearn).

In [ ]:

# Run this cell to load the data
import pandas as pd
df = pd.read_csv("fall22comp/adult_data.csv")

Explore the data features. Make a side-by-side boxplot of the binary income categorization against one of the factors (e.g. age). Comment on any insights from this plot and note any potential outliers or unusual data points.

In [ ]:

# TODO: Visualize

Build a classifier to predict whether an individual's income is greater than or less than $50k. The following steps might be useful in building and reporting the results of this classifier:

Data preprocessing: explore missing data, outliers, NULL values, and scale any numeric fields
Train-test split
Choose a classification model and implement it.
Visualize the results of your classifier. Suggestions: confusion matrix, ROC curve
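The split-and-evaluate part of the steps above can be illustrated with a small standard-library sketch. It uses synthetic (feature, label) pairs and a deliberately trivial threshold rule standing in for a real model; the data, the cut-off at 13, and all three helper names are invented for illustration. In practice a library such as sklearn would provide the split, the model, and the metrics.

```python
import random


def split_train_test(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train):
    """Pick the feature cut-off with the best accuracy on the training set."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: sum((x >= t) == bool(y) for x, y in train))


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred:
            counts["tp" if truth else "fp"] += 1
        else:
            counts["tn" if truth else "fn"] += 1
    return counts


# Synthetic data: one numeric feature, label 1 when the feature is at least 13.
data = [(x, int(x >= 13)) for x in range(1, 21)] * 5
train, test = split_train_test(data)
threshold = fit_threshold(train)

y_true = [y for _, y in test]
y_pred = [int(x >= threshold) for x, _ in test]
cm = confusion_counts(y_true, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(test)
print(threshold, cm, accuracy)
```

Fitting the threshold on the training portion only, and scoring on the held-out portion, is the point of the split: it keeps the reported accuracy from being inflated by data the model has already seen.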

In [ ]:

# TODO: Build the classifier

Finally, prepare 1-2 slides presenting your classifier: which model you chose, how you processed data, and some visualizations and final results. You will be asked to explain your process and results in the next interview round. Feel free to make use of this slide template: https://docs.google.com/presentation/d/1X-veqz2bfLep25h45kMQ2uVzkTwBAi8_icZe7Bu2JdE/edit?usp=sharing
