MGSC 416, Winter 2023 Data-driven models for Operations Analytics
Problem Set 2 - Individual Assignment
Please submit your R code (R script or RMarkdown file) with comments. You should also submit a Word/PDF document in which you summarize your findings and answer each question. Please paste the necessary supporting graphs/tables from R into your document (or use an RMarkdown PDF).
Problem: Enron email network (40pts)
In this assignment, we explore the email network of Enron during their investigation. The objective is to use and discuss different network metrics to test a hypothesis. There are two files:
• nodes_email.csv : this file contains the list of nodes of the graph, where each node represents an email address.
• edges_email.csv : this file contains the list of edges of the graph, where an edge is present from A to B if an email was sent from A to B. The weight on the edge represents how many emails were sent.
The goal of the problem is to narrow down a short list of people that we should investigate first. A good investigator forms many hypotheses and goes through them. If some people show up under several hypotheses, that is even more reason to flag them as suspicious.
Note: please use edge.arrow.size=0.1 and vertex.label=NA for all your network plots.
0.1 Step 1: the number of received emails (16pts)
We can look up statistics on the number of emails people receive and send. It is suspected that employees who receive many emails have a higher chance of being implicated in the fraud.
- Who are the top 8 employees with the most received emails? (4pts).
- Let us calculate, for each employee, the ratio of the number of emails received over the number of emails sent. Make a scatter plot with the following information:
• x-axis: number of emails sent.
• y-axis: ratio of emails received over emails sent.
• Size of point: total number of emails exchanged (sent and received).
Make sure your graph and axes are properly titled. (5pts)
- We will focus on employees who have sent at least 10 emails and have a ratio of received to sent emails of at least 1.5.
(a) Who are these suspected employees? (2pts)
(b) We want to highlight these suspects in the network by changing their color and size. Plot the network where the suspected employees are represented by red nodes with size = 7 and the rest are represented by black nodes with size = 3. (5pts)
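The Step 1 workflow can be sketched as follows. This is a minimal illustration, not a reference solution: it assumes the igraph package, uses a small made-up edge list in place of the real files, and assumes that "number of emails" means the sum of edge weights (weighted degree) rather than the number of distinct correspondents. The names `edges`, `net`, `stats` and `suspects` are hypothetical.

```r
library(igraph)

# Toy edge list standing in for the edges file (hypothetical addresses and counts).
edges <- data.frame(
  from   = c("a@enron.com", "a@enron.com", "b@enron.com", "c@enron.com", "d@other.org"),
  to     = c("b@enron.com", "c@enron.com", "c@enron.com", "a@enron.com", "c@enron.com"),
  weight = c(6, 4, 12, 10, 15),
  stringsAsFactors = FALSE
)
net <- graph_from_data_frame(edges, directed = TRUE)

# Assumption: emails sent/received = weighted out-/in-degree (sum of edge weights).
stats <- data.frame(
  name     = V(net)$name,
  sent     = as.numeric(strength(net, mode = "out", weights = E(net)$weight)),
  received = as.numeric(strength(net, mode = "in",  weights = E(net)$weight)),
  stringsAsFactors = FALSE
)
stats$total <- stats$sent + stats$received
stats$ratio <- stats$received / stats$sent   # Inf or NaN when nothing was sent

# Top receivers (here at most 8 rows are shown).
top_received <- head(stats[order(-stats$received), ], 8)

# Scatter plot: point size scaled by total volume.
plot(stats$sent, stats$ratio,
     cex  = 1 + 2 * stats$total / max(stats$total),
     xlab = "Number of emails sent",
     ylab = "Emails received / emails sent",
     main = "Received-to-sent ratio by number of emails sent")

# Suspects: at least 10 emails sent and a ratio of at least 1.5.
suspects <- stats$name[stats$sent >= 10 & stats$ratio >= 1.5]

# Highlight suspects in the network plot.
is_suspect <- V(net)$name %in% suspects
plot(net,
     vertex.color    = ifelse(is_suspect, "red", "black"),
     vertex.size     = ifelse(is_suspect, 7, 3),
     edge.arrow.size = 0.1, vertex.label = NA)
```

On the real data, addresses that never sent an email give an infinite or undefined ratio, so they may need to be handled separately before plotting.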
0.2 Step 2: filter out obvious non-suspects (7pts)
Some email addresses in the dataset are outside the Enron domain, meaning they do not end in enron.com. We will filter these addresses out of our dataset and network.
- Plot the network where we highlight the email addresses that are outside of the Enron domain. We will use the color blue and size = 7 for these nodes and the color black and size = 3 for the rest. In particular, the function grepl('enron.com', email_data$name) returns a TRUE/FALSE vector that we can use to subset the data. (5pts)
- Plot a new network where these nodes are deleted. (2pts)
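The two plots in Step 2 can be sketched along these lines. This is a hedged example on a made-up graph, assuming the igraph package; `net`, `is_enron` and `net_enron` are hypothetical names, and `delete_vertices()` is one of several ways to drop the external nodes.

```r
library(igraph)

# Toy network with two addresses outside the Enron domain (hypothetical data).
edges <- data.frame(
  from = c("a@enron.com", "b@enron.com", "x@example.org", "c@enron.com"),
  to   = c("b@enron.com", "c@enron.com", "a@enron.com",  "y@example.org"),
  stringsAsFactors = FALSE
)
net <- graph_from_data_frame(edges, directed = TRUE)

# TRUE for addresses that contain "enron.com". The pattern is a regular
# expression, so "." matches any character; "enron\\.com$" would be a
# stricter variant that only matches addresses ending in enron.com.
is_enron <- grepl("enron.com", V(net)$name)

# Highlight external addresses in blue.
plot(net,
     vertex.color    = ifelse(is_enron, "black", "blue"),
     vertex.size     = ifelse(is_enron, 3, 7),
     edge.arrow.size = 0.1, vertex.label = NA)

# Remove the external addresses (and their edges) and plot again.
net_enron <- delete_vertices(net, which(!is_enron))
plot(net_enron,
     vertex.color = "black", vertex.size = 3,
     edge.arrow.size = 0.1, vertex.label = NA)
```

The same logical vector can be used to subset the node data frame, so that the table of emails and the network stay consistent.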
0.3 Step 3: Compare metrics (17pts)
From now on we look at the filtered dataset of only employees with an Enron email address (make sure to only consider this subset of your emails), and the resulting network from Q2.2. Next, we will only keep email addresses that are in the largest connected component of the network. We first inspect the membership value of each employee in the network (for example, a network named net), i.e. which connected component they belong to. Then we only keep the ones that belong to the largest component, and we recover the induced subgraph on those nodes.
This is a necessary step for calculating centrality metrics. All metrics will be calculated for the resulting subgraph and the emails associated with it. Note. To calculate quantiles, you can use the function quantile(x, p), which returns the quantile of a vector x at probability p (for example, p = 0.96 for the 96th percentile).
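A minimal sketch of this step, assuming the igraph package and a toy graph standing in for net. Whether to use weakly or strongly connected components is a modelling choice; weak connectivity is assumed here, and `comp`, `keep` and `net_lcc` are hypothetical names.

```r
library(igraph)

# Toy directed graph with two components: {a, b, c} and {d, e}.
edges <- data.frame(
  from = c("a", "b", "c", "d"),
  to   = c("b", "c", "a", "e"),
  stringsAsFactors = FALSE
)
net <- graph_from_data_frame(edges, directed = TRUE)

# Component membership of every node (assumption: weak connectivity).
comp <- components(net, mode = "weak")

# Index of the largest component and the nodes that belong to it.
largest <- which.max(comp$csize)
keep    <- which(comp$membership == largest)

# Induced subgraph on the largest component.
net_lcc <- induced_subgraph(net, keep)
```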
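As a small illustration of the quantile function on made-up values, note that the second argument is a probability between 0 and 1, not a percentage. The vector `x` here is hypothetical.

```r
# Hypothetical scores for 100 employees.
x <- 1:100

# 96th percentile of x (probability 0.96, not 96).
threshold <- quantile(x, 0.96)

# Values strictly above the 96th percentile.
above <- x[x > threshold]
```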
- Calculate the closeness centrality for these employees. Who are the employees that are above the 96th percentile in terms of closeness centrality? (4pts)
- Calculate the betweenness centrality for these employees. Who are the employees that are above the 96th percentile in terms of betweenness centrality? (4pts)
- We want to highlight all of these employees in the plot of the network. Make the plot with the following highlights: use the color blue for the top 4% in closeness only, the color red for the top 4% in betweenness only, the color green for those that rank in the top 4% for both metrics, and the color black for the rest of the nodes. All the nodes that are not black will have a size of 7; the black nodes will have a size of 3. Who are your top suspect employees that rank in the top 4% for both closeness and betweenness? (9pts)
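The comparison of the two metrics can be sketched as below. This is an illustration under stated assumptions, not a reference solution: it assumes the igraph package, uses a toy undirected star graph in place of the largest-component subgraph, relies on igraph's default settings for `closeness()` and `betweenness()`, and reads "above the 96th percentile" as strictly greater than `quantile(x, 0.96)`. All variable names are hypothetical.

```r
library(igraph)

# Toy connected graph: an undirected star with one hub and five leaves.
net_lcc <- make_star(6, mode = "undirected")
V(net_lcc)$name <- c("hub", paste0("leaf", 1:5))

# Centrality scores with igraph defaults.
clo <- closeness(net_lcc)
btw <- betweenness(net_lcc)

# Nodes strictly above the 96th percentile of each metric.
top_clo <- clo > quantile(clo, 0.96)
top_btw <- btw > quantile(btw, 0.96)

# Green: both metrics; blue: closeness only; red: betweenness only; black: rest.
node_color <- ifelse(top_clo & top_btw, "green",
              ifelse(top_clo, "blue",
              ifelse(top_btw, "red", "black")))
node_size <- ifelse(node_color == "black", 3, 7)

plot(net_lcc,
     vertex.color = node_color, vertex.size = node_size,
     edge.arrow.size = 0.1, vertex.label = NA)

# Employees in the top 4% for both metrics.
both <- V(net_lcc)$name[top_clo & top_btw]
```

In this toy graph only the hub stands out, so no blue or red nodes appear; on the real network the two metrics can disagree. For a directed network, the `mode` argument of `closeness()` and the `directed` argument of `betweenness()` affect the result and should be chosen deliberately.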