Homework 3 - 115 points
General Instructions
This homework must be turned in on Gradescope by August 4th 2022, 11:59pm. It must be your own work, and your own work only—you must not copy anyone’s work, or allow anyone to copy yours. This extends to writing code. You may consult with others, but when you write up, you must do so alone. Your homework submission must be written and submitted using Rmarkdown. No handwritten solutions will be accepted. You should submit:
- A compiled PDF file named yourNetID_solutions.pdf containing your solutions to the problems.
- A .Rmd file named yourNetID_solutions.Rmd containing the code and text used to produce your compiled PDF. Note that math can be typeset in Rmarkdown in the same way as in LaTeX.
Please make sure your answers are clearly structured in the Rmarkdown file:
- Label each question part (e.g., 3.a).
- Do not include written answers as code comments.
- The code used to obtain the answer for each question part should accompany the written answer.
Problem 1 - CATE using GOTV 20 points
Consider again the GOTV data from the last problem set, by Gerber, Green and Larimer (APSR, 2008). Although it is not specified in the paper, it is highly likely that the authors created subgroups based on the turnout history in the 5 previous primary and general elections (the number of times the individual voted) and the number of registered voters in the household. In this problem, we will create subgroups based on turnout history and investigate the CATE (conditional average treatment effect) and effect modification in each subgroup. We denote the turnout history (number of times voted) as a covariate $X_i$ for individual $i$.
Part a. Data preparation (5 points):
Construct a new dataset for this problem using the individual-level dataset from the last problem set.
- Create a new column num_voted representing the number of times the individual voted in the previous 5 elections by summing the variables g2000, p2000, g2002, p2002 and p2004 (exclude g2004 because the experiment filtered out people who didn’t vote in g2004). The resulting column should be an integer between 0 and 5.
- In the following problems, we use the individual data with the values of num_voted defining the subgroups. To simplify the problem, we investigate only the "Neighbor" treatment effect. Construct a cleaner dataset with {id, hh_id, hh_size, num_voted, voted, treatment} as columns and filter out all treatment groups other than {Neighbor, Control}.
- Construct a household-level dataset by taking the means of hh_size, num_voted, and voted in each household (the other variables are all equal within the same household and can simply be left as they are). Round the mean of num_voted up to the nearest integer. Your resulting dataset should have one household per row, and hh_id, hh_size, num_voted, voted, and treatment as columns. The variable num_voted should take only the values 0, 1, 2, 3, 4, 5.
- Report the number of households in each subgroup for both treatment and control. What do you observe?
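The data-preparation steps above can be sketched on a handful of made-up records. This is only an illustration in Python (the submission itself must still be written in R): the toy rows, the `row` helper, and the treatment labels in `KEEP` are assumptions, not part of the GOTV data.

```python
import math
from collections import Counter, defaultdict

HISTORY = ["g2000", "p2000", "g2002", "p2002", "p2004"]  # g2004 is excluded
KEEP = {"Neighbors", "Control"}  # assumed labels; check the actual data


def row(id_, hh_id, hh_size, history, voted, treatment):
    """Build one toy individual record (hypothetical helper)."""
    rec = dict(zip(HISTORY, history))
    rec.update(id=id_, hh_id=hh_id, hh_size=hh_size, voted=voted, treatment=treatment)
    return rec


# Made-up individuals; household 13 belongs to an arm that gets filtered out.
individuals = [
    row(1, 10, 2, (1, 0, 1, 0, 0), 1, "Neighbors"),
    row(2, 10, 2, (1, 1, 1, 0, 0), 0, "Neighbors"),
    row(3, 11, 1, (0, 0, 0, 0, 0), 0, "Control"),
    row(4, 12, 2, (1, 1, 1, 1, 1), 1, "Control"),
    row(5, 12, 2, (1, 1, 1, 1, 0), 1, "Control"),
    row(6, 13, 1, (1, 0, 0, 0, 0), 1, "Self"),
]

# Step 1: num_voted counts the earlier elections in which the person voted.
for rec in individuals:
    rec["num_voted"] = sum(rec[e] for e in HISTORY)

# Step 2: keep only the two arms of interest.
kept = [r for r in individuals if r["treatment"] in KEEP]

# Step 3: one row per household; average, then round num_voted up.
by_hh = defaultdict(list)
for r in kept:
    by_hh[r["hh_id"]].append(r)

households = []
for hh_id, members in sorted(by_hh.items()):
    n = len(members)
    households.append({
        "hh_id": hh_id,
        "hh_size": members[0]["hh_size"],  # constant within a household
        "num_voted": math.ceil(sum(m["num_voted"] for m in members) / n),
        "voted": sum(m["voted"] for m in members) / n,
        "treatment": members[0]["treatment"],
    })

# Step 4: household counts per (num_voted, treatment) cell.
counts = Counter((h["num_voted"], h["treatment"]) for h in households)
for cell, n_hh in sorted(counts.items()):
    print(cell, n_hh)
```

In R the same steps map onto a grouped summary by household followed by a two-way frequency table.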
Part b. CATE for subgroups (6 points)
We define the conditional average treatment effect as the ATE for the subgroups defined by the num_voted variable:
$$\tau(x) = E[Y_i(1) - Y_i(0) \mid X_i = x], \quad x \in \{0, 1, 2, 3, 4, 5\}$$
Since treatment was randomized at the household level, positivity and ignorability hold both unconditionally and conditionally within each subgroup. For each subgroup:
1. Estimate the CATE and report the variance of your estimates.
2. Construct a 95% confidence interval around your estimates.
3. What conclusions can you draw from these statistics?
You can skip subgroups that either do not have members in them or do not have any treated/control members.
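One way to organize the per-subgroup computation is a small helper that returns the difference in means, a variance estimate, and a confidence interval. The sketch below is in Python on made-up outcomes and assumes the Neyman (conservative) variance and a normal approximation; the function name `cate_summary` is hypothetical, and the estimator covered in class may differ.

```python
from statistics import NormalDist, mean, variance


def cate_summary(treated, control, level=0.95):
    """Difference in means, Neyman variance, and normal-approximation CI.

    Both groups need at least two observations, otherwise the sample
    variance is undefined and the subgroup should be skipped.
    """
    est = mean(treated) - mean(control)
    var = variance(treated) / len(treated) + variance(control) / len(control)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * var ** 0.5
    return est, var, (est - half_width, est + half_width)


# Made-up household-level turnout rates for a single subgroup.
treated = [1.0, 0.5, 1.0, 0.0, 1.0]
control = [0.0, 0.5, 0.0, 1.0, 0.0, 0.5]

est, var, ci = cate_summary(treated, control)
print(est, var, ci)
```

Applying the helper once per value of num_voted gives one row of results per subgroup.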
Part c. Effect modification (6 points)
Suppose we want to estimate whether there is a difference in effects between two extreme groups, individuals who always vote ($X_i = 5$) and individuals who never vote ($X_i = 0$). We construct an estimator $\hat{\Delta}$ to estimate the difference. As we saw in class, we can estimate this difference as:
$$\hat{\Delta} = \hat{\tau}(0) - \hat{\tau}(5)$$
- Calculate the variance of $\hat{\Delta}$ and construct a 95% confidence interval around it. Can we say that there is a significant difference in the treatment effect between people who always vote and people who never vote?
- Combine your observations with the conclusions from part b and comment on your findings.
Part d. Sample sizes and significance effect (3 points)
In the experiment, the authors claimed no significant differences between groups. One possible reason is that the sample size of each subgroup is too small. This is a practical problem we may encounter in experimental design when we test multiple hypotheses or have too many subgroups. Explain in your own words why having more hypotheses/subgroups would make a significant effect harder to detect for each group, assuming the overall sample size is fixed.
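For the sample-size question in Part d, a quick numerical sketch can make the trade-off concrete. It is written in Python under strong simplifying assumptions that are not in the assignment: equal-sized subgroups, a common outcome standard deviation `sigma`, and a Bonferroni correction as one common (not the only) way to adjust for multiple tests. All numbers are made up.

```python
from statistics import NormalDist


def subgroup_se(sigma, n_total, n_groups, share_treated=0.5):
    """Standard error of a difference in means inside one of n_groups
    equal-sized subgroups carved out of n_total units."""
    n = n_total / n_groups
    n_t = n * share_treated
    n_c = n - n_t
    return (sigma ** 2 / n_t + sigma ** 2 / n_c) ** 0.5


def bonferroni_z(n_tests, alpha=0.05):
    """Two-sided critical value when alpha is split across n_tests tests."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_tests))


for k in (1, 2, 6):
    se = subgroup_se(sigma=0.45, n_total=12000, n_groups=k)
    print(k, round(se, 4), round(bonferroni_z(k), 3))
```

Under these assumptions the standard error grows with the square root of the number of subgroups while the critical value also rises, so the smallest detectable effect in each subgroup increases on both counts.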
Problem 2 - 15 points
In this question we will be using the same household-level dataset that you constructed in part a of Problem 1.
Part a (4 points):
Compute the ATE of the "Neighbors" treatment using the standard difference-in-means estimator, i.e., $\hat{\tau} = \bar{Y}_t - \bar{Y}_c$. Provide standard errors and 95% confidence intervals for your estimates.
Part b (5 points):
Now compute the same ATE with the stratification estimator, defined as the weighted mean of the stratum CATEs that you computed in the previous problem:
$$\hat{\tau}_{\text{block}} = \sum_{x} \hat{\tau}(x) \frac{N_x}{N}$$
where $N_x$ is the number of households in stratum $x$ and $N$ is the total number of households.
Compute the variance and 95% confidence intervals for this estimator as well, using the stratified variance estimator defined as:
$$\text{Var}(\hat{\tau}_{\text{block}}) = \sum_{x} \text{Var}(\hat{\tau}(x)) \left(\frac{N_x}{N}\right)^2$$
Comment on the difference between the ATE estimates you obtained here and in part a and their variances. What is it due to?
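The stratified estimator weights each stratum CATE by its share of the sample, and its variance weights each stratum variance by the squared share. A minimal sketch in Python, using made-up stratum estimates and a hypothetical function name `stratified_estimate`:

```python
def stratified_estimate(strata):
    """Combine per-stratum results into an overall estimate and variance.

    strata: list of (tau_hat, var_hat, n_x) tuples, one per stratum.
    """
    n_total = sum(n for _, _, n in strata)
    tau = sum(t * n / n_total for t, _, n in strata)
    var = sum(v * (n / n_total) ** 2 for _, v, n in strata)
    return tau, var


# Made-up (CATE, variance, stratum size) triples for three strata.
strata = [(0.10, 0.0004, 500), (0.06, 0.0001, 1500), (0.02, 0.0009, 200)]

tau, var = stratified_estimate(strata)
se = var ** 0.5
ci = (tau - 1.96 * se, tau + 1.96 * se)
print(tau, se, ci)
```

Because the weights sum to one, the point estimate is a convex combination of the stratum CATEs, whereas the variance shrinks with the squared weights.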
Part c (6 points):
Now divide the dataset into 6 strata in such a way that each stratum has the same proportion of treated and control observations. You can do so by creating a new variable called "group" with values 0, 1, 2, 3, 4, 5 and randomly assigning each value to $N_t/6$ treated units and $N_c/6$ control units. You may exclude enough treated and control units from the data to make $N_t$ and $N_c$ divisible by 6.
Compute the ATE by applying the stratified estimator $\hat{\tau}_{\text{block}}$ to these newly created strata. Provide variance estimates and 95% confidence intervals for this ATE estimate as well, using the stratified variance estimator. Is the variance of this estimator much different from that of the $\hat{\tau}$ you computed in part a? Why do you think this is the case?
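The random grouping described in this part can be sketched as a shuffle followed by a round-robin assignment, done separately for treated and control units. The Python below uses hypothetical unit ids and a hypothetical helper `assign_groups`; units left over after making the count divisible by 6 are simply dropped.

```python
import random


def assign_groups(unit_ids, n_groups=6, seed=0):
    """Randomly split units into n_groups equal-sized groups.

    Shuffles the ids, drops the remainder so the count is divisible by
    n_groups, and deals the rest out round-robin.
    """
    rng = random.Random(seed)
    ids = list(unit_ids)
    rng.shuffle(ids)
    usable = len(ids) - len(ids) % n_groups
    return {uid: i % n_groups for i, uid in enumerate(ids[:usable])}


treated_ids = range(0, 20)     # 20 hypothetical treated households
control_ids = range(100, 145)  # 45 hypothetical control households

# Assign within each arm so every group gets the same treated/control mix.
group = {
    **assign_groups(treated_ids, seed=1),
    **assign_groups(control_ids, seed=2),
}
print(len(group))
```

Since the groups are formed at random, they are unrelated to the outcome by construction, which is worth keeping in mind when comparing variances with part a.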
Problem 3 - 25 points
Consider a study with $N$ units. Each unit $i$ in the sample belongs to one of $G$ mutually exclusive strata. $G_i = g$ denotes that the $i$th unit belongs to stratum $g$. $N_g$ denotes the size of stratum $g$ and $N_{t,g}$ denotes the number of treated units in that stratum. Suppose that treatment is assigned via block randomization: within each stratum, $N_{t,g}$ units are randomly selected to receive treatment and the remainder receive control. Suppose that the proportion of treated units in each stratum, $N_{t,g}/N_g$, is not the same for all strata. After treatment is assigned, you record an outcome $Y_i$ for each of the $N$ units in the sample. Assume consistency holds with respect to the potential outcomes: $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$, where $D_i$ indicates whether unit $i$ received treatment.
Part a (5 points)
Show that the ATE, $\tau = E[Y_i(1) - Y_i(0)]$, is identified in this setting, i.e., show that $\tau$ is equal to a function of the observed outcomes.
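To see why unequal treated shares across strata matter here, it can help to enumerate every possible block-randomized assignment on a tiny example. The Python sketch below uses made-up potential outcomes for two strata and compares the average of a pooled difference in means with the average of a stratum-size-weighted difference in means. It is a numerical illustration only, not a substitute for the proof.

```python
from itertools import combinations, product
from statistics import mean

# Hypothetical potential outcomes (y0, y1) for two strata of four units each.
strata = {
    "a": [(1, 2), (2, 3), (3, 4), (4, 5)],          # unit-level effect 1
    "b": [(10, 13), (11, 14), (12, 15), (13, 16)],  # unit-level effect 3
}
n_treated = {"a": 1, "b": 3}  # unequal treated shares across strata
n_total = sum(len(units) for units in strata.values())
true_ate = mean(y1 - y0 for units in strata.values() for y0, y1 in units)


def observed(units, treated_idx):
    """Apply consistency: reveal y1 for treated units and y0 for the rest."""
    yt = [y1 for i, (_, y1) in enumerate(units) if i in treated_idx]
    yc = [y0 for i, (y0, _) in enumerate(units) if i not in treated_idx]
    return yt, yc


# Every way of choosing the treated units within each stratum.
choices = [list(combinations(range(len(strata[g])), n_treated[g])) for g in strata]

naive, stratified = [], []
for pick in product(*choices):
    all_t, all_c, strat = [], [], 0.0
    for g, idx in zip(strata, pick):
        yt, yc = observed(strata[g], set(idx))
        strat += (mean(yt) - mean(yc)) * len(strata[g]) / n_total
        all_t += yt
        all_c += yc
    naive.append(mean(all_t) - mean(all_c))
    stratified.append(strat)

# Averages over all equally likely assignments, next to the true ATE.
print(true_ate, mean(stratified), mean(naive))
```

In this toy setup the pooled comparison mixes strata in different proportions for treated and control units, which is the feature an identification argument has to handle.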
Part b (10 points)
Assume that $E[\hat{\tau}(g) \mid G = g, N_g = n_g] = \tau(g)$ and that $E[N_g/N] = \Pr(G = g)$. Show that the stratified estimator:
$$\hat{\tau} = \sum_{g=1}^{G} \hat{\tau}(g) \frac{N_g}{N}$$
is unbiased for the ATE, i.e., show that $E[\hat{\tau}] = \tau$.
Part c (10 points)
Instead of using the stratified difference-in-means estimator, your colleague suggests an alternative that assigns a weight to each unit and takes two weighted averages. Let $w(G_i) = \Pr(D_i = 1 \mid G_i)$ denote the known (constant) probability that unit $i$ would receive treatment given its stratum membership $G_i$. The new estimator is:
$$\hat{\tau}_w = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{D_i Y_i}{w(G_i)} - \frac{(1 - D_i) Y_i}{1 - w(G_i)} \right]$$
Assuming that $E[N_g/N] = \Pr(G = g)$, show that $\hat{\tau}_w$ is unbiased, i.e., show that $E[\hat{\tau}_w] = \tau$.
Note: either showing that $\hat{\tau}_w$ is unbiased for $\tau = E[Y_i(1) - Y_i(0)]$ or for $\tau = \frac{1}{N} \sum_{i=1}^{N} E[Y_i(1) - Y_i(0)]$ will count as a valid answer.
Problem 4 - Directed Acyclic Graphs (DAGs) 15 points
Consider the following Directed Acyclic Graph:
Part a (5 points)
Of the five variables in the graph, 2 are colliders and 3 are non-colliders. Which variables are colliders and which are non-colliders?
Part b (5 points)
Suppose that we wanted to estimate the effect of A on Y. Indicate whether we should or should not condition on X and explain why, and indicate whether we should or should not condition on Z and explain why.
Part c (5 points)
Suppose that we wanted to estimate the effect of M on Y. List all the backdoor paths between M and Y, and indicate which variable we should condition on to close each path. There may be multiple valid options for each path.
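Listing backdoor paths by hand is error-prone, so a mechanical check can be useful. The Python sketch below enumerates the backdoor paths and the colliders along each one for a small hypothetical graph. The graph (nodes T, O, C, U, W) is made up for illustration and is not the graph from this problem.

```python
# Hypothetical DAG, written as (parent, child) pairs. NOT the homework graph.
edges = [("C", "T"), ("C", "O"), ("U", "T"), ("U", "W"), ("O", "W"), ("T", "O")]
edge_set = set(edges)


def neighbors(node):
    """Nodes adjacent to `node`, ignoring edge direction."""
    out = {b for a, b in edges if a == node}
    inc = {a for a, b in edges if b == node}
    return sorted(out | inc)


def simple_paths(start, end, path=None):
    """Yield all paths from start to end that never revisit a node."""
    path = path or [start]
    if path[-1] == end:
        yield path
        return
    for nxt in neighbors(path[-1]):
        if nxt not in path:
            yield from simple_paths(start, end, path + [nxt])


def colliders_on(path):
    """Interior nodes where both adjacent path edges point into the node."""
    return [m for l, m, r in zip(path, path[1:], path[2:])
            if (l, m) in edge_set and (r, m) in edge_set]


def backdoor_paths(treatment, outcome):
    """Paths whose first edge points INTO the treatment."""
    return [p for p in simple_paths(treatment, outcome)
            if (p[1], treatment) in edge_set]


for p in backdoor_paths("T", "O"):
    cols = colliders_on(p)
    if cols:
        status = "blocked by collider(s) " + ", ".join(cols)
    else:
        status = "open until a non-collider on it is conditioned on"
    print("-".join(p), ":", status)
```

Note that whether a node is a collider depends on the path being examined, which is why `colliders_on` takes a path rather than a single node.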