Causal Data Science for Business Analytics
Hamburg University of Technology
Monday, 24. June 2024
Estimating heterogeneous treatment effects informs action and deepens understanding.
Conditional Average Treatment Effect (CATE):
Randomized experiments: no confounders, so CATEs are identified by design.
Measured confounding: distinguish two types of CATEs:
Group ATE (GATE), for groups \(\mathbf{G_i} = f(\mathbf{X_i})\) defined by a coarser function of the covariates: \(\tau(\mathbf{g}) = \mathbb{E}[Y_i(1) - Y_i(0) | \mathbf{G_i = g}]\)
Individualized ATE (IATE = CATE): \(\tau(\mathbf{x}) = \mathbb{E}[Y_i(1) - Y_i(0) | \mathbf{X_i = x}]\)
GATEs are averages of the underlying IATEs by the law of iterated expectations (LIE):
\[ \begin{align*} \mathbb{E}[Y_i(1) - Y_i(0) | f(\mathbf{X_i}) = f(\mathbf{x})] &= \mathbb{E}[\mathbb{E}[Y_i(1) - Y_i(0) | \mathbf{X_i = x}, f(\mathbf{X_i}) = f(\mathbf{x})] | f(\mathbf{X_i}) = f(\mathbf{x})] \\ &= \mathbb{E}[\mathbb{E}[Y_i(1) - Y_i(0) | \mathbf{X_i = x}] | f(\mathbf{X_i}) = f(\mathbf{x})] \end{align*} \]
Group ATE (GATE): \(\tau(\mathbf{g}) = \mathbb{E}[Y_i(1) - Y_i(0) | \mathbf{G_i = g}]\)
Previous lecture: the ATE (AIPW) can be estimated as the mean of a pseudo-outcome. The pseudo-outcome is given by:
\[ \tilde{\tau}_{i,\text{ATE}}^{\text{AIPW}} = \mu(1,\mathbf{X_i}) + \frac{T_i(Y_i - \mu(1,\mathbf{X_i}))}{e(\mathbf{X_i})} - \mu(0,\mathbf{X_i}) - \frac{(1-T_i)(Y_i - \mu(0,\mathbf{X_i}))}{1 - e(\mathbf{X_i})} \]
Taking its mean conditional on \(G_i = g\) identifies the GATE:
\[ \begin{align*} \mathbb{E}[\tilde{\tau}_{i,\text{ATE}}^{\text{AIPW}} \mid G_i = g] &= \mathbb{E} \left[ \mu(1,\mathbf{X_i}) + \frac{T_i(Y_i - \mu(1,\mathbf{X_i}))}{e(\mathbf{X_i})} - \mu(0,\mathbf{X_i}) - \frac{(1-T_i)(Y_i - \mu(0,\mathbf{X_i}))}{1 - e(\mathbf{X_i})} \,\bigg|\, G_i = g \right] \\ &\overset{LIE}{=} \mathbb{E} \Bigg[ \underbrace{\mathbb{E} \left[ \mu(1, \mathbf{X_i}) + \frac{T_i(Y_i - \mu(1, \mathbf{X_i}))}{e(\mathbf{X_i})} \,\bigg|\, \mathbf{X_i = x} \right]}_{\text{CAPO-AIPW} \;\Rightarrow\; \mathbb{E}[Y_i(1) \mid \mathbf{X_i = x}]} - \underbrace{\mathbb{E} \left[ \mu(0, \mathbf{X_i}) + \frac{(1-T_i)(Y_i - \mu(0, \mathbf{X_i}))}{1 - e(\mathbf{X_i})} \,\bigg|\, \mathbf{X_i = x} \right]}_{\text{CAPO-AIPW} \;\Rightarrow\; \mathbb{E}[Y_i(0) \mid \mathbf{X_i = x}]} \,\Bigg|\, G_i = g \Bigg] \\ &= \mathbb{E}\big[\mathbb{E}[Y_i(1) \mid \mathbf{X_i = x}] - \mathbb{E}[Y_i(0) \mid \mathbf{X_i = x}] \mid G_i = g\big] \\ &\overset{LIE}{=} \mathbb{E}[Y_i(1) - Y_i(0) \mid G_i = g] \\ &= \tau(g) \end{align*} \]
# Get the individual ATEs (pseudo-outcomes)
data$ate_i <- dml_irm_forest[["psi_b"]] # get numerator of score function, which is equal to pseudo outcome
mean_ate = mean(data$ate_i) # mean of pseudo outcomes = ATE
library(estimatr) # for linear robust post estimation
summary(lm_robust(ate_i ~ hown, data = data))
Estimates and significance testing of the effect of target variables
Estimate. Std. Error t value Pr(>|t|)
e401 8206 1106 7.421 1.16e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm_robust(formula = ate_i ~ hown, data = data)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 3477 711 4.890 1.025e-06 2083 4870 9913
hown 7445 1835 4.058 4.990e-05 3849 11041 9913
Multiple R-squared: 0.00106 , Adjusted R-squared: 0.0009588
F-statistic: 16.47 on 1 and 9913 DF, p-value: 4.99e-05
Individualized ATE (IATE = CATE): \(\tau(\mathbf{x}) = \mathbb{E}[Y_i(1) - Y_i(0) | \mathbf{X_i = x}]\)
S-learner: train a single model \(\hat{\mu}(T_i, \mathbf{X_i})\) on the full sample, treating \(T_i\) as just another covariate; predict \(\hat{\tau}(\mathbf{x}) = \hat{\mu}(1, \mathbf{x}) - \hat{\mu}(0, \mathbf{x})\).
T-learner: train \(\hat{\mu}(1, \mathbf{X_i})\) on the treated subsample and \(\hat{\mu}(0, \mathbf{X_i})\) on the control subsample; predict \(\hat{\tau}(\mathbf{x}) = \hat{\mu}(1, \mathbf{x}) - \hat{\mu}(0, \mathbf{x})\) (see the sketch after the S-learner code below).
.library(hdm) # for the data
library(grf) # generalized random forests, could also use mlr3
# Get data
data(pension)
# Outcome
Y = pension$net_tfa
# Treatment
T = pension$p401
# Create main effects matrix
X = model.matrix(~ 0 + age + db + educ + fsize + hown + inc + male + marr + pira + twoearn, data = pension)
# Implement the S-Learner
TX = cbind(T,X)
rf = regression_forest(TX,Y)
T0X = cbind(rep(0,length(Y)),X)
T1X = cbind(rep(1,length(Y)),X)
cate_sl = predict(rf,T1X)$predictions - predict(rf,T0X)$predictions
hist(cate_sl)
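For comparison, a minimal T-learner sketch under the same setup, reusing Y, T and X from above:
# Implement the T-Learner: fit separate forests on the treated and control subsamples
rf1 = regression_forest(X[T == 1, ], Y[T == 1]) # model for the treated
rf0 = regression_forest(X[T == 0, ], Y[T == 0]) # model for the controls
cate_tl = predict(rf1, X)$predictions - predict(rf0, X)$predictions # difference of arm-specific predictions
hist(cate_tl)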
Problem: the S- and T-learners do not know of the joint goal to approximate a difference; each outcome model only minimizes its own prediction error. To target the CATE directly, we should aim to minimize:
\[ \begin{align*} \text{MSE}(\hat{\tau}(\mathbf{x})) &= \mathbb{E}[(\hat{\tau}(\mathbf{x}) - \tau(\mathbf{x}))^2] \\ &= \mathbb{E}[(\hat{\mu}(1, \mathbf{x}) - \hat{\mu}(0, \mathbf{x}) - (\mu(1, \mathbf{x}) - \mu(0, \mathbf{x})))^2] \\ &= \mathbb{E}[(\hat{\mu}(1, \mathbf{x}) - \mu(1, \mathbf{x}))^2] + \mathbb{E}[(\hat{\mu}(0, \mathbf{x}) - \mu(0, \mathbf{x}))^2] \\ &\quad - 2\mathbb{E}[(\hat{\mu}(1, \mathbf{x}) - \mu(1, \mathbf{x}))(\hat{\mu}(0, \mathbf{x}) - \mu(0, \mathbf{x}))] \\ &= \text{MSE}(\hat{\mu}(1, \mathbf{x})) + \text{MSE}(\hat{\mu}(0, \mathbf{x})) - 2\,\text{MCE}(\hat{\mu}(1, \mathbf{x}), \hat{\mu}(0, \mathbf{x})) \end{align*} \]
Mean Correlated Error (MCE): correlated errors matter less, since they partly cancel in the difference. Two routes to exploit this:
Modify supervised ML methods to target causal effect estimation.
Combine supervised ML methods to target causal effect estimation: metalearners, e.g.:
Generic metalearner recipe:
1. Estimate nuisance parameters using a suitable ML method.
2. Construct a minimization problem targeting the CATE.
3. Solve the minimization problem using a suitable ML method.
4. Predict the CATE using the model learned in 3.
R-learner (RL): start from the partially linear model, but now allowing for treatment effects that vary with \(\mathbf{X}\):
\[ \begin{align*} \hat{\beta}_{RL} &= \underset{\beta}{\operatorname{arg\,min}} \sum_{i=1}^N( Y_i - \hat{\mu}(\mathbf{X_i}) - \mathbf{\beta'} \underbrace{(T_i - \hat{e}(\mathbf{X_i})) \mathbf{X_i}}_{=\mathbf{\tilde{X}_i}})^2 \\ &= \underset{\beta}{\operatorname{arg\,min}} \sum_{i=1}^N \left( Y_i - \hat{\mu}(\mathbf{X_i}) - \mathbf{\beta'} \mathbf{\tilde{X}_i} \right)^2 \end{align*} \]
\(\mathbf{\tilde{X}_i} = (T_i - \hat{e}(\mathbf{X_i})) \mathbf{X_i}\) are modified / pseudo-covariates.
\(\hat{\tau}_{\text{RL}}(\mathbf{x}) = \mathbf{\hat{\beta}'_{RL} x} \neq \mathbf{\hat{\beta}'_{RL} \tilde{x}}\) is the estimated CATE for a specific \(\mathbf{x}\): prediction uses the original covariates, not the pseudo-covariates.
All linear shrinkage estimators (Lasso and friends) can be applied; the nuisance parameters can still be estimated with non-linear ML, as in the sketch below.
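A minimal sketch of this linear R-learner with a cross-validated Lasso (glmnet), reusing Y, T and X from above; out-of-bag forest predictions stand in for properly cross-fitted nuisance estimates:
library(glmnet) # for the Lasso
# Nuisance models; out-of-bag predictions mimic cross-fitting
e_hat = predict(regression_forest(X, T))$predictions # propensity score e(X)
mu_hat = predict(regression_forest(X, Y))$predictions # conditional mean mu(X)
# Pseudo-covariates (T - e(X)) * X, including a constant for a baseline effect
X1 = cbind(const = 1, X)
X_tilde = (T - e_hat) * X1
# Lasso of the residualized outcome on the pseudo-covariates (no extra intercept)
rl_lasso = cv.glmnet(X_tilde, Y - mu_hat, intercept = FALSE)
beta_rl = as.numeric(coef(rl_lasso, s = "lambda.min"))[-1] # drop glmnet's empty intercept slot
# CATE prediction uses the ORIGINAL covariates, not the pseudo-covariates
cate_rl_lin = as.numeric(X1 %*% beta_rl)
hist(cate_rl_lin)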
\[ \begin{align*} \hat{\tau}_{\text{RL}}(\mathbf{x}) &= \arg \min_{\tau} \sum_{i=1}^n ( Y_i - \hat{\mu}(\mathbf{X_i}) - \tau(\mathbf{X_i}) ( T_i - \hat{e}(\mathbf{X_i})))^2 \\ &= \arg \min_{\tau} \sum_{i=1}^n \frac{( T_i - \hat{e}(\mathbf{X_i}))^2}{(T_i - \hat{e}(\mathbf{X_i}))^2}(Y_i - \hat{\mu}(\mathbf{X_i}) - \tau(\mathbf{X_i}) ( T_i - \hat{e}(\mathbf{X_i})))^2 \\ &= \arg \min_{\tau} \sum_{i=1}^n (T_i - \hat{e}(\mathbf{X_i}))^2 \bigg(\frac{Y_i - \hat{\mu}(\mathbf{X_i}) - \tau(\mathbf{X_i}) ( T_i - \hat{e}(\mathbf{X_i}))}{ T_i - \hat{e}(\mathbf{X_i})}\bigg)^2 \\ &= \arg \min_{\tau} \sum_{i=1}^n (T_i - \hat{e}(\mathbf{X_i}))^2 \bigg(\frac{Y_i - \hat{\mu}(\mathbf{X_i})}{ T_i - \hat{e}(\mathbf{X_i})} - \tau(\mathbf{X_i})\bigg)^2 \\ \end{align*} \]
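The last line shows that the CATE function can be fitted by any supervised method that supports observation weights: regress the pseudo-outcome \(\frac{Y_i - \hat{\mu}(\mathbf{X_i})}{T_i - \hat{e}(\mathbf{X_i})}\) on \(\mathbf{X_i}\) with weights \((T_i - \hat{e}(\mathbf{X_i}))^2\). A minimal forest-based sketch, reusing e_hat and mu_hat from the Lasso sketch above (grf's causal_forest implements a forest version of this objective natively):
# Weighted R-learner: pseudo-outcome regression with weights (T - e(X))^2
res_t = T - e_hat # treatment residual
pseudo_rl = (Y - mu_hat) / res_t # R-learner pseudo-outcome
rl_rf = regression_forest(X, pseudo_rl, sample.weights = res_t^2)
cate_rl = predict(rl_rf)$predictions # out-of-bag CATE estimates
hist(cate_rl)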
Recall the pseudo-outcome of the AIPW-ATE from the previous lecture and condition on \(\mathbf{X_i}\) (same “trick” as for GATE estimation):
The DR-learner by Kennedy (2020) uses \(\tilde{\tau}_{i,\text{ATE}}^{\text{AIPW}}\) as the outcome of a generic ML regression problem:
\[ \hat{\tau}_{\text{DR}} = \underset{\tau}{\operatorname{arg\,min}} \sum_{i=1}^N \big( \tilde{\tau}_{i,\text{ATE}}^{\text{AIPW}} - \tau(\mathbf{X_i}) \big)^2 \]
Cross-fitting: in 4 subsamples, (1) train a model for \(\hat{e}(\mathbf{X_i})\), (2) train a model for \(\hat{\mu}(T_i, \mathbf{X_i})\), (3) construct \(\tilde{\tau}_{i,\text{ATE}}^{\text{AIPW}}\) and regress it on \(\mathbf{X_i}\), (4) predict \(\hat{\tau}_{\text{DR}}(\mathbf{x})\). Then rotate the subsamples.
library(hdm) # for the data
library(causalDML) # implements the DR-learner; ML methods via create_method(), could also use mlr3
# Get data
data(pension)
# Outcome
Y = pension$net_tfa
# Treatment
T = pension$p401
# Create main effects matrix
X = model.matrix(~ 0 + age + db + educ + fsize + hown + inc + male + marr + pira + twoearn, data = pension)
# Implement the DR-Learner
dr = dr_learner(Y,T,X,
ml_w = list(create_method("forest_grf")),
ml_y = list(create_method("forest_grf")),
ml_tau = list(create_method("forest_grf"))
)
# DR-learner: distribution of estimated CATEs (treated vs. control contrast)
hist(dr$cates[,1])
Descriptive: histograms, kernel density plots, box plots, etc.
Inference: test whether effect heterogeneity is systematic or just noise, via the Best Linear Predictor (BLP), Sorted Group Average Treatment Effects (GATES), and Classification Analysis (CLAN) to explore what drives the heterogeneous effects.
Idea: a hypothetical regression of the true CATE on the demeaned predicted CATE:
:Definition “Best Linear Predictor (BLP)”
The best linear predictor of \(\tau(\mathbf{X_i})\) by \(\hat{\tau}(\mathbf{X_i})\) is the solution to:
\((\beta_1, \beta_2) = \underset{\tilde{\beta_1}, \tilde{\beta_2}}{\operatorname{arg\,min}} \space \mathbb{E} \left[ \left( \tau(\mathbf{X_i}) - \tilde{\beta_1} - \tilde{\beta_2} \left( \hat{\tau}(\mathbf{X_i}) - \mathbb{E}[\hat{\tau}(\mathbf{X_i})] \right) \right)^2 \right]\)
\(\beta_2 = \frac{\text{Cov}[\tau(\mathbf{X_i}), \hat{\tau}(\mathbf{X_i})]}{\text{Var}[\hat{\tau}(\mathbf{X_i})]} = 1\) if \(\hat{\tau}(\mathbf{X_i}) = \tau(\mathbf{X_i})\) (what we would like to see)
\(\beta_2 = 0\) if \(\text{Cov}[\tau(\mathbf{X_i}), \hat{\tau}(\mathbf{X_i})] = 0\), which can have two reasons:
The true CATE is constant (no heterogeneity to detect).
The true CATE is not constant, but the estimator is not capable of finding it (bad estimator and/or not enough observations).
Therefore, testing \(H_0: \beta_2 = 0\) is a joint test of the presence of heterogeneity and the ability of the estimator to detect it.
Strategy A: Weighted residual BLP
\((\beta_1, \beta_2) = \underset{\tilde{\beta_1}, \tilde{\beta_2}}{\operatorname{arg\,min}} \space \mathbb{E} \left[ \omega(\mathbf{X_i}) \left( Y_i - \tilde{\beta_1}(T_i - e(X_i)) - \tilde{\beta_2} (T_i - e(X_i)) \left( \hat{\tau}(\mathbf{X_i}) - \mathbb{E}[\hat{\tau}(\mathbf{X_i})] \right) \color{#005e73}{- \alpha \mathbf{X^{C}_{i}}} \right)^2 \right]\)
where \(\omega(\mathbf{X_i}) = \frac{1}{e(\mathbf{X_i})(1 - e(\mathbf{X_i}))}\) are the weights and \(\color{#005e73}{\mathbf{X^{C}_{i}}}\) are optional controls (see below).
See Appendix A in Chernozhukov et al. (2017-2023) for a detailed derivation.
Strategy B: Horvitz-Thompson BLP
\((\beta_1, \beta_2) = \underset{\tilde{\beta_1}, \tilde{\beta_2}}{\operatorname{arg\,min}} \space \mathbb{E} \left[ \left( H_iY_i - \tilde{\beta_1} - \tilde{\beta_2} \left( \hat{\tau}(\mathbf{X_i}) - \mathbb{E}[\hat{\tau}(\mathbf{X_i})] \right) \color{#005e73}{- \alpha H_i \mathbf{X^{C}_{i}}} \right)^2 \right]\)
where \(H_i = \frac{T_i - e(\mathbf{X_i})}{e(\mathbf{X_i})(1 - e(\mathbf{X_i}))}\) are the Horvitz-Thompson (IPW) weights, so that \(H_i Y_i\) acts as a pseudo-outcome.
See Appendix A in Chernozhukov et al. (2017-2023) for a detailed derivation.
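A minimal sketch of Strategy B with lm_robust, reusing Y, T, the propensity estimates e_hat and the CATE predictions cate_rl from the sketches above (the optional controls \(\mathbf{X^{C}_{i}}\) are omitted for brevity):
library(estimatr) # robust inference
H = (T - e_hat) / (e_hat * (1 - e_hat)) # Horvitz-Thompson (IPW) weights
HY = H * Y # pseudo-outcome
tau_dm = cate_rl - mean(cate_rl) # demeaned predicted CATE
# Intercept estimates beta_1 (the ATE); slope estimates beta_2 (test H0: beta_2 = 0)
summary(lm_robust(HY ~ tau_dm))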
Idea: sort observations into \(K\) groups according to the quantiles of the predicted CATE \(\hat{\tau}(\mathbf{X_i})\) and estimate the ATE within each group:
Definition “Sorted Group Average Treatment Effect (GATES)”
\(\gamma_k := \mathbb{E}[ \tau(\mathbf{X_i}) | G_k]\), \(k = 1, \ldots, K\)
Strategy A: Weighted residual GATES
\((\gamma_1, \dots, \gamma_K) = \underset{\tilde{\gamma}_1, \dots, \tilde{\gamma}_K}{\operatorname{arg\,min}} \space \mathbb{E} \left[ \omega(\mathbf{X_i}) \left( Y_i - \sum_k\tilde{\gamma_k}(T_i - e(X_i))\mathbb{1}[G_k] \color{#005e73}{- \alpha \mathbf{X^{C}_{i}}} \right)^2 \right]\)
Strategy B: Horvitz-Thompson GATES
\((\gamma_1, \dots, \gamma_K) = \underset{\tilde{\gamma}_1, \dots, \tilde{\gamma}_K}{\operatorname{arg\,min}} \space \mathbb{E} \left[ \left( H_iY_i -\sum_k \tilde{\gamma_k}\mathbb{1}[G_k] \color{#005e73}{- \alpha H_i \mathbf{X^{C}_{i}}} \right)^2 \right]\)
with \(H_i Y_i\) again acting as a pseudo-outcome and \(H_i = \frac{T_i-e(\mathbf{X_i})}{e(\mathbf{X_i})(1-e(\mathbf{X_i}))}\) being the Horvitz-Thompson (IPW) weights.
\(\color{#005e73}{\mathbf{X^{C}_{i}}}\) is not required for identification, but contains optional functions of \(\mathbf{X_i}\) to reduce estimation noise, e.g. \([1,\hat\mu(0,\mathbf{X_i}), e(\mathbf{X_i}), e(\mathbf{X_i})\hat{\tau}(\mathbf{X_i})]\)
See Appendix A in Chernozhukov et al. (2017-2023) for a detailed derivation.
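A minimal sketch of Strategy B for GATES, continuing the BLP sketch above (reusing HY and cate_rl):
# Sort observations into K = 5 groups by quantiles of the predicted CATE
K = 5
G = cut(cate_rl, breaks = quantile(cate_rl, probs = seq(0, 1, length.out = K + 1)),
        include.lowest = TRUE, labels = FALSE)
# Group dummies without intercept yield gamma_1, ..., gamma_K
summary(lm_robust(HY ~ 0 + factor(G)))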
Classification Analysis (CLAN) can be implemented by simple mean comparisons of covariates in extreme GATES groups:
Definition “Classification Analysis (CLAN)”
Classification Analysis (CLAN) compares the covariate values of the least affected group \(G_1\) with the most affected group \(G_K\) defined for the GATES:
\(\delta_1 = \mathbb{E}[\mathbf{X_i} \mid G_1]\) and \(\delta_K = \mathbb{E}[\mathbf{X_i} \mid G_K]\),
where \(G_1\) and \(G_K\) are the groups with the smallest and largest GATES, respectively.
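A minimal sketch, continuing with the GATES groups G from the sketch above:
# Mean covariate values in the least (G = 1) vs. most (G = K) affected group
clan = cbind(colMeans(X[G == 1, ]), colMeans(X[G == K, ]))
colnames(clan) = c("least affected", "most affected")
round(clan, 2)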
R package GenericML by Welz, Alfons, Demirer, and Chernozhukov (2022).
Algorithm:
IN: \(\text{Data} = (Y_i, \mathbf{X_i}, T_i )^{N}_{i=1}\), significance level \(\alpha\), a suite of ML methods, number of splits \(S\).
OUT: \(p\)-values and \((1-2\alpha)\) confidence intervals for the point estimates of each target parameter in BLP, GATES, and CLAN.
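A minimal usage sketch of the GenericML package on the pension data, reusing Y, T and X from above; argument and accessor names follow the package's documented interface, so check ?GenericML before relying on them:
library(GenericML)
gml <- GenericML(Z = X, D = T, Y = Y, # covariates, binary treatment, outcome
                 learners_GenericML = c("random_forest"), # suite of ML methods
                 num_splits = 10, # number of sample splits S
                 parallel = FALSE)
# In observational data, additionally specify learner_propensity_score (see ?GenericML)
get_BLP(gml) # beta_1 and beta_2 with CIs and p-values
get_GATES(gml) # sorted group ATEs
get_CLAN(gml) # classification analysis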
CATE Prediction Methods: S-learner, T-learner, R-learner, DR-learner.
HET Evaluation: BLP, GATES, CLAN.
Notation: a policy \(\pi(\mathbf{X_i}) \in \{0, 1\}\) maps covariates into a treatment decision; \(Y(\pi)\) is the outcome realized under policy \(\pi\).
Additional notation: \(Q(\pi) = \mathbb{E}[Y(\pi)]\) is the value function and \(A(\pi)\) the advantage of policy \(\pi\) over random allocation (derived below).
Goal: find the optimal policy \(\pi^*\) that maximizes the value function \(Q(\pi)\).
Fundamental problem of causal inference: counterfactuals are unknown.
Problem: minimizing \(\text{MSE}_{\text{CATE}} = \mathbb{E}[(\hat{\tau}(\mathbf{x}) - \tau(\mathbf{x}))^2]\) does not necessarily improve downstream policy rule learning (Qian & Murphy, 2011).
Idea: subtract the value of a benchmark policy that assigns treatments via a fair coin flip (a constant, so the argmax is unchanged):
\[ \begin{align*} \pi^* &= \arg \max_{\pi} Q(\pi) = \arg \max_{\pi} \mathbb{E}[Y(\pi)] = \arg \max_{\pi} \mathbb{E}[Y(\pi) \color{#00C1D4}{- 0.5 \mathbb{E}[Y(1)] - 0.5 \mathbb{E}[Y(0)]}] \\ &= \arg \max_{\pi} \mathbb{E}[\pi Y(1) + (1 - \pi) Y(0)] - 0.5 \mathbb{E}[Y(1)] - 0.5 \mathbb{E}[Y(0)] \\ &= \arg \max_{\pi} \mathbb{E}[(\pi - 0.5) Y(1)] + \mathbb{E}[(0.5 - \pi) Y(0)] = \arg \max_{\pi} \mathbb{E}[(\pi - 0.5) (Y(1) - Y(0))] \\ &= \arg \max_{\pi} {\color{#00C1D4}2} \mathbb{E}[(\pi - 0.5) (Y(1) - Y(0))] \\ &= \arg \max_{\pi} \mathbb{E}[(2\pi - 1)(Y(1) - Y(0))] \\ &\overset{LIE}{=} \arg \max_{\pi} \mathbb{E}[(2\pi - 1) \tau(\mathbf{X_i})] \\ &= \arg \max_{\pi} \underbrace{\mathbb{E}[|\tau(\mathbf{X_i})| \text{sign}(\tau(\mathbf{X_i})) (2\pi(\mathbf{X_i}) - 1)]}_{A(\pi)} \end{align*} \]
\(A(\pi)\) is the advantage of a policy compared to random allocation:
Classify the sign of the CATE correctly: earn the absolute value of the CATE.
Classify it incorrectly: lose the absolute value of the CATE.
It is most important to get it right for those with the biggest CATEs; those with CATEs close to zero are negligible.
This is the key difference to CATE MSE minimization, where we need to find good approximations everywhere.
Replace the unknown \(\tau(\mathbf{X_i})\) by the AIPW pseudo-outcome (again) because of all its nice properties:
\[ \hat{\pi} = \underset{\pi}{\operatorname{arg\,max}} \frac{1}{N} \sum_{i=1}^N |\tilde{\tau}_{i,\text{ATE}}^{\text{AIPW}}| \, \text{sign}(\tilde{\tau}_{i,\text{ATE}}^{\text{AIPW}}) \, (2\pi(\mathbf{X_i}) - 1) \]
Binary weighted classification problem: classify the sign of the CATE while favoring correct classifications with larger absolute CATEs.
# Load required packages
library(mlr3)
library(mlr3learners)
library(tidyverse)
library(rpart.plot)
# Get and transform the individual ATEs (pseudo-outcomes)
data$ate_i <- dml_irm_forest[["psi_b"]] # get numerator of score function, which is equal to pseudo outcome
data$sign <- as.factor(sign(data$ate_i)) # sign of pseudo outcomes as factor (classification target)
data$weights <- abs(data$ate_i) # get weights (absolute value of pseudo outcomes)
# Optimal Policy Learning with Classification Tree
data <- data[,c("sign", "weights", "age", "db", "educ", "fsize",
"hown", "inc", "marr", "pira", "twoearn")] # select columns
task <- as_task_classif(data, target = "sign") # task
task$set_col_roles("weights", roles = "weight") # define weights
lrnr <- lrn("classif.rpart") # learner decision tree
lrnr$train(task) # train
rpart.plot(lrnr$model, yesno = 2) # plot
# Classification Analysis
data$pi <- 2 * (as.numeric(lrnr$predict(task)$response) - 1.5) # Output takes values 1 and 2, therefore recode to -1/1
data_x <- data[, -c(1:2)] # remove sign and weights
CLAN = cbind(colMeans(data_x[data_x$pi == -1,]), colMeans(data_x[data_x$pi == 1,])) # calculate CLAN by policy group
colnames(CLAN) = c("No 401(k)","401(k)") # rename columns
round(CLAN,2) # print
No 401(k) 401(k)
age 45.38 41.03
db 0.00 0.27
educ 16.02 13.19
fsize 2.95 2.87
hown 0.97 0.63
inc 138411.00 36532.74
marr 0.97 0.60
pira 0.72 0.24
twoearn 0.68 0.38
pi -1.00 1.00
Thank you for your attention!
Causal Data Science: (6) Heterogeneous Treatment Effects