3 Causal Inference: A Practical Approach
Having laid the foundation with the potential outcomes framework and fundamental causal concepts, we now delve into the world of causal modeling. In this chapter, we explore the power of causal graphs as a comprehensive approach for inferring causal relationships. First, we introduce causal graphs, explaining the basics of graph modeling in the context of causal inference. Next, we offer an overview of the high-level process involved in causal inference using causal graphs. We then discuss each stage of the causal inference process, examining the commonly employed methods and techniques in more detail. Finally, we present a comprehensive case study utilizing the Lalonde dataset, comparing and contrasting the causal inference techniques discussed, accompanied by a thorough analysis.
3.1 Causal Inference: Logical Flow
In the last chapter, we introduced the causality flowchart for estimating the causal effect, as shown in Figure 3.1.
The two pertinent questions will be answered in this chapter, as shown in Figure 3.2.
- How do we perform identification and convert causal estimands into statistical estimands? What are the tools available for this process?
The identification and conversion of causal estimands to statistical estimands are achieved through causal modeling.
- Once we have obtained statistical estimands, how do we proceed with estimation? What tools can be utilized for this purpose?
Estimation is carried out by leveraging the statistical estimands derived from the causal modeling process and utilizing observational data in conjunction with appropriate statistical techniques.
Causal models can be constructed either by experts with domain knowledge or through automated procedures known as causal discovery (discussed in Chapter 4). Building causal models requires domain knowledge, identifying relevant variables, and determining the relationships between them.
3.2 Causal Inference: Practical Flow
In the previous section, we examined the logical flowchart that illustrates transitioning from a target causal estimand to an estimate. Now, we will delve into the practical aspect of this process, providing a high-level overview of the steps involved in practice.
This section aims to bridge the gap between theory and application, offering insights into how the causal inference process unfolds in real-world scenarios.
Notably, unlike the machine learning process, the causal inference process involves distinct steps, as Sharma and Kiciman underscored in their work on DoWhy (Sharma and Kiciman 2020), and as shown in Figure 3.3.
To initiate the causal modeling process, one of the methodologies employed is constructing a causal graph with appropriate structural assumptions. Causal graphs depict the causal structure by utilizing nodes to represent variables and edges to represent the causal relationships between them.
After establishing the causal assumptions within a causal model, the subsequent stage of causal analysis is identification. During this stage, the objective is to examine the causal model, encompassing the relationships between variables and the observed variables, to ascertain if there is sufficient information to address a particular causal inference question. Different techniques are utilized with the aim of converting the causal estimands into statistical estimands.
After confirming the estimability of the causal effect and transforming the causal estimands into statistical estimands, the third step revolves around estimating the effect using suitable statistical estimators. Different estimation techniques, such as regression models or propensity score matching, can be utilized to estimate the causal effect.
Finally, the fourth step involves validating and assessing the robustness of the obtained estimate through rigorous checks and sensitivity analyses. This step includes examining the sensitivity of the estimated effect to different model specifications, testing the robustness of the results against potential sources of bias or unobserved confounding, and assessing the generalizability of the findings.
3.3 Causal Modeling
The initial phase of any causal inference effort entails constructing causal models, graphical models in our case, which encode domain understanding and assumptions. A well-designed causal model should capture the relationships between the variables and the outcome, as well as the inter-relationships among the variables themselves.
Causal graphs are a common way of representing the relationships between variables when modeling the joint distribution.
In a directed acyclic graph (DAG), similar to Bayesian networks, a node is considered a cause of another if changes made to it produce corresponding changes or responses in the connected node(s). In the context of graphical models, a causal graph is a type of Bayesian network in which each node's parents represent its direct causes.
For a comprehensive and in-depth exploration of causal graphs, we highly recommend referring to Judea Pearl’s work (Pearl 2009, 2000).
3.3.1 Assumptions in Causal Modeling
Figure 3.4 shows the entire flow of assumptions that help in causal modeling.
- Local Markov Assumption: Given the parents in the DAG, a node is independent of all its non-descendants.
- Minimality Assumption: Adjacent nodes in the DAG are dependent, and these dependencies have to be considered in the factorization in addition to the local Markov assumption. This assumption removes the independence assumptions that would otherwise be permitted where edges are present in the DAG.
- Causal Edge assumption: In causal relationships, every parent is the direct cause of their children.
Thus, for a given distribution represented by a DAG, the local Markov assumption gives us the statistical independencies, the minimality assumption gives us the statistical dependencies (at least between adjacent nodes), and finally, layering the causal edge assumption on top gives us the causal dependencies.
Causal graphs possess an essential characteristic of being able to identify confounding variables. These variables are associated with both the cause and the effect but do not lie on the causal pathway between them. For instance, in the case of smoking and lung cancer, age may act as a confounding variable related to both smoking and lung cancer without lying on the causal pathway between them. By including age as a node in the causal graph, researchers can control for its effects and isolate the causal relationship between smoking and lung cancer.
Causal graphs can identify the minimal set of variables that is imperative for estimating the causal effect of one variable on another. This technique, known as identification (discussed later), is based on the notion that several paths may exist between two variables in a causal graph, but only some are crucial for estimating the causal effect. By identifying the minimal set of variables needed, researchers can effectively and accurately estimate the causal effect with less bias and more efficiency.
3.3.2 Building Blocks in Causal Graphs
In graph theory, the term "flow of association" refers to the presence or absence of association between any two nodes in a given graph. This concept can also be expressed as the statistical dependence or independence between two nodes. In the following section, we shall delve into the fundamental building blocks and terminology essential to graphical representations of various causal associations between variables. Furthermore, we shall conduct preliminary analyses to investigate the conditional independence or dependence of the variables within these building blocks.
3.3.2.1 Chains
A chain is a sequence of nodes such that each node is a parent of the next node in the sequence, e.g., $X_1 \to X_2 \to X_3$, as shown in Figure 3.5.

There is dependence between the endpoints $X_1$ and $X_3$: association flows along the directed path. However, conditioning on the middle node blocks the path, making the endpoints conditionally independent: $X_1 \perp X_3 \mid X_2$.
3.3.2.2 Forks
In a fork, two variables share a single common parent, e.g., $X_1 \leftarrow X_2 \to X_3$, as shown in Figure 3.6. Similar to chains, there is dependence between $X_1$ and $X_3$ through the common parent, and conditioning on the parent $X_2$ blocks the path: $X_1 \perp X_3 \mid X_2$.
3.3.2.3 Immoralities
Immoralities refer to a configuration in a directed acyclic graph where a single node has two parents that are not directly connected, e.g., $X_1 \to X_2 \leftarrow X_3$, as depicted in Figure 3.7. The node $X_2$ is called a collider. Unlike chains and forks, the parents $X_1$ and $X_3$ are marginally independent, but conditioning on the collider $X_2$ (or any of its descendants) unblocks the path and induces dependence between them.
3.3.2.4 Blocked Path
The concept of a blocked path is intimately tied to the flow of causal influence. A path between two nodes, $X$ and $Y$, is blocked by a (potentially empty) conditioning set $Z$ if either of the following holds:

- There exists a node $W$ on the path from $X$ to $Y$ such that it is part of a chain structure ($\cdots \to W \to \cdots$) or a fork structure ($\cdots \leftarrow W \to \cdots$), and $W$ is conditioned on ($W \in Z$).
- There is a collider $W$ on the path ($\cdots \to W \leftarrow \cdots$) that is not conditioned on ($W \notin Z$), and none of its descendants are conditioned on, i.e., $\mathrm{de}(W) \cap Z = \emptyset$.
3.3.2.5 d-Separation
The concept of d-separation is tied to the concept of a blocked path in the graph. A node (or set of nodes) $X$ is d-separated from a node (or set) $Y$ by a conditioning set $Z$ if every path between $X$ and $Y$ is blocked by $Z$.

The concept of d-separation implies an important theorem: if $X$ and $Y$ are d-separated given $Z$ in the graph, then they are conditionally independent, $X \perp Y \mid Z$, in every distribution that is Markov with respect to that graph.
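To see d-separation mechanically, here is a small sketch using networkx (assuming networkx ≥ 3.3, which provides `is_d_separator`; the graphs and node names are illustrative, not from the text):

```python
import networkx as nx  # assumes networkx >= 3.3 for is_d_separator

chain = nx.DiGraph([("A", "B"), ("B", "C")])     # chain: A -> B -> C
collider = nx.DiGraph([("A", "B"), ("C", "B")])  # immorality: A -> B <- C

# Chain: A and C are dependent, but conditioning on B blocks the path
print(nx.is_d_separator(chain, {"A"}, {"C"}, set()))     # False
print(nx.is_d_separator(chain, {"A"}, {"C"}, {"B"}))     # True

# Collider: A and C start independent; conditioning on B opens the path
print(nx.is_d_separator(collider, {"A"}, {"C"}, set()))  # True
print(nx.is_d_separator(collider, {"A"}, {"C"}, {"B"}))  # False
```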
3.3.3 Causal Graphs and Structural Interventions
In the context of causal graphs, a structural intervention is represented as an exogenous intervention variable that takes full control of the intervened node: all edges from the node's original parents are broken, and the node's value is determined solely by the intervention.

For example, consider a causal graph representing the relationship between smoking, age, and cancer, as shown in Figure 3.8. Suppose that smoking and age are the direct causes of cancer. A structural intervention can be performed on smoking, where the value of smoking is set externally (for example, everyone is made to smoke or to abstain); any edges into the smoking node are removed, while the edge from smoking to cancer remains intact.
This “structural” property is critical to understanding the effect of interventions on causal graphs. Suppose there are multiple simultaneous structural interventions on variables in the graph. In that case, the manipulated distribution for each intervened variable is independent of every other manipulated distribution, and the edge-breaking process is applied separately to each variable. This process implies that all edges between variables subject to intervention are removed. After removing all edges from the original graph incident to variables that are the target of a structural intervention, the resulting graph is called the post-manipulation graph, which represents the manipulated distribution over the variables.
3.3.4 Observational Data and Interventional Data
Next, let us discuss the terminologies of observational and interventional data, which play a vital role in understanding the causal process in greater detail.
Observational data is collected simply by passively observing a system or population without any intervention or change in the process. In contrast, interventional data is collected by actively manipulating the system or population in some way, like in randomized control trials. Observational studies may be subject to various types of biases, such as confounding, selection bias, and measurement bias, making it difficult to distinguish between causal and non-causal relationships in the dataset.
Interventional data, on the other hand, is often considered to be the gold standard for establishing causal relationships between variables. This is because interventions allow actively manipulating the independent variable and observing the resulting changes in the dependent variable. Researchers can minimize the effects of confounding and other biases by randomly assigning participants to different treatment groups, like in randomized control trials.
The acquisition of observational data is generally less resource-intensive than interventional data, which can be expensive and impractical to obtain in specific scenarios. This raises the question of whether it is possible to derive interventional data from observational data.
3.3.5 The do-operator and Interventions
The do-operator and identification process are essential tools that facilitate us going from observational to interventional data. The do-operator helps distinguish interventional distributions from observational distributions, while identification helps determine which causal relationships can be inferred from the observed data. We will discuss the process in detail in the following section.
The do-operator is a symbolic notation used in causal inference to represent interventions, denoting the intervention distribution over the population. Given a treatment ($T$) and an outcome ($Y$), the interventional distribution is written as $P(Y \mid do(T=t))$, in contrast to the observational distribution $P(Y \mid T=t)$.

Also, the potential outcome ($Y(t)$) connects to the do-operator through the identity $P(Y(t) = y) = P(Y = y \mid do(T=t))$, so causal estimands expressed with potential outcomes can equivalently be written with the do-operator.
3.3.6 Modularity assumptions
The modularity assumption concerns the local impact of any intervention applied to a causal graph. The modularity assumption is also known as invariance, autonomy, or independent mechanisms.

It states that if a node $X_i$ is intervened on, only its own mechanism $P(x_i \mid \mathrm{pa}_i)$ changes; the conditional distributions $P(x_j \mid \mathrm{pa}_j)$ of all other nodes $j \neq i$ remain unchanged.
3.3.7 Modularity Assumptions and Truncated Factorization
The Bayesian network factorization for a DAG such as the one in Figure 3.10, over variables $X_1, \dots, X_n$, can be written as:

$$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{pa}_i)$$

Applying the modularity assumption, an intervention $do(T=t)$ on a treatment node $T$ replaces its factor with a point mass, which gives us the truncated factorization:

$$P(x_1, \dots, x_n \mid do(T=t)) = \prod_{i:\, X_i \neq T} P(x_i \mid \mathrm{pa}_i)$$

for values consistent with $T = t$ (and probability zero otherwise). Now, although the left-hand side lives in the interventional world, every factor on the right-hand side is purely observational, so we can rewrite the interventional distribution entirely in terms of quantities estimable from observational data.
3.3.8 Structural Causal Models (SCM)
Structural Causal Models (SCMs) are a critical component of modeling for causal inference. SCMs are mathematical models representing the causal relationships between variables in a system (Spirtes et al. 2000).
Structural Causal Models (SCMs) consist of two fundamental constituents: structural equations for the endogenous variables, which portray the causal relationships among variables, and exogenous variables, which represent inputs to the system unaffected by the other variables. Causal graph structures depict the structural equations as functions among the variables, visually showcasing the relationships between exogenous and endogenous variables.

In the simplest form, if a variable $A$ is a direct cause of a variable $B$, the structural equation is written as $B := f_B(A, U_B)$, where $U_B$ is an exogenous noise variable and the assignment operator $:=$ captures the asymmetry of causation: $B$ is determined from $A$, not vice versa.
3.3.8.1 Interventions and Modularity Assumptions in SCM
For the basic causal model shown in Figure 3.13, with a single confounder $W$, treatment $T$, and outcome $Y$, the structural equations are $T := f_T(W, U_T)$ and $Y := f_Y(W, T, U_Y)$. Under the modularity assumption, the intervention $do(T=t)$ simply replaces the first equation with $T := t$, while the mechanism for $Y$ remains untouched.
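To make the modularity of interventions concrete, here is a minimal numpy sketch of such an SCM; the functional forms and coefficients are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample_scm(do_t=None):
    """Sample from the SCM W -> T, W -> Y, T -> Y.

    If do_t is given, the structural equation for T is replaced
    by the constant assignment T := do_t (modularity assumption)."""
    w = rng.normal(size=n)                                  # W := U_W
    if do_t is None:
        t = (w + rng.normal(size=n) > 0).astype(float)      # T := f_T(W, U_T)
    else:
        t = np.full(n, float(do_t))                         # intervention: T := t
    y = 2.0 * t + 1.5 * w + rng.normal(size=n)              # Y := f_Y(W, T, U_Y)
    return w, t, y

# Observational association (confounded by W) vs. interventional effect
_, t_obs, y_obs = sample_scm()
assoc = y_obs[t_obs == 1].mean() - y_obs[t_obs == 0].mean()
ate = sample_scm(do_t=1)[2].mean() - sample_scm(do_t=0)[2].mean()
print(f"observational difference: {assoc:.2f}, interventional ATE: {ate:.2f}")
```

The observational contrast is inflated by the confounder $W$, while resampling under $do(T=t)$ recovers the true effect of 2.0.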
3.4 Identification
Identification is the process of converting causal estimands to statistical estimands, i.e., going from an interventional expression such as $P(Y \mid do(T=t))$ to an expression involving only observational distributions.

Let us consider a simple identification process with one treatment ($T$), one outcome ($Y$), and one observed confounder ($W$) that causes both $T$ and $Y$.

Identification is to find $P(y \mid do(t))$ in terms of observational quantities.

Applying the modularity assumption by intervening on the variable ($T$), the truncated factorization gives:

$$P(y, w \mid do(t)) = P(y \mid t, w)\,P(w)$$

If we marginalize the variable ($W$) out, we get:

$$P(y \mid do(t)) = \sum_{w} P(y \mid t, w)\,P(w)$$

Thus, from the causal estimand $P(y \mid do(t))$, we have arrived at the statistical estimand $\sum_{w} P(y \mid t, w)\,P(w)$, which can be estimated from observational data.

As highlighted in the overall process illustrated in Figure 3.15, the identification process involves a transformation from the causal estimand $P(y \mid do(t))$ to the statistical estimand $\sum_{w} P(y \mid t, w)\,P(w)$.
Identification methods can be classified into two broad categories with subcategories, as shown below.
- Graphical Constraint-based Methods
  - Randomized Control Trials
  - Backdoor Adjustments
  - Frontdoor Adjustments
- Non-Graphical Constraint-based Methods
  - Instrumental Variables
  - Regression Discontinuity
  - Difference-in-Differences
- Pearl's do-calculus
Next, we will go over these different methods of identification.
3.4.1 Randomized Control Trials (RCT)
As elucidated in the second chapter, comprehending the impact of unseen confounding factors when measuring treatment effects on an outcome is challenging. One way to address this issue is through randomized controlled trials, which introduce randomness in the treatment allocation process and ensure that the resulting groups (for a binary treatment, the treated and control groups) are comparable in both observed and unobserved covariates. Randomization makes the treatment independent of the potential outcomes, $(Y(1), Y(0)) \perp T$, so that association equals causation: $E[Y(1)] - E[Y(0)] = E[Y \mid T=1] - E[Y \mid T=0]$.
3.4.2 Backdoor Criterion and Backdoor Adjustment
The interventional causal graph can have various paths from the treatment ($T$) to the outcome ($Y$). Paths that enter the treatment through an incoming edge (i.e., begin with an arrow pointing into $T$) are called backdoor paths; they carry non-causal association between $T$ and $Y$.

As shown in Figure 3.16, the left-side graph is the observational graph, containing a backdoor path $T \leftarrow W \to Y$, while the interventional graph on the right has the edge $W \to T$ removed by the intervention on $T$.

Can we get an equivalent interventional causal graph from the observational graph? The answer is yes, and it can be done by conditioning on the nodes/variables along the backdoor paths. By conditioning on $W$, the backdoor path is blocked, and the association that remains between $T$ and $Y$ is purely causal.

Formally, a set of variables $W$ satisfies the backdoor criterion relative to the treatment $T$ and outcome $Y$ if:

- The variable set $W$ blocks all the backdoor paths from $T$ to $Y$.
- $W$ does not contain any descendant of the treatment $T$.

Thus, based on the modularity assumption and the backdoor criterion, one can identify the causal effect by the backdoor adjustment formula:

$$P(y \mid do(t)) = \sum_{w} P(y \mid t, w)\,P(w)$$
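As an illustration, the following minimal sketch applies the backdoor adjustment formula directly to simulated binary data (the variable names and probabilities are our own assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100_000
w = rng.binomial(1, 0.5, n)                       # confounder
t = rng.binomial(1, 0.2 + 0.6 * w)                # treatment depends on W
y = rng.binomial(1, 0.1 + 0.3 * t + 0.4 * w)      # outcome depends on T and W
df = pd.DataFrame({"W": w, "T": t, "Y": y})

def p_y1_do_t(df, t_val):
    """Backdoor adjustment: P(Y=1 | do(T=t)) = sum_w P(Y=1 | t, w) P(w)."""
    total = 0.0
    for w_val, p_w in df["W"].value_counts(normalize=True).items():
        stratum = df[(df["T"] == t_val) & (df["W"] == w_val)]
        total += stratum["Y"].mean() * p_w
    return total

ate = p_y1_do_t(df, 1) - p_y1_do_t(df, 0)
naive = df.loc[df["T"] == 1, "Y"].mean() - df.loc[df["T"] == 0, "Y"].mean()
print(f"adjusted ATE ≈ {ate:.3f} (true 0.3), naive ≈ {naive:.3f}")
```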
3.4.3 Front-door Adjustments
Judea Pearl justifies the use of the front-door adjustment method through the illustration of an example in which the effect of smoking (treatment) on cancer (outcome) is studied while taking into account the influence of tar (observed mediator) and an unknown genotype (unobserved confounder), as depicted in Figure 3.18. In such Directed Acyclic Graphs (DAGs), the backdoor criterion is inapplicable due to an unobserved confounder.
The more generic DAG is shown in Figure 3.19. The intuition behind the front-door adjustment can be broken into the following three steps:

1. Identify the causal impact of the treatment ($T$) on the mediator ($M$). Since there are no unblocked backdoor paths from $T$ to $M$, we can write:

$$P(m \mid do(t)) = P(m \mid t)$$

2. Identify the causal impact of the mediator ($M$) on the outcome ($Y$). Since there is a backdoor path from $M$ to $Y$ through $T$ (via the unobserved confounder), we can use the backdoor criterion by conditioning on $T$:

$$P(y \mid do(m)) = \sum_{t'} P(y \mid m, t')\,P(t')$$

3. Combine the two previous steps to identify the causal impact of the treatment ($T$) on the outcome ($Y$):

$$P(y \mid do(t)) = \sum_{m} P(m \mid t) \sum_{t'} P(y \mid m, t')\,P(t')$$

The above equation is called the front-door adjustment. A set of variables $M$ satisfies the front-door criterion if:

- $M$ mediates all of the effect of $T$ on $Y$ (every directed path from $T$ to $Y$ passes through $M$).
- There is no unblocked backdoor path from $T$ to $M$.
- All backdoor paths from $M$ to $Y$ are blocked by $T$.
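The front-door formula can likewise be evaluated directly from data. Below is a minimal sketch on simulated binary data in which the confounder is deliberately hidden from the estimator (all names and parameters are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200_000
u = rng.binomial(1, 0.5, n)                   # unobserved confounder (genotype)
t = rng.binomial(1, 0.2 + 0.6 * u)            # treatment (smoking)
m = rng.binomial(1, 0.1 + 0.7 * t)            # mediator (tar)
y = rng.binomial(1, 0.1 + 0.4 * m + 0.3 * u)  # outcome (cancer)
df = pd.DataFrame({"T": t, "M": m, "Y": y})   # note: U is NOT observed

def p_y1_do_t(df, t_val):
    """Front-door adjustment: sum_m P(m|t) * sum_t' P(Y=1|m,t') P(t')."""
    p_t = df["T"].value_counts(normalize=True)
    total = 0.0
    for m_val, p_m_given_t in (
        df.loc[df["T"] == t_val, "M"].value_counts(normalize=True).items()
    ):
        inner = sum(
            df.loc[(df["M"] == m_val) & (df["T"] == tp), "Y"].mean() * p_t[tp]
            for tp in (0, 1)
        )
        total += p_m_given_t * inner
    return total

ate = p_y1_do_t(df, 1) - p_y1_do_t(df, 0)
print(f"front-door ATE ≈ {ate:.3f} (true 0.4 * 0.7 = 0.28)")
```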
3.4.4 Instrumental Variable Analysis
In situations where specific variables affect the treatment variable(s) but do not directly influence the outcome variable, identification can be achieved through instrumental variable analysis. For instance, consider the example of three variables, namely smoking, cigarette prices, and cancer, as shown in Figure 3.20. It is apparent that cigarette prices affect whether an individual smokes but do not directly impact the likelihood of developing cancer. Such variables that affect the treatment variable(s) but not the outcome variable are referred to as instrumental variables, as described by Pearl (Pearl 2010).
The role of instrumental variables in the identification process is to help address the problem of endogeneity, which occurs when a variable of interest is correlated with the error term in a regression model. This correlation leads to biased and inconsistent estimates of the treatment effect, making it difficult to establish a causal relationship between the treatment and the outcome variable. By using an instrumental variable that is uncorrelated with the error term and affects the treatment but not the outcome variable, the IV analysis can isolate the causal effect of the treatment on the outcome variable.
Thus, when generalized, as shown in Figure 3.21, the identification process is a two-stage process (two-stage least squares). The first stage measures the effect of the instrumental variable $Z$ on the treatment $T$ using regression:

$$T = \alpha_0 + \alpha_1 Z + \epsilon_T$$

The fitted values $\hat{T}$ capture only the variation in the treatment that is induced by the instrument. The second stage regresses the outcome on these fitted values:

$$Y = \beta_0 + \beta_1 \hat{T} + \epsilon_Y$$

where $\beta_1$ is the estimate of the causal effect of $T$ on $Y$.
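Here is a minimal two-stage least squares sketch using scikit-learn on simulated data (the instrument strength and coefficients are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 100_000
u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # instrument (e.g., cigarette price)
t = 0.8 * z + u + rng.normal(size=n)         # treatment affected by Z and U
y = 2.0 * t + 3.0 * u + rng.normal(size=n)   # outcome; true effect of T is 2.0

# Stage 1: regress T on Z and keep the fitted values T_hat
stage1 = LinearRegression().fit(z.reshape(-1, 1), t)
t_hat = stage1.predict(z.reshape(-1, 1))

# Stage 2: regress Y on T_hat; the slope is the IV estimate of the effect
stage2 = LinearRegression().fit(t_hat.reshape(-1, 1), y)
naive = LinearRegression().fit(t.reshape(-1, 1), y).coef_[0]
print(f"naive OLS: {naive:.2f}, 2SLS estimate: {stage2.coef_[0]:.2f} (true 2.0)")
```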
3.4.5 Regression Discontinuity
The regression discontinuity approach is a regression-based technique that is well-suited for identification with real-valued outcomes, particularly in scenarios where treatment assignment involves thresholds or cut-offs. This method is commonly applied in cases where treatment is provided when a running variable surpasses a certain threshold but not when it falls below it (Imbens and Lemieux 2008). The difference in outcomes just above and just below the threshold can be used for causal estimation. Examples include receiving scholarships and their implications on admissions/SAT scores, or receiving a specific medicine dosage for patients above a certain cut-off of diabetes or cholesterol measurements. Figure 3.22 shows an example with student GPA as the outcome ($Y$): the discontinuity between the regression fits on either side of the cutoff estimates the local treatment effect.
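A minimal sketch of the idea, fitting separate linear regressions on each side of an assumed cutoff and reading off the jump (data and parameters are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 20_000
x = rng.uniform(0, 4, n)            # running variable (e.g., entrance score)
cutoff = 2.0
t = (x >= cutoff).astype(float)     # treatment assigned above the cutoff
y = 1.0 + 0.5 * x + 0.7 * t + rng.normal(scale=0.3, size=n)  # true jump 0.7

# Fit separate regressions below and above the cutoff
below = x < cutoff
left = LinearRegression().fit(x[below].reshape(-1, 1), y[below])
right = LinearRegression().fit(x[~below].reshape(-1, 1), y[~below])

# The treatment effect at the cutoff is the jump between the two fits
effect = right.predict([[cutoff]])[0] - left.predict([[cutoff]])[0]
print(f"estimated discontinuity: {effect:.2f} (true 0.7)")
```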
3.4.6 Difference-in-Differences
The Difference-in-Differences (DID) approach is a regression-based method that effectively identifies real-valued outcomes when measured over time, as highlighted in (Lechner et al. 2011). Specifically, the DID approach allows for estimating the treatment effect by comparing the differences in outcomes over time between the treatment and control groups. This method is often applied at a particular time and enables estimating the treatment effect using regression analysis, wherein the significant differences between the treatment and control groups can be effectively captured.
To make things concrete, let us consider a simple use case with a binary treatment ($T$), a treated group and a control group, and outcomes measured before and after the intervention. The DID estimate of the treatment effect is:

$$\hat{\tau} = \left(\bar{Y}^{\,treated}_{after} - \bar{Y}^{\,treated}_{before}\right) - \left(\bar{Y}^{\,control}_{after} - \bar{Y}^{\,control}_{before}\right)$$

which, under the parallel-trends assumption, removes both time-invariant group differences and group-invariant time trends.
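A minimal sketch of the 2×2 DID computation on simulated panel-style data (the group effect, time trend, and true effect are assumed for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 10_000
group = rng.binomial(1, 0.5, n)                   # 1 = treated group
rows = []
for after in (0, 1):
    # group effect 2.0, common time trend 1.0, true treatment effect 1.5
    y = (2.0 * group + 1.0 * after
         + 1.5 * group * after + rng.normal(size=n))
    rows.append(pd.DataFrame({"group": group, "after": after, "y": y}))
df = pd.concat(rows, ignore_index=True)

means = df.groupby(["group", "after"])["y"].mean()
did = (means[1, 1] - means[1, 0]) - (means[0, 1] - means[0, 0])
print(f"difference-in-differences estimate: {did:.2f} (true 1.5)")
```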
3.4.7 Pearl’s do-calculus
If a query $Q$, such as $P(y \mid do(t))$, can be reduced by repeated application of a set of syntactic rules to an equivalent expression containing no do-operators, the query is identifiable from observational data. Pearl's do-calculus provides precisely such a set of rules.

Consider a causal Directed Acyclic Graph (DAG) $G$. Let $G_{\overline{T}}$ denote the graph obtained by deleting all edges pointing into the nodes $T$, and $G_{\underline{T}}$ the graph obtained by deleting all edges emerging from $T$.

The following three rules apply to all interventional distributions that align with the structure of $G$.

- Rule 1 (Insertion/deletion of observations):

Per Rule 1, any observational node that fails to influence the outcome through a given path, or is d-separated from the outcome, can be safely disregarded.

The following is the formal definition:

$$P(y \mid do(t), z, w) = P(y \mid do(t), w) \quad \text{if } (Y \perp Z \mid T, W)_{G_{\overline{T}}}$$

- Rule 2 (Action/observation exchange):

In the context of a randomized controlled trial, researchers can either assign the treatment ($do(z)$) or simply observe it ($z$); Rule 2 states when the two are interchangeable.

Per Rule 2, interventions, represented by $do(z)$, can be exchanged with observations (conditioning on $z$) whenever all backdoor paths from $Z$ to the outcome are blocked:

$$P(y \mid do(t), do(z), w) = P(y \mid do(t), z, w) \quad \text{if } (Y \perp Z \mid T, W)_{G_{\overline{T}\,\underline{Z}}}$$

- Rule 3 (Insertion/deletion of actions): Rule 3 states that if an intervention (a $do(z)$ expression) does not influence the outcome through any uncontrolled path, it can be disregarded. Specifically, we can eliminate $do(z)$ if no causal association (no unblocked causal paths) runs from $Z$ to $Y$:

$$P(y \mid do(t), do(z), w) = P(y \mid do(t), w) \quad \text{if } (Y \perp Z \mid T, W)_{G_{\overline{T}\,\overline{Z(W)}}}$$

where $Z(W)$ is the set of nodes in $Z$ that are not ancestors of any node in $W$ within $G_{\overline{T}}$.
Both the front-door and backdoor adjustment formulae can be derived using solely the do-calculus. It has been established that the do-calculus is complete, i.e., it can identify all causal estimands that are identifiable (Shpitser and Pearl 2006). This theorem implies that if the repeated application of these three rules cannot eliminate the do-operations, the query $Q$ is not identifiable.
3.5 Estimation
This section will discuss various methods to compute estimation from statistical estimands.
There are two broad types of estimation methods:
Covariate Adjustment Methods:
Covariate adjustment techniques involve utilizing the covariates or features (the adjustment set $W$) to model the outcome directly, estimating $E[Y \mid t, w]$ and averaging it over the covariate distribution. Some of the techniques are:
- COM Estimator
- GCOM Estimator
- X-Learner
- TARNet
- Matching
- Doubly Robust Learners
Propensity Score Methods:
In these methods, a propensity score is defined as the conditional probability of treatment assignment given a set of observed covariates $W$, i.e., $e(w) = P(T=1 \mid W=w)$.

Some of the techniques are:

1. Propensity Score Matching
2. Propensity Score Stratification
3. Inverse Propensity Score Weighting
3.5.0.1 Conditional Outcome Modeling Estimator (COM Estimator or S-Learner)
As discussed in Chapter 2, the individualized treatment effect (ITE) is fundamentally unknowable; hence, large randomized experiments allow us to measure the average treatment effect (ATE). The individualized treatment effect $\tau_i = Y_i(1) - Y_i(0)$ can never be observed directly, since only one potential outcome is realized for each individual.

The ATE is given by:

$$\tau = E[Y(1) - Y(0)] = E_W\big[E[Y \mid T=1, W] - E[Y \mid T=0, W]\big]$$

To compute the statistical estimand, a machine learning model (for example, a regression model) can be used to fit a single conditional outcome model $\hat{\mu}(t, w) \approx E[Y \mid t, w]$, with the treatment included as just another input feature (hence the alternative name S-learner, for "single" learner).

Thus, the ATE using the COM estimator is denoted by:

$$\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n}\big[\hat{\mu}(1, w_i) - \hat{\mu}(0, w_i)\big]$$

Now, CATE estimation uses both the adjustment set $W$ and the individual's feature variables $X$, fitting $\hat{\mu}(t, w, x)$ and reporting effects conditional on $X = x$.

Thus, the ITE (which is primarily the measure we want) can be approximated by the difference of the model's predictions with the treatment toggled:

$$\hat{\tau}_i = \hat{\mu}(1, w_i) - \hat{\mu}(0, w_i)$$
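A minimal S-learner sketch with scikit-learn (simulated data; the model choice and coefficients are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(6)
n = 20_000
w = rng.normal(size=(n, 3))                      # covariates / adjustment set
t = rng.binomial(1, 1 / (1 + np.exp(-w[:, 0])))  # confounded treatment
y = w @ [1.0, 0.5, -0.5] + 2.0 * t + rng.normal(size=n)  # true ATE = 2.0

# S-learner: one model with the treatment as an extra input feature
mu = GradientBoostingRegressor().fit(np.column_stack([w, t]), y)

# Toggle the treatment column to predict both potential outcomes
mu1 = mu.predict(np.column_stack([w, np.ones(n)]))
mu0 = mu.predict(np.column_stack([w, np.zeros(n)]))
ite_hat = mu1 - mu0                              # per-individual estimates
print(f"S-learner ATE estimate: {ite_hat.mean():.2f} (true 2.0)")
```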
3.5.1 Grouped Conditional Outcome Modeling Estimator (GCOM Estimator)
In most cases, the treatment $T$ is a single feature among many, so a single COM model may underweight or even ignore it, biasing the estimated ITE toward zero. The grouped conditional outcome modeling (GCOM, or T-learner) approach addresses this by fitting two separate models: $\hat{\mu}_1(w)$ trained only on the treated units and $\hat{\mu}_0(w)$ trained only on the control units. The ITE is then approximated as $\hat{\tau}_i = \hat{\mu}_1(w_i) - \hat{\mu}_0(w_i)$. The drawback is that each model sees only a subset of the data, which can increase variance.
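The corresponding T-learner sketch differs only in fitting one model per group (same illustrative setup as above):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 20_000
w = rng.normal(size=(n, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-w[:, 0])))
y = w @ [1.0, 0.5, -0.5] + 2.0 * t + rng.normal(size=n)  # true ATE = 2.0

# T-learner: one outcome model per treatment group
mu1 = GradientBoostingRegressor().fit(w[t == 1], y[t == 1])
mu0 = GradientBoostingRegressor().fit(w[t == 0], y[t == 0])

ite_hat = mu1.predict(w) - mu0.predict(w)
print(f"T-learner ATE estimate: {ite_hat.mean():.2f} (true 2.0)")
```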
3.5.2 TARNet
COM estimators combine the treatment $T$ with the covariates in a single model, risking the treatment being ignored, while GCOM estimators split the data between two models, losing statistical efficiency. TARNet (Shalit, Johansson, and Sontag 2017) strikes a balance: a neural network first learns a shared representation of the covariates using all of the data and then branches into two treatment-specific heads that predict the potential outcomes.
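A minimal PyTorch sketch of the TARNet architecture as described above (the layer sizes and training loop are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class TARNet(nn.Module):
    """Shared covariate representation with two treatment-specific heads."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head0 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))  # control outcome
        self.head1 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))  # treated outcome

    def forward(self, x):
        phi = self.shared(x)
        return self.head0(phi).squeeze(-1), self.head1(phi).squeeze(-1)

def train_step(model, opt, x, t, y):
    """Each unit contributes a loss only through the head matching its
    observed treatment, while the shared layers learn from all units."""
    y0_hat, y1_hat = model(x)
    y_hat = torch.where(t.bool(), y1_hat, y0_hat)
    loss = nn.functional.mse_loss(y_hat, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

After training, the ITE is estimated as the difference between the two heads' predictions for the same covariates.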
3.5.3 X-Learner
Künzel et al. proposed a meta-learner, the X-learner, to overcome the limitations of grouped conditional outcome modeling (GCOM) estimators. The GCOM approach falls short in its failure to utilize the complete dataset for estimating the Conditional Average Treatment Effect (CATE). In contrast, the X-learner uses all available data for both models that comprise the estimator, particularly in scenarios involving binary treatment variables, as detailed in (Künzel et al. 2019).
X-learner has the following three stages:
Step 1

Assume the same setup as the T-learner: fit an outcome model $\hat{\mu}_1(w)$ on the treated units and $\hat{\mu}_0(w)$ on the control units.

Step 2(a)

In the first part, an imputed ITE is computed for each unit by contrasting its observed outcome with the prediction of the opposite group's model: for the treatment group, $\tilde{\tau}^{1}_i = y_i - \hat{\mu}_0(w_i)$, and for the control group, $\tilde{\tau}^{0}_i = \hat{\mu}_1(w_i) - y_i$.

Step 2(b)

In this step, a supervised machine learning algorithm like regression can be used to fit a model $\hat{\tau}_1(w)$ to the imputed effects of the treated units and a model $\hat{\tau}_0(w)$ to the imputed effects of the control units.

Step 3

The two estimators are combined using a weighting function $g(w) \in [0, 1]$, often chosen to be the propensity score:

$$\hat{\tau}(w) = g(w)\,\hat{\tau}_0(w) + \big(1 - g(w)\big)\,\hat{\tau}_1(w)$$
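A minimal sketch of the three stages with scikit-learn, using the propensity score as the weighting function (simulated data with a heterogeneous effect; all modeling choices are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 20_000
w = rng.normal(size=(n, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-w[:, 0])))
y = w @ [1.0, 0.5, -0.5] + (2.0 + w[:, 1]) * t + rng.normal(size=n)

# Step 1: T-learner style outcome models
mu1 = GradientBoostingRegressor().fit(w[t == 1], y[t == 1])
mu0 = GradientBoostingRegressor().fit(w[t == 0], y[t == 0])

# Step 2(a): impute individual treatment effects in each group
d1 = y[t == 1] - mu0.predict(w[t == 1])  # treated: observed minus predicted control
d0 = mu1.predict(w[t == 0]) - y[t == 0]  # control: predicted treated minus observed

# Step 2(b): model the imputed effects as functions of the covariates
tau1 = GradientBoostingRegressor().fit(w[t == 1], d1)
tau0 = GradientBoostingRegressor().fit(w[t == 0], d0)

# Step 3: combine with the propensity score as the weighting function
g = LogisticRegression().fit(w, t).predict_proba(w)[:, 1]
cate = g * tau0.predict(w) + (1 - g) * tau1.predict(w)
print(f"X-learner ATE estimate: {cate.mean():.2f} (true 2.0)")
```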
3.5.4 Matching
Matching is a relatively straightforward estimation technique wherein individuals from the treated and control groups are matched based on their covariates or confounders $W$; the matched counterpart's outcome then serves as an estimate of the individual's counterfactual outcome.

Figure 3.25 shows a simplified view with two covariate dimensions, where each treated individual is linked to its nearest control-group neighbor.

Formally, the following procedure is followed for 1-NN matching:

1. Define a similarity or a distance metric $d(w_i, w_j)$ over the covariates.
2. For each individual $i$, define $j(i) = \arg\min_{j:\, t_j = 1 - t_i} d(w_i, w_j)$, so that we find the closest counterfactual match, i.e., another individual with the opposite treatment.
3. Thus, every individual's ITE can be computed using the actual outcome and the potential counterfactual outcome obtained from the match above. The treated and control cases can be combined into a single notation:

$$\hat{\tau}_i = (2t_i - 1)\big(y_i - y_{j(i)}\big)$$

4. Thus, the ATE can be computed using the average across all the individuals:

$$\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n} \hat{\tau}_i$$
Consequently, calculating the average treatment effect across the matched groups enables the causal effect estimation since the confounders are similar within these groups, and any differences are attributed solely to the treatment. This simplistic method works particularly well when the number of confounders is limited. However, as the number of dimensions or confounders increases, the method may suffer from the curse of dimensionality. Despite this drawback, the matching technique is easily interpretable by domain experts, although it heavily relies on the underlying metric of distance or similarity.
Notably, it has been demonstrated that the matching algorithm employing the 1-Nearest Neighbor (1-NN) method is equivalent to the covariate adjustment method, which facilitates relating theoretical properties based on this method.
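A minimal 1-NN matching sketch with scikit-learn's NearestNeighbors (simulated data; the distance metric defaults to Euclidean):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(9)
n = 5_000
w = rng.normal(size=(n, 2))                          # two confounders
t = rng.binomial(1, 1 / (1 + np.exp(-w[:, 0])))
y = w @ [1.0, -1.0] + 2.0 * t + rng.normal(size=n)   # true ATE = 2.0

treated, control = w[t == 1], w[t == 0]
y_t, y_c = y[t == 1], y[t == 0]

# For each unit, find the nearest neighbor in the opposite group
nn_c = NearestNeighbors(n_neighbors=1).fit(control)
nn_t = NearestNeighbors(n_neighbors=1).fit(treated)
match_for_treated = nn_c.kneighbors(treated, return_distance=False).ravel()
match_for_control = nn_t.kneighbors(control, return_distance=False).ravel()

# ITE: observed outcome contrasted with the matched counterfactual outcome
ite = np.concatenate([
    y_t - y_c[match_for_treated],   # treated units
    y_t[match_for_control] - y_c,   # control units
])
print(f"1-NN matching ATE estimate: {ite.mean():.2f} (true 2.0)")
```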
3.5.5 Doubly Robust Estimator
Conditional outcome modeling (COM/GCOM) relies on correctly specifying the outcome model $\hat{\mu}(t, w)$, while propensity score methods rely on correctly specifying the propensity model $\hat{e}(w) = \hat{P}(T=1 \mid W=w)$. Doubly robust estimators (also known as augmented inverse propensity weighting, AIPW) combine both models:

$$\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{\mu}(1, w_i) - \hat{\mu}(0, w_i) + \frac{t_i\big(y_i - \hat{\mu}(1, w_i)\big)}{\hat{e}(w_i)} - \frac{(1 - t_i)\big(y_i - \hat{\mu}(0, w_i)\big)}{1 - \hat{e}(w_i)}\right]$$

The doubly robust method has the property of being a consistent estimator of the ATE if either the outcome model or the propensity model (not necessarily both) is correctly specified.
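A minimal doubly robust (AIPW) sketch combining the two nuisance models (simulated data; model choices are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(10)
n = 20_000
w = rng.normal(size=(n, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-w[:, 0])))
y = w @ [1.0, 0.5, -0.5] + 2.0 * t + rng.normal(size=n)  # true ATE = 2.0

# Outcome models (T-learner style) and propensity model
mu1 = GradientBoostingRegressor().fit(w[t == 1], y[t == 1]).predict(w)
mu0 = GradientBoostingRegressor().fit(w[t == 0], y[t == 0]).predict(w)
e = LogisticRegression().fit(w, t).predict_proba(w)[:, 1].clip(0.01, 0.99)

# AIPW: outcome-model estimate plus inverse-propensity-weighted residuals
aipw = (mu1 - mu0
        + t * (y - mu1) / e
        - (1 - t) * (y - mu0) / (1 - e))
print(f"doubly robust ATE estimate: {aipw.mean():.2f} (true 2.0)")
```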
3.5.6 Double Machine Learning
Double machine learning estimators, as the name suggests, use machine learning to learn estimators in two stages to "partial out" the confounders (and other covariates) $W$ from both the treatment and the outcome:

- Stage 1:
  1. Fit a machine learning model to predict the outcome from the covariates, and keep the residuals $\tilde{Y} = Y - \hat{f}(W)$.
  2. Fit a machine learning model to predict the treatment from the covariates, and keep the residuals $\tilde{T} = T - \hat{g}(W)$.
- Stage 2: Partial out the confounding effect by fitting another model to predict $\tilde{Y}$ from $\tilde{T}$; the coefficient of this final regression is the estimate of the causal effect.
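A minimal partialling-out sketch on simulated data (note that a full double ML implementation would also use cross-fitting; this sketch omits it for brevity):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(11)
n = 20_000
w = rng.normal(size=(n, 3))
t = w[:, 0] + rng.normal(size=n)                          # continuous treatment
y = w @ [1.0, 0.5, -0.5] + 2.0 * t + rng.normal(size=n)   # true effect 2.0

# Stage 1: predict Y and T from the covariates, keep the residuals
y_res = y - GradientBoostingRegressor().fit(w, y).predict(w)
t_res = t - GradientBoostingRegressor().fit(w, t).predict(w)

# Stage 2: regress the outcome residuals on the treatment residuals
final = LinearRegression().fit(t_res.reshape(-1, 1), y_res)
print(f"double ML effect estimate: {final.coef_[0]:.2f} (true 2.0)")
```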
3.5.7 Causal Trees and Causal Forests
Causal trees are similar to classification/regression trees, where leaf nodes, similar to the decision trees, are outcome variables, but the internal nodes are only limited to covariates and do not include the treatment (Wager and Athey 2018). The general algorithm is:
1. First, the observational data is divided into a train ($S^{tr}$) and a test ($S^{te}$) set. The train set is used for building the tree, and the test set is used for estimation.
2. A greedy algorithm creates the splits like a regular decision tree. However, the goal of creating the partition using the covariates is slightly different in causal trees compared to standard decision trees: the purpose of a split is to find the covariate that best separates the data into nodes where the treated group has a different outcome than the control group. The Kullback-Leibler divergence is one of the techniques used to measure the divergence between the outcome class distributions. If there are $k$ outcomes, and $P = (p_1, \dots, p_k)$ and $Q = (q_1, \dots, q_k)$ are the outcome distributions in the treated and control groups, respectively, the KL divergence between the two is given by:

$$D_{KL}(P \,\|\, Q) = \sum_{j=1}^{k} p_j \log \frac{p_j}{q_j}$$

For a covariate that splits a node into children nodes, a conditional divergence gain can be computed by weighting each child's divergence by its share of the node's instances and comparing the result against the parent node's divergence.
3. Once the tree is fully constructed, the test set $S^{te}$ is used to estimate the treatment effects at the leaf nodes.
Causal Forests are an extension of the idea of causal trees for estimating treatment effects. Given a training set, many causal trees are built on bootstrapped (or sub-sampled) portions of the data, with separate subsamples used for constructing the splits and for estimating the leaf effects ("honest" estimation), and the estimated treatment effects are averaged across the trees (Wager and Athey 2018).
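A minimal sketch of the honest estimation idea, with an ordinary regression tree standing in for a true causal-tree splitting criterion (the data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
n = 20_000
w = rng.normal(size=(n, 2))
t = rng.binomial(1, 0.5, n)        # randomized treatment for simplicity
y = w[:, 0] + (1.0 + 2.0 * (w[:, 1] > 0)) * t + rng.normal(size=n)

# Honest split: one half builds the tree, the other half estimates effects
w_tr, w_te, t_tr, t_te, y_tr, y_te = train_test_split(w, t, y, test_size=0.5)

# Stand-in splitting step: partition the covariate space with a shallow tree
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=200).fit(w_tr, y_tr)

# Estimate the treatment effect within each leaf using the held-out set
leaves = tree.apply(w_te)
for leaf in np.unique(leaves):
    mask = leaves == leaf
    effect = y_te[mask & (t_te == 1)].mean() - y_te[mask & (t_te == 0)].mean()
    print(f"leaf {leaf}: estimated effect {effect:.2f} ({mask.sum()} units)")
```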
3.5.8 Propensity Score-Based
As previously discussed, unbiased estimation of the average treatment effect can be achieved through a randomized controlled test, wherein individuals are assigned to either the treatment or control group based on a coin flip. The propensity score technique, on the other hand, aims to re-weight the observational data such that it resembles pseudo-randomized control test data (Imbens and Rubin 2015).
Consider a scenario involving a binary treatment and two covariates within an observational dataset, where the data points separate into two regions with opposing covariate distributions. The propensity score method involves re-weighting the samples to modify the distribution so that it closely approximates what a randomized assignment would have produced.
The propensity score is the probability of being subjected to the treatment given the observed covariates:

$$e(w) = P(T = 1 \mid W = w)$$

Propensity score theorem: Given the positivity assumption, unconfoundedness given the adjustment set $W$ implies unconfoundedness given the scalar propensity score, i.e., $(Y(1), Y(0)) \perp T \mid W$ implies $(Y(1), Y(0)) \perp T \mid e(W)$.

Inverse Propensity Weighting (IPW) Estimator: Given the data $(w_i, t_i, y_i)$ and an estimated propensity score $\hat{e}(w)$, the ATE is estimated by weighting each unit by the inverse of the probability of the treatment it actually received:

$$\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{t_i\, y_i}{\hat{e}(w_i)} - \frac{(1 - t_i)\, y_i}{1 - \hat{e}(w_i)}\right]$$
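A minimal IPW sketch (simulated data; the clipping threshold is an illustrative guard for the positivity assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(13)
n = 50_000
w = rng.normal(size=(n, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-w[:, 0])))
y = w @ [1.0, 0.5, -0.5] + 2.0 * t + rng.normal(size=n)  # true ATE = 2.0

# Estimate propensity scores and clip away extreme weights (positivity)
e = LogisticRegression().fit(w, t).predict_proba(w)[:, 1].clip(0.01, 0.99)

ipw = t * y / e - (1 - t) * y / (1 - e)
print(f"IPW ATE estimate: {ipw.mean():.2f} (true 2.0)")
```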
3.5.9 Propensity Score Matching
The Propensity Score Matching (PSM) algorithm is a methodology that emulates a Randomized Controlled Trial (RCT) by contrasting outcomes between treated and untreated cohorts within a sample that has been matched on the propensity score.
However, the implementation of PSM necessitates careful consideration of certain caveats:
The first caveat, termed ‘Common Support’, necessitates that the distribution of propensity for treatment is analogous or identical across both treated and untreated cases.
The second caveat demands the exclusive utilization of baseline attributes unaffected by the intervention during the Matching process.
Thirdly, potential confounding variables must be both observable and without any hidden variables. A failure in this respect could result in biased estimates.
Finally, the fourth caveat advises matching the most pertinent characteristics rather than indiscriminately incorporating every variable into the equation.
The PSM process involves several key steps, as outlined by Jalan and Ravallion (Jalan and Ravallion 2003):
- The calculation of the Propensity Score for all units.
- The matching of treatment cohorts with control cohorts is performed following a predetermined matching strategy; for instance, a strategy could involve using the nearest neighbor method between the treated and control groups, implemented without replacement.
- The evaluation of covariate balance. In the event of an imbalance, revisiting the first and second steps and incorporating alternative specifications is advisable.
- The computation of the average outcome difference between the treatment and control groups.
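A minimal sketch of steps 1, 2, and 4 (matching with replacement here for simplicity, whereas the strategy described above matches without replacement; the balance check of step 3 is omitted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(14)
n = 20_000
w = rng.normal(size=(n, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-w[:, 0])))
y = w @ [1.0, 0.5, -0.5] + 2.0 * t + rng.normal(size=n)  # true effect 2.0

# Step 1: propensity scores for all units
e = LogisticRegression().fit(w, t).predict_proba(w)[:, 1]

# Step 2: nearest-neighbor matching on the propensity score
e_t, e_c = e[t == 1].reshape(-1, 1), e[t == 0].reshape(-1, 1)
nn = NearestNeighbors(n_neighbors=1).fit(e_c)
matches = nn.kneighbors(e_t, return_distance=False).ravel()

# Step 4: average outcome difference between treated units and their matches
att = (y[t == 1] - y[t == 0][matches]).mean()
print(f"PSM ATT estimate: {att:.2f} (true 2.0)")
```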
3.5.10 Propensity Score Stratification
King and Nielsen propose that Propensity Score Matching (PSM) is designed to replicate a fully randomized experiment instead of a blocked randomized one. They further discuss that the exact matching procedure in PSM exacerbates issues such as imbalance, inefficiency, model dependence, and bias while also being unable to effectively mitigate the imbalance (King and Nielsen 2019).
Propensity Score (PS) stratification serves as a balancing mechanism, ensuring that the distribution of observed covariates appears comparable between treated and control groups when conditioned on the PS (Austin 2011). As a result, it facilitates adjusting imbalances in the covariates by modifying the score accordingly.
The specific steps to execute for PS stratification are:
- Calculate the Propensity Score (PS) using logistic regression.
- Establish mutually exclusive strata based on the estimated PS.
- Group the treated and control units into each stratum.
- Calculate the difference in means between the treated and control groups within each stratum.
- Weight the within-stratum means to achieve the target estimate.
In the second step of the process, studies have shown that approximately 90% of the bias inherent in the unadjusted estimate can be removed using five strata (Rosenbaum and Rubin 1984). However, the idea that increasing the number of strata beyond this point would lead to a further decrease in bias is not empirically supported. Indeed, simulation studies have indicated that the most favorable outcomes are achieved with between 5 and 10 strata, with different strata beyond this range contributing only minor improvements (Neuhäuser, Thielmann, and Ruxton 2018). It is also essential to consider the practical implications of increasing the number of strata. As the number of strata increases, the number of data points available within each stratum decreases.
During the fifth step of the process, Propensity Score (PS) stratification enables the calculation of both the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATT), contingent on the weighting method utilized for the means. For the estimation of the ATE, the weighting is determined by the number of units within each stratum. On the other hand, the estimation of the ATT involves assigning weights according to the count of treated units present in each stratum (Imbens 2004).
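A minimal sketch of the stratification steps with five quantile-based strata and ATE weighting by stratum size (simulated data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(15)
n = 50_000
w = rng.normal(size=(n, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-w[:, 0])))
y = w @ [1.0, 0.5, -0.5] + 2.0 * t + rng.normal(size=n)  # true ATE = 2.0

df = pd.DataFrame({"T": t, "Y": y})
df["e"] = LogisticRegression().fit(w, t).predict_proba(w)[:, 1]
df["stratum"] = pd.qcut(df["e"], q=5, labels=False)      # five PS strata

ate = 0.0
for _, s in df.groupby("stratum"):
    diff = s.loc[s["T"] == 1, "Y"].mean() - s.loc[s["T"] == 0, "Y"].mean()
    ate += diff * len(s) / len(df)   # ATE weighting: stratum size
print(f"PS stratification ATE estimate: {ate:.2f} (true 2.0)")
```

For the ATT, the weights would instead be the number of treated units in each stratum, as described above.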
3.6 Evaluation and Validation Techniques
Evaluating and validating causal models differs from evaluating traditional machine learning models, where techniques such as cross-validation and test-set evaluations are performed. This section will highlight some standard metrics and methodologies used to evaluate causal models.
3.6.1 Evaluation Metrics
The two broad categories for the evaluation metrics are based on whether the subpopulation is homogeneous or heterogeneous and are:
- Standard Causal Effect Estimation
- Heterogeneous Effect Estimation
3.6.1.1 Standard Causal Effect Estimation
Assuming the potential outcomes to be real-valued, if there are $n$ datasets or experiments with ground-truth average treatment effects $\tau_i$ and corresponding estimates $\hat{\tau}_i$, the standard metrics are:

- Mean Squared Error of Average Treatment Effect:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\tau_i - \hat{\tau}_i\right)^2$$

- Root Mean Squared Error of Average Treatment Effect:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\tau_i - \hat{\tau}_i\right)^2}$$

- Mean Absolute Error of Average Treatment Effect:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\tau_i - \hat{\tau}_i\right|$$
3.6.1.2 Heterogeneous Effect Estimation
- Uplift Curve
Uplift modeling aims to identify the effect of an intervention on a particular individual rather than a population, especially in the case of heterogeneity (Gutierrez and Gérardy 2017). Thus, uplift modeling attempts to estimate the ITE (or CATE), i.e., the treatment outcome of a given individual and how it would differ in the absence of treatment.
The methodology to generate the uplift curve has parallels to ROC curves in standard machine learning, and the steps are:

1. Use a machine learning method, as discussed for estimation, to generate a CATE score for each individual.
2. Sort the population by the predicted CATE in descending order and divide it into percentile groups (e.g., deciles), computing the uplift gain (the difference in outcome rates between treated and control units, scaled by the group size) cumulatively for each group.
3. The uplift curve is then plotted with the x-axis representing the percentiles of the population and the y-axis representing the cumulative uplift gain corresponding to each group.
The advantage of the uplift curve is that we can select the decile that maximizes the gain as the limit of the population to be targeted next time rather than the whole population.
- Qini Curve
The Qini curve is a variant of the uplift curve in which the Qini score is computed instead of the uplift score. For the top fraction $k$ of the population ranked by predicted uplift, with $Y_t(k)$ and $Y_c(k)$ the counts of positive outcomes and $N_t(k)$ and $N_c(k)$ the numbers of treated and control individuals within that fraction:

$$Q(k) = Y_t(k) - Y_c(k)\,\frac{N_t(k)}{N_c(k)}$$
- Uplift(Qini) Coefficient
Similar to the AUC-ROC curve, one can compute the area under the uplift (Qini) curve; this is referred to as the uplift (Qini) coefficient and is given by the area between the curve and the random-targeting diagonal:

$$\text{coefficient} = \int_0^1 \left[V(k) - V_{\mathrm{rand}}(k)\right] dk$$

where $V(k)$ is the uplift (Qini) curve and $V_{\mathrm{rand}}(k)$ is the straight line corresponding to targeting a random fraction $k$ of the population.
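A minimal sketch of the uplift-curve computation (the scaled-gain definition used here is one common variant; the data and scores are illustrative):

```python
import numpy as np

def uplift_curve(cate_scores, t, y, n_bins=10):
    """Cumulative uplift gain per population percentile, ranked by CATE."""
    order = np.argsort(-cate_scores)      # highest predicted uplift first
    t, y = t[order], y[order]
    fractions, gains = [], []
    for i in range(1, n_bins + 1):
        top = slice(0, int(len(y) * i / n_bins))
        n_t, n_c = t[top].sum(), (1 - t[top]).sum()
        rate_t = y[top][t[top] == 1].mean() if n_t else 0.0
        rate_c = y[top][t[top] == 0].mean() if n_c else 0.0
        gains.append((rate_t - rate_c) * (n_t + n_c))  # scaled uplift gain
        fractions.append(i / n_bins)
    return np.array(fractions), np.array(gains)

# Example with random scores as a naive baseline
rng = np.random.default_rng(16)
t = rng.binomial(1, 0.5, 10_000)
y = rng.binomial(1, 0.1 + 0.1 * t)
frac, gain = uplift_curve(rng.normal(size=10_000), t, y)
print(dict(zip(frac.round(1), gain.round(1))))
```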
3.6.2 Robustness Checks and Refutation Techniques
Several assumptions are made at every step of causal inference, from building the causal model to estimation. Assumptions are made at the modeling level, such as the nonexistence of unobserved variables and the relationships between variables (the edges in the graph). We might make parametric assumptions for deriving the estimand at the identification step. At the estimation step, we might assume linear relationships between the covariates and the treatment, and similarly between the treatment and the outcome. Many of these assumptions can be, and should be, tested for violations. Some assumptions, however, cannot be validated or refuted.
Similar to standard software testing, there is unit or modular testing and integration testing.
3.6.2.1 Unit or Modular Tests
These are tests or validations designed to individually check the assumptions made in the modeling, identification, and estimation steps. Some of the tests are:
Conditional Independence Tests: Using the dependence graph and data to validate various independence assumptions; for example, if the graph implies that two variables ($A$ and $B$) are conditionally independent given the treatment, the corresponding statement (e.g., $A \perp B \mid T$) can be tested directly in the data.

D-Separation Tests: Conditional and marginal independence can be tested between variables in graphs using operations such as moralizing, orienting, and deleting/adding edges.
Bootstrap Sample Validation: Replacing the dataset completely with bootstrapped samples from the graph helps calculate statistically significant changes in the estimand.
Data Subsets Validation: Replacing the given dataset with a randomly selected subset helps to compute changes in the estimands and gauge the impact.
3.6.2.2 Integration or Complete Tests
Integration testing performs comprehensive tests on the entire process, validating many underlying assumptions at once rather than single steps. Some of them are:
Placebo Treatment Refuter: What would be the impact on the outcome if the treatment variable were replaced by a random variable (e.g., Gaussian)? It should have no impact (an estimate of zero) if the assumptions are all correct; a significantly nonzero estimate indicates that some steps must be corrected.
Adding Random Common Cause: Adding an independent random variable as a common cause should keep the estimate the same. This method can be easily tested on the dataset to see the significance of the estimation change.
Dummy Outcome Refuter: The estimated causal effect should be zero if we replace the outcome variable with an independent random variable.
Simulated Outcome Refuter or Synth Validation: If multiple datasets are generated very close to the generation process of the existing dataset and the assumptions made, the estimation effect should remain the same. This technique is also known as the synth validation technique and is one of the most comprehensive tests for process validation.
Adding Unobserved Confounder: A common real-world failure is a confounder missing from the data or the model. By simulating a confounder with a given correlation with both the outcome and the treatment, one can rerun the analysis and observe the difference in the estimation. A significant change illustrates a robustness issue in the process.
3.7 Unconfoundedness: Assumptions, Bounds, and Sensitivity Analysis
Throughout the discussion, we assumed unconfoundedness, i.e., that all confounding is observed, in our inference process. However, Manski showed in his work that the no-unobserved-confounding assumption is unrealistic in the real world (Manski 2003).
In the simplest case, we assume an unobserved confounder $U$ that affects both the treatment and the outcome, so that conditioning on the observed covariates alone does not remove all confounding.

With the simple assumption that the outcome is bounded, we can still derive informative bounds on the causal effect, as the following sections show.
3.7.1 Observational Counterfactual Decomposition
The ATE can be written in terms of observational and counterfactual components, known as the observational-counterfactual decomposition. Let $\pi = P(T=1)$ denote the probability of treatment.

The linearity of expectation gives:

$$E[Y(1) - Y(0)] = \pi\,E[Y(1) \mid T=1] + (1-\pi)\,E[Y(1) \mid T=0] - \pi\,E[Y(0) \mid T=1] - (1-\pi)\,E[Y(0) \mid T=0]$$

By consistency, $E[Y(1) \mid T=1] = E[Y \mid T=1]$ and $E[Y(0) \mid T=0] = E[Y \mid T=0]$. Thus the equation has observational elements ($E[Y \mid T=1]$ and $E[Y \mid T=0]$) that can be estimated from data, and counterfactual elements ($E[Y(1) \mid T=0]$ and $E[Y(0) \mid T=1]$) that cannot; the bounds below are obtained by bounding these counterfactual terms.
3.7.2 Bounds
We will provide an overview of nonparametric bounds and elucidate the process of deriving them.
3.7.2.1 No-Assumption Bounds
The no-assumption bounds are the simplest bounds. Assuming the outcome is bounded, $a \le Y \le b$, the a priori interval for the ATE is $[a - b,\; b - a]$, of length $2(b-a)$. Substituting the extreme values $a$ and $b$ for the unobservable counterfactual terms in the decomposition yields:

$$\pi\,E[Y \mid T=1] + (1-\pi)\,a - \pi\,b - (1-\pi)\,E[Y \mid T=0] \;\le\; E[Y(1) - Y(0)] \;\le\; \pi\,E[Y \mid T=1] + (1-\pi)\,b - \pi\,a - (1-\pi)\,E[Y \mid T=0]$$

Thus, the interval length is:

$$\big[(1-\pi) + \pi\big](b - a) = b - a$$

i.e., the no-assumption bounds always cut the a priori interval of length $2(b-a)$ in half.
3.7.2.2 Nonnegative Monotone Treatment Response Assumption
Assuming that the treatment always helps, i.e., $Y_i(1) \ge Y_i(0)$ for every individual $i$, the ATE is nonnegative, so the lower bound improves to zero while the no-assumption upper bound is retained.

By assuming the reverse, that the treatment never helps, i.e., $Y_i(1) \le Y_i(0)$ for every individual $i$, the upper bound improves to zero while the no-assumption lower bound is retained.
3.7.2.3 Monotone Treatment Selection Assumption
The assumption is that the treatment group's potential outcomes are at least as good as the control group's: $E[Y(1) \mid T=1] \ge E[Y(1) \mid T=0]$ and $E[Y(0) \mid T=1] \ge E[Y(0) \mid T=0]$. Thus, we get an upper bound equal to the observed associational difference:

$$E[Y(1) - Y(0)] \le E[Y \mid T=1] - E[Y \mid T=0]$$
3.7.2.4 Optimal Treatment Selection Assumption
The assumption here is that each individual receives the treatment that is best suited to them, i.e., $T_i = 1$ if $Y_i(1) \ge Y_i(0)$ and $T_i = 0$ otherwise. This implies $E[Y(1) \mid T=0] \le E[Y(0) \mid T=0] = E[Y \mid T=0]$ and $E[Y(0) \mid T=1] \le E[Y(1) \mid T=1] = E[Y \mid T=1]$, and substituting these bounds on the counterfactual terms into the decomposition tightens the no-assumption bounds.
3.7.3 Sensitivity Analysis
Given the presence of observed confounders $W$ and potentially unobserved confounders $U$, sensitivity analysis asks how strong the unobserved confounding would have to be to change the conclusion of the analysis, for example, to drive the estimated effect to zero.

Considering a simple setting with an observed variable $W$, an unobserved confounder $U$, and linear mechanisms for the treatment and the outcome, one can express the bias of the estimated effect as a function of two sensitivity parameters: the strength of $U$'s effect on the treatment and the strength of $U$'s effect on the outcome.

A contour plot over different values of these two sensitivity parameters then shows which combinations would suffice to explain away the estimated effect, allowing us to judge how fragile the conclusion is.

Many researchers, such as Cinelli et al. and Veitch et al., have shown techniques to relax these constraints (linear assumptions, a single unobserved variable) and yet be able to perform sensitivity analyses similar to the simple case (Cinelli and Hazlett 2020; Veitch and Zaveri 2020).
3.8 Case Study
We will go through different steps and processes of causal inference to demonstrate and give a practical hands-on experience with a real-world dataset. The goal is to take various steps highlighted in the chapter using the tools. A version of the Python code used in this case study can be found in this Colab notebook.
3.8.1 Dataset
The Lalonde dataset originates from economist Robert LaLonde's influential evaluation of the National Supported Work Demonstration (NSWD) of the 1970s and has been widely used in research on the evaluation of social programs. The NSWD program was designed to test the effectiveness of a job training and placement program for disadvantaged individuals. The program provided job training, job placement services, and a wage subsidy to disadvantaged individuals (the treatment group), while a control group received no treatment. The goal of the program was to determine whether the treatment had a positive effect on the employment and earnings of the participants. The dataset includes a variety of variables, including demographic characteristics (such as age, education level, and race), employment status, and income.
3.8.2 Tools and Library
We will use DoWhy, a Python library for causal inference, for most of the modeling and analysis. The library includes tools for performing various causal inference tasks, such as identifying the causal effect of a treatment on an outcome variable, estimating the total effect of a treatment on an outcome variable using various interchangeable estimators, and assessing the robustness of causal estimates to assumptions about the data-generating process. In the case study, we use causalml and causallift for further distributional analysis and uplift modeling. Python libraries such as pandas, matplotlib, and scikit-learn are used for data processing, visualization, and machine learning.
3.8.3 Exploratory Data Analysis
Plotting the treated group vs. control group with various variables (age, race, income, education) for understanding the distribution across the two as shown:
One can see that the dataset is not balanced between the treated and the control group. The difference between the treated and control groups is quite evident for variables such as education, age, and hispanic. This may cause issues in many estimation processes; in the propensity-based estimation below, we will highlight how propensity-based techniques change the distribution through weights.
3.8.4 Estimation and Results
3.8.4.1 Identification of Estimand
As discussed, we first identify the estimand, with treat as the treatment, re78 (real earnings in 1978) as the outcome, and the remaining variables as common causes:

```python
from dowhy import CausalModel

model = CausalModel(
    data=lalonde_df,
    treatment='treat',
    outcome='re78',
    common_causes='nodegr+black+hisp+age+educ+married'.split('+')
)
identified_estimand = model.identify_effect()
```
The causal graph showing the relationships between the outcome, treatment, and observed confounders is shown in Figure 3.30.
3.8.4.2 Estimation and Robustness
We have explored many linear, non-linear, propensity-based, and causal tree-based estimators to give the readers a more comprehensive view.
A simple linear regression estimation and its results are shown below:

```python
linear_regression_estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    control_value=0,
    treatment_value=1
)
print(linear_regression_estimate)
```
```
*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
    d
--------(E[re78|age,nodegr,married,educ,hisp,black])
d[treat]

Estimand assumption 1, Unconfoundedness: If U→{treat} and U→re78 then
P(re78|treat,age,nodegr,married,educ,hisp,black,U) =
P(re78|treat,age,nodegr,married,educ,hisp,black)

## Realized estimand
b: re78~treat+age+nodegr+married+educ+hisp+black
Target units: ate

## Estimate
Mean value: 1671.1304316174173
```
As discussed in the exploratory data analysis, the data distribution was not symmetrical between the control and the treated group, so we used the inverse propensity-score weighting technique as one of the estimators.
```python
causal_estimate_ipw = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_weighting",
    target_units="ate",
    method_params={"weighting_scheme": "ips_weight"}
)
print(causal_estimate_ipw)
```
The doWhy library provides interesting interpreting techniques to understand the change in distribution, as shown in the listing.
```python
causal_estimate_ipw.interpret(
    method_name="confounder_distribution_interpreter",
    var_type='discrete',
    var_name='married',
    fig_size=(10, 7),
    font_size=12
)
```
The table below compares the ATE estimates across the estimators.

| Estimator | ATE |
|---|---|
| Naive | 1794.342 |
| Linear Regression | 1671.13 |
| T-Learner | 1693.76 |
| X-Learner | 1763.83 |
| Double Machine Learner | 1408.93 |
| Propensity Score Matching | 1498.55 |
| Propensity Score Stratification | 1838.36 |
| Inverse Propensity Score Weighting | 1639.80 |
3.8.5 Refutation and Validation
Next, we highlight some refutation and validation tests performed on the model, as discussed in the chapter.
3.8.5.1 Removing Random Subset of Data
We choose the causal estimate from inverse propensity weighting to perform the refutation as shown:
```python
res_subset = model.refute_estimate(
    identified_estimand,
    causal_estimate_ipw,
    method_name="data_subset_refuter",
    show_progress_bar=True,
    subset_fraction=0.9
)
```
The difference between the two estimates is around 1% and statistically insignificant (p-value 0.98), as the output shows:

```
Refute: Use a subset of data
Estimated effect:1639.7956658905296
New effect:1656.1009245901791
p value:0.98
```
3.8.5.2 Placebo Treatment
Replacing treatment with a random (placebo) variable as shown:
```python
import numpy as np

res_placebo = model.refute_estimate(
    identified_estimand,
    causal_estimate_ipw,
    method_name="placebo_treatment_refuter",
    show_progress_bar=True,
    placebo_type="permute"
)
```
The output shows the new effect dropping to near zero, as expected for a placebo treatment:

```
Refute: Use a Placebo Treatment
Estimated effect:1639.7956658905296
New effect:-209.15727259572515
p value:0.78
```
The causal estimation through inverse probability weighting can be considered robust based on the p-value.