Double Machine Learning (DML) CI Test
The Double Machine Learning (DML) Conditional Independence test is a modern, robust method for assessing conditional independence, drawing on the framework developed by Chernozhukov et al. (2018). It is designed to provide reliable statistical inference even when the relationships between variables are complex and high-dimensional. DML achieves this by using flexible machine learning models to control for confounding variables (Z) and then performing a statistical test on the “cleaned” or residualized variables. The application of DML has been extended to causal structure learning from observational data under minimal assumptions, allowing for the identification of direct causes and other causal relationships (Soleymani, 2024).
Mathematical Formulation
The core idea of DML is to test for the conditional independence of X and Y given Z by first “partialling out,” or removing, the influence of Z from both X and Y. This is framed as a problem of estimating nuisance functions.
Nuisance Prediction: Two machine learning models are trained to capture the relationships between the conditioning set Z and the variables of interest, X and Y. These relationships are the “nuisance functions”:
\(g(Z) = E[Y | Z]\)
\(f(Z) = E[X | Z]\)
To avoid biases from overfitting, DML employs a crucial technique called cross-fitting (or sample splitting). The data is split into k folds. For each fold, models for \(f\) and \(g\) are trained on the other k-1 folds and then used to predict on the held-out fold. This ensures that the predictions for each data point are generated by a model that was not trained on that point.
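The cross-fitting scheme described above can be sketched with a plain K-fold loop. This is an illustrative sketch using scikit-learn (not citk's internal implementation): for each fold, the nuisance model is fit on the other k-1 folds and predicts only on the held-out fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 1))
X = 2 * Z[:, 0] + rng.normal(size=n)  # toy data: X depends on Z

# Cross-fitting: each point's prediction f_hat comes from a model
# that never saw that point during training.
f_hat = np.empty(n)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(Z):
    model = Ridge().fit(Z[train_idx], X[train_idx])
    f_hat[test_idx] = model.predict(Z[test_idx])
```

Because every prediction is out-of-sample, overfitting in the nuisance model does not translate into spuriously small residuals.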
Residual Computation: Using the out-of-sample predictions \(\hat{f}\) and \(\hat{g}\), residuals are computed for each observation:
\(V = Y - \hat{g}(Z)\)
\(U = X - \hat{f}(Z)\)
These residuals represent the parts of Y and X that are orthogonal to (or unexplained by) Z.
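Given cross-fitted predictions, the residualization step is a one-line subtraction. A minimal sketch using scikit-learn's `cross_val_predict` (the variable names here are illustrative, not citk's API); the residuals U and V should be nearly uncorrelated with Z:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 500
Z = rng.normal(size=(n, 1))
X = 2 * Z[:, 0] + rng.normal(size=n)
Y = 3 * Z[:, 0] + rng.normal(size=n)

# Cross-fitted nuisance predictions: f_hat(Z) ~ E[X|Z], g_hat(Z) ~ E[Y|Z]
f_hat = cross_val_predict(Ridge(), Z, X, cv=5)
g_hat = cross_val_predict(Ridge(), Z, Y, cv=5)

# Residuals: the parts of X and Y left unexplained by Z
U = X - f_hat
V = Y - g_hat
```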
Residual Independence Test: If X and Y are conditionally independent given Z, then their residuals, U and V, should be independent. The final step is to perform a simple independence test on these residuals. This implementation uses a permutation test on the distance correlation of U and V, which is a powerful non-parametric test for dependence.
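The distance-correlation permutation test can be implemented from scratch in a few lines of NumPy. This is a self-contained sketch of the general technique, not citk's internal code; it detects nonlinear dependence that plain Pearson correlation would miss:

```python
import numpy as np

def distance_correlation(u, v):
    """Sample distance correlation between two 1-D arrays."""
    a = np.abs(u[:, None] - u[None, :])  # pairwise distance matrices
    b = np.abs(v[:, None] - v[None, :])
    # Double-center each distance matrix
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def permutation_pvalue(u, v, n_perm=200, seed=0):
    """Fraction of permutations whose dcor reaches the observed value."""
    rng = np.random.default_rng(seed)
    observed = distance_correlation(u, v)
    count = sum(
        distance_correlation(u, rng.permutation(v)) >= observed
        for _ in range(n_perm)
    )
    return (count + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

rng = np.random.default_rng(42)
U = rng.normal(size=200)
V_dep = U**2 + 0.1 * rng.normal(size=200)  # nonlinearly dependent on U
V_ind = rng.normal(size=200)               # independent of U

print(permutation_pvalue(U, V_dep))  # p-value under dependence
print(permutation_pvalue(U, V_ind))  # p-value under independence
```

Because distance correlation is zero only under independence, permuting one residual vector yields a valid null distribution for the test statistic.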
This two-stage procedure makes the final test “doubly robust,” meaning it is less sensitive to estimation errors in the nuisance functions than a more naive approach would be.
Properties and Assumptions
Model Agnostic: Any supervised machine learning model (e.g., Gradient Boosting, Random Forest, Ridge Regression) can be used for the nuisance prediction stage.
Robustness: By using cross-fitting and focusing the final test on residuals, DML provides valid statistical inference even when the nuisance functions are complex and estimated with flexible models.
Assumptions: The primary assumption is that the machine learning models are sufficiently powerful and well-tuned to consistently estimate the underlying nuisance functions. The quality of the test relies on the quality of these models and the cross-fitting procedure.
Code Example
import numpy as np
from citk.tests import DML
# Generate data with a linear relationship: X -> Z -> Y
n = 1000
X = np.random.randn(n)
Z = 2 * X + np.random.randn(n) * 0.5
Y = 3 * Z + np.random.randn(n) * 0.5
data = np.vstack([X, Y, Z]).T
# Initialize the test. A custom model can also be passed, e.g., model=Ridge().
dml_test = DML(data)
# Test for unconditional independence (should be dependent)
p_unconditional = dml_test(0, 1)
print(f"P-value (unconditional) for X _||_ Y: {p_unconditional:.4f}")
# Test for conditional independence given Z (should be independent)
p_conditional = dml_test(0, 1, [2])
print(f"P-value (conditional) for X _||_ Y | Z: {p_conditional:.4f}")
API Reference
For a full list of parameters, see the API documentation: :class:`citk.tests.ml_based_tests.DML`.
References
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1-C68.
Soleymani, A. (2024). Causal Structure Learning through Double Machine Learning. Massachusetts Institute of Technology.