Random Forest CI Test

The Random Forest Conditional Independence (CI) test is a flexible, non-parametric method that leverages the predictive power of ensemble models to assess conditional independence. By measuring the importance of a feature in a predictive task, it can capture complex, non-linear relationships and interactions, making it well-suited for a wide range of data types.

Mathematical Formulation

The test is based on a simple but powerful premise: if a variable X is conditionally independent of a target Y given a set of conditioning variables Z, then X should have no predictive power for Y when the information in Z is already available. The test formalizes this idea by measuring the feature importance of X within a Random Forest model trained to predict Y.
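
As a sketch for the regression case, the null hypothesis and the consequence that the test exploits can be written as:

$$ H_0 : Y \perp X \mid Z \quad \Longrightarrow \quad \mathbb{E}[Y \mid X, Z] = \mathbb{E}[Y \mid Z] $$

Under $H_0$ the conditional mean of Y does not depend on X once Z is known, so shuffling X should not reduce hold-out performance by more than the noise in the fitted model.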

The most robust measure of feature importance for this task is permutation importance (Breiman, 2001; Fisher et al., 2019). The overall procedure to obtain a p-value is as follows:

Train a Predictive Model: A Random Forest model is trained to predict the target variable Y using the predictor X and the conditioning set Z as features.
Calculate Observed Importance: The permutation importance of X is calculated. This is done by first recording the model’s performance (e.g., R² for regression, accuracy for classification) on a hold-out dataset. Then, the values in the column corresponding to feature X are randomly shuffled (permuted), and the model’s performance is measured again. The feature importance is the drop in performance caused by this shuffling, and it serves as the observed test statistic:

$$ \text{Importance}(X) = \text{Performance}_{\text{original}} - \text{Performance}_{\text{permuted}(X)} $$
Generate a Null Distribution: To determine whether the observed importance is statistically significant, we need a distribution of importances under the null hypothesis ($Y \perp X \mid Z$). This is obtained by permuting X, which breaks the association of X with Y (and with Z) while leaving the relationship between Y and Z intact. Specifically, for a number of repetitions:
- The values of feature X are permuted.
- A new Random Forest is trained from scratch on this permuted dataset (predicting Y from the permuted X and original Z).
- The permutation importance of the (permuted) X is calculated for this new model.
- This collection of importance scores forms the null distribution.
Calculate P-Value: The p-value is the proportion of importance scores in the null distribution that are greater than or equal to the originally observed importance statistic.
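
The steps above can be sketched with scikit-learn. This is a minimal illustration for a continuous target, not the citk implementation: the helper names `importance_of_x` and `rf_ci_pvalue`, the 70/30 hold-out split, and the forest settings are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def importance_of_x(x, y, z, seed):
    """Hold-out permutation importance of x when predicting y from [x, z]."""
    features = np.column_stack([x, z])  # column 0 is X, the rest are Z
    f_train, f_test, y_train, y_test = train_test_split(
        features, y, test_size=0.3, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(f_train, y_train)
    # Mean drop in R^2 on the hold-out set when each column is shuffled.
    result = permutation_importance(
        model, f_test, y_test, n_repeats=5, random_state=seed
    )
    return result.importances_mean[0]


def rf_ci_pvalue(x, y, z, num_permutations=19, seed=0):
    """Share of null importances that reach or exceed the observed one."""
    rng = np.random.default_rng(seed)
    observed = importance_of_x(x, y, z, seed)
    # Each null draw permutes X, retrains the forest, and recomputes importance.
    null = np.array([
        importance_of_x(rng.permutation(x), y, z, seed)
        for _ in range(num_permutations)
    ])
    return float(np.mean(null >= observed))


# Toy data (assumed for illustration): Y depends strongly on X even given Z.
rng = np.random.default_rng(1)
z = rng.normal(size=200)
x = rng.normal(size=200)
y = 2.0 * x + z + 0.1 * rng.normal(size=200)

p_dependent = rf_ci_pvalue(x, y, z)
print(f"p-value for X _||_ Y | Z: {p_dependent:.3f}")
```

Because the p-value is a proportion over the null draws, its resolution is limited to steps of 1/num_permutations, so more permutations give a finer-grained p-value at the cost of more model fits.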

Properties and Assumptions

Non-parametric: The test does not rely on assumptions of linearity or specific data distributions.
Handles Interactions and Mixed Data: Random Forests naturally handle interaction effects between variables and can be used with a mix of continuous and categorical data types.
Model-Agnostic Principle: While this implementation uses Random Forest, the underlying permutation-based testing framework is model-agnostic and can be applied with other predictive models (Fisher et al., 2019).
Assumptions: The primary assumption is that the Random Forest model is a sufficiently good predictor of the underlying relationships. If the model fails to capture the predictive patterns, the feature importance measures will not be reliable.
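
One way to probe the last assumption is to check the model's out-of-sample performance before interpreting importances. The sketch below uses scikit-learn and toy data invented for this example; the linear model appears only as a contrast to show what a poorly fitting predictor looks like, and it is not part of the test.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data (assumed for illustration): Y is non-linear in X and unrelated to Z.
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
z = rng.normal(size=n)
y = np.sin(2 * x) + 0.1 * rng.normal(size=n)
features = np.column_stack([x, z])

# Mean 5-fold cross-validated R^2 (the default score for regressors).
rf_r2 = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0), features, y, cv=5
).mean()
linear_r2 = cross_val_score(LinearRegression(), features, y, cv=5).mean()

print(f"Random Forest R^2: {rf_r2:.2f}")
print(f"Linear model R^2:  {linear_r2:.2f}")
```

A model that scores close to zero out of sample has little performance to lose when X is shuffled, so its importance scores say little about whether X and Y are dependent.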

Code Exampleimport numpy as np
from citk.tests import RandomForest

# Generate data with a non-linear relationship: X -> Z -> Y
n = 500
X = np.random.randn(n)
Z = np.sin(X * 2) + np.random.randn(n) * 0.2
Y = Z**3 + np.random.randn(n) * 0.2
data = np.vstack([X, Y, Z]).T

# Initialize the test
# num_permutations can be increased for more precise p-values
rf_test = RandomForest(data, num_permutations=99, random_state=42)

# Test for unconditional independence (should be dependent)
p_unconditional = rf_test(0, 1)
print(f"P-value (unconditional) for X _||_ Y: {p_unconditional:.4f}")

# Test for conditional independence given Z (should be independent)
p_conditional = rf_test(0, 1, [2])
print(f"P-value (conditional) for X _||_ Y | Z: {p_conditional:.4f}")

API Reference

For a full list of parameters, see the API documentation for `citk.tests.ml_based_tests.RandomForest`.

References

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
Fisher, A., Rudin, C., & Dominici, F. (2019). All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously. Journal of Machine Learning Research, 20(177), 1-81.
Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1), 307.