Random Forest CI Test

The Random Forest Conditional Independence (CI) test is a flexible, non-parametric method that leverages the predictive power of ensemble models to assess conditional independence. By measuring the importance of a feature in a predictive task, it can capture complex, non-linear relationships and interactions, making it well-suited for a wide range of data types.

Mathematical Formulation

The test is based on a simple but powerful premise: if a variable X is conditionally independent of a target Y given a set of conditioning variables Z, then X should have no predictive power for Y when the information in Z is already available. The test formalizes this idea by measuring the feature importance of X within a Random Forest model trained to predict Y.
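
The premise can be written out explicitly. The sketch below assumes squared-error loss for concreteness (the text above does not fix a loss); \(f\) and \(g\) are illustrative symbols for arbitrary predictors. Under the null hypothesis the regression function does not depend on X, so the best achievable predictor gains nothing from seeing X. The converse does not hold in general, because equal conditional means do not imply conditional independence.

```latex
H_0:\; Y \perp X \mid Z
\;\Longrightarrow\;
\mathbb{E}[Y \mid X, Z] = \mathbb{E}[Y \mid Z]
\;\Longrightarrow\;
\min_{f}\, \mathbb{E}\big[(Y - f(X, Z))^2\big]
= \min_{g}\, \mathbb{E}\big[(Y - g(Z))^2\big]
```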

The most robust measure of feature importance for this task is permutation importance (Breiman, 2001; Fisher et al., 2019). The overall procedure to obtain a p-value is as follows:

  1. Train a Predictive Model: A Random Forest model is trained to predict the target variable Y using the predictor X and the conditioning set Z as features.

  2. Calculate Observed Importance: The permutation importance of X is calculated. This is done by first recording the model’s performance (e.g., R² for regression, accuracy for classification) on a hold-out dataset. Then, the values in the column corresponding to feature X are randomly shuffled (permuted), and the model’s performance is measured again. The feature importance is the drop in performance caused by this shuffling. This serves as the observed test statistic: \( \text{Importance}(X) = \text{Performance}_{\text{original}} - \text{Performance}_{\text{permuted}(X)} \)

  3. Generate a Null Distribution: To determine whether the observed importance is statistically significant, we need a distribution of importances under the null hypothesis (\(Y \perp X \mid Z\)). This is achieved by shuffling X, which breaks any association between X and Y while leaving the relationship between Y and Z intact (it also breaks any association between X and Z). Specifically, for a number of repetitions:

    • The values of feature X are permuted.

    • A new Random Forest is trained from scratch on this permuted dataset (predicting Y from the permuted X and original Z).

    • The permutation importance of the (permuted) X is calculated for this new model.

    • This collection of importance scores forms the null distribution.

  4. Calculate P-Value: The p-value is the proportion of importance scores in the null distribution that are greater than or equal to the originally observed importance statistic.
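
The four steps above can be sketched as follows. This is a minimal illustration, not the citk implementation: it assumes scikit-learn, a continuous target scored with R², and a 70/30 train/hold-out split, and the function names `permutation_importance_of_x` and `rf_ci_pvalue` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def permutation_importance_of_x(x, y, z, rng, n_estimators=50):
    """Fit a forest on (x, z); return the hold-out R^2 drop when x is shuffled."""
    features = np.column_stack([x, z])  # column 0 holds x, the rest hold z
    seed = int(rng.integers(0, 2**31 - 1))
    f_train, f_test, y_train, y_test = train_test_split(
        features, y, test_size=0.3, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
    model.fit(f_train, y_train)
    baseline = r2_score(y_test, model.predict(f_test))

    # Shuffle only the x column of the hold-out set and score again.
    shuffled = f_test.copy()
    shuffled[:, 0] = rng.permutation(shuffled[:, 0])
    return baseline - r2_score(y_test, model.predict(shuffled))


def rf_ci_pvalue(x, y, z, num_permutations=99, seed=0):
    """Return (observed importance, p-value) for the null Y _||_ X | Z."""
    rng = np.random.default_rng(seed)
    observed = permutation_importance_of_x(x, y, z, rng)

    # Null distribution: permute x, retrain from scratch, recompute importance.
    null = np.array([
        permutation_importance_of_x(rng.permutation(x), y, z, rng)
        for _ in range(num_permutations)
    ])

    # Proportion of null importances at least as large as the observed one.
    return observed, float(np.mean(null >= observed))
```

Calling `rf_ci_pvalue(x, y, z)` with 1-D arrays `x` and `y` and a 1-D or 2-D array `z` returns the observed importance and the p-value. Because the p-value is a plain proportion, it can be exactly zero; a common variant adds one to both the count and the number of permutations to avoid this.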

Properties and Assumptions

  • Non-parametric: The test does not rely on assumptions of linearity or specific data distributions.

  • Handles Interactions and Mixed Data: Random Forests naturally handle interaction effects between variables and can be used with a mix of continuous and categorical data types.

  • Model-Agnostic Principle: While this implementation uses Random Forest, the underlying permutation-based testing framework is model-agnostic and can be applied with other predictive models (Fisher et al., 2019).

  • Assumptions: The primary assumption is that the Random Forest model is a sufficiently good predictor of the underlying relationships. If the model fails to capture the predictive patterns, the feature importance measures will not be reliable.
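
To illustrate the model-agnostic point, the observed statistic can be computed with a different learner. The sketch below assumes scikit-learn and uses its `permutation_importance` helper with a gradient-boosting regressor in place of a Random Forest; the synthetic data and variable names are illustrative only, and only the observed importance (step 2) is shown, not the full test.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data in which Y depends non-linearly on X and linearly on Z.
rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
z = rng.normal(size=n)
y = np.sin(2 * x) + 0.5 * z + 0.1 * rng.normal(size=n)

features = np.column_stack([x, z])  # column 0 = X, column 1 = Z
f_train, f_test, y_train, y_test = train_test_split(
    features, y, test_size=0.3, random_state=0
)

# Any estimator with fit/predict could be used here.
model = GradientBoostingRegressor(random_state=0).fit(f_train, y_train)

# Mean hold-out R^2 drop over 10 shuffles of each feature column.
result = permutation_importance(
    model, f_test, y_test, scoring="r2", n_repeats=10, random_state=0
)
importance_x = result.importances_mean[0]
print(f"Hold-out permutation importance of X: {importance_x:.3f}")
```

Averaging over several shuffles (`n_repeats`) reduces the noise that a single permutation introduces into the statistic.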

Code Example

import numpy as np
from citk.tests import RandomForest

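# Generate data with a non-linear causal chain: X -> Z -> Y
rng = np.random.default_rng(42)  # seeded so the example is reproducible
n = 500
X = rng.standard_normal(n)
Z = np.sin(X * 2) + rng.standard_normal(n) * 0.2
Y = Z**3 + rng.standard_normal(n) * 0.2
data = np.vstack([X, Y, Z]).T  # columns: 0 = X, 1 = Y, 2 = Z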

# Initialize the test
# num_permutations can be increased for more precise p-values
rf_test = RandomForest(data, num_permutations=99, random_state=42)

# Test for unconditional independence (X and Y are dependent, so expect a small p-value)
p_unconditional = rf_test(0, 1)
print(f"P-value (unconditional) for X _||_ Y: {p_unconditional:.4f}")

# Test for conditional independence given Z (X and Y are independent given Z, so expect a large p-value)
p_conditional = rf_test(0, 1, [2])
print(f"P-value (conditional) for X _||_ Y | Z: {p_conditional:.4f}")

API Reference

For a full list of parameters, see the API documentation: :class:`citk.tests.ml_based_tests.RandomForest`.

References

  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.

  • Fisher, A., Rudin, C., & Dominici, F. (2019). All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models. Journal of Machine Learning Research, 20(177), 1-81.

  • Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1), 307.