# Random Forest CI Test

The Random Forest Conditional Independence (CI) test is a flexible, non-parametric method that leverages the predictive power of ensemble models to assess conditional independence. By measuring the importance of a feature in a predictive task, it can capture complex, non-linear relationships and interactions, making it well-suited for a wide range of data types.

## Mathematical Formulation

The test is based on a simple but powerful premise: if a variable *X* is conditionally independent of a target *Y* given a set of conditioning variables *Z*, then *X* should have no predictive power for *Y* when the information in *Z* is already available. The test formalizes this idea by measuring the feature importance of *X* within a Random Forest model trained to predict *Y*.

The test uses **permutation importance** (Breiman, 2001; Fisher et al., 2019) as its measure of feature importance, because it is evaluated on hold-out data and does not depend on the internal structure of the model. The overall procedure to obtain a p-value is as follows:

1.  **Train a Predictive Model**: A Random Forest model is trained to predict the target variable `Y` using the predictor `X` and the conditioning set `Z` as features.

2.  **Calculate Observed Importance**: The permutation importance of `X` is calculated. This is done by first recording the model's performance (e.g., R² for regression, accuracy for classification) on a hold-out dataset. Then, the values in the column corresponding to feature `X` are randomly shuffled (permuted), and the model's performance is measured again. The feature importance is the drop in performance caused by this shuffling. This serves as the **observed test statistic**.
    $$
    \text{Importance}(X) = \text{Performance}_{\text{original}} - \text{Performance}_{\text{permuted}(X)}
    $$

3.  **Generate a Null Distribution**: To determine whether the observed importance is statistically significant, we need a distribution of importances under the null hypothesis ($Y \perp X \mid Z$). This is achieved by breaking the relationship between *X* and *Y* while leaving the relationship between *Y* and *Z* intact. Note that shuffling *X* also removes any dependence between *X* and *Z*, a known limitation of marginal permutation schemes (Strobl et al., 2008). Specifically, for a number of repetitions:
    *   The values of feature `X` are permuted.
    *   A *new* Random Forest is trained from scratch on this permuted dataset (predicting `Y` from the permuted `X` and original `Z`).
    *   The permutation importance of the (permuted) `X` is calculated for this new model.
    *   This collection of importance scores forms the **null distribution**.

4.  **Calculate P-Value**: The p-value is the proportion of importance scores in the null distribution that are greater than or equal to the originally observed importance statistic.
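
The four steps above can be sketched with scikit-learn as follows. This is a minimal illustration for a continuous target, not the `citk` implementation: the function names `importance_of_x` and `rf_ci_pvalue`, the 30% hold-out split, and the number of trees are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def importance_of_x(x, y, z, rng, n_trees=50):
    """Hold-out permutation importance of x when predicting y from (x, z).

    Hypothetical helper for illustration; regression (R^2) only.
    """
    features = np.column_stack([x, z])  # column 0 is x, the rest is z
    f_train, f_test, y_train, y_test = train_test_split(
        features, y, test_size=0.3, random_state=int(rng.integers(1 << 31))
    )
    model = RandomForestRegressor(
        n_estimators=n_trees, random_state=int(rng.integers(1 << 31))
    ).fit(f_train, y_train)
    baseline = model.score(f_test, y_test)  # R^2 on the hold-out set
    shuffled = f_test.copy()
    shuffled[:, 0] = rng.permutation(shuffled[:, 0])  # shuffle only the x column
    return baseline - model.score(shuffled, y_test)  # drop in performance


def rf_ci_pvalue(x, y, z, num_permutations=19, seed=0):
    """Return (observed importance, null importances, p-value)."""
    rng = np.random.default_rng(seed)
    observed = importance_of_x(x, y, z, rng)  # steps 1 and 2
    null = np.array([  # step 3: retrain on a permuted copy of x each time
        importance_of_x(rng.permutation(x), y, z, rng)
        for _ in range(num_permutations)
    ])
    p_value = float(np.mean(null >= observed))  # step 4
    return observed, null, p_value


# Usage on synthetic data where Y depends on X directly, even given Z.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
z = rng.normal(size=300)
y = x + z + 0.1 * rng.normal(size=300)
observed, null, p_value = rf_ci_pvalue(x, y, z)
print(f"observed importance = {observed:.3f}, p-value = {p_value:.3f}")
```

The sketch computes the p-value exactly as defined in step 4, so it can be zero when no null importance reaches the observed one. Many permutation tests instead add one to both the numerator and the denominator to avoid reporting a p-value of exactly zero; whether the library does so is not stated here.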

## Properties and Assumptions

*   **Non-parametric**: The test does not rely on assumptions of linearity or specific data distributions.
*   **Handles Interactions and Mixed Data**: Random Forests naturally handle interaction effects between variables and can be used with a mix of continuous and categorical data types.
*   **Model-Agnostic Principle**: While this implementation uses Random Forest, the underlying permutation-based testing framework is model-agnostic and can be applied with other predictive models (Fisher et al., 2019).
*   **Assumptions**: The primary assumption is that the Random Forest model is a sufficiently good predictor of the underlying relationships. If the model fails to capture the predictive patterns, the feature importance measures will not be reliable.
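
One way to probe the assumption above is to check how well a forest predicts `Y` from `X` and `Z` before trusting the test. The snippet below is a hedged sketch using scikit-learn's cross-validation on synthetic data similar to the example on this page; it is a separate diagnostic, not part of the `citk` API, and the choice of 5 folds and 100 trees is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic chain X -> Z -> Y (illustrative data only).
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
z = np.sin(2 * x) + 0.2 * rng.normal(size=n)
y = z**3 + 0.2 * rng.normal(size=n)

# Cross-validated R^2 of a forest predicting Y from (X, Z).
features = np.column_stack([x, z])
forest = RandomForestRegressor(n_estimators=100, random_state=0)
cv_r2 = cross_val_score(forest, features, y, cv=5, scoring="r2")
print(f"Cross-validated R^2: {cv_r2.mean():.3f} (+/- {cv_r2.std():.3f})")
```

A score close to zero (or negative) suggests the model is not capturing the relationship, in which case the importance of `X` and the resulting p-value should be interpreted with caution.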

## Code Example

```python
import numpy as np
from citk.tests import RandomForest

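# Generate data with a non-linear chain: X -> Z -> Y
n = 500
rng = np.random.default_rng(42)  # seed the data so the example is reproducible
X = rng.standard_normal(n)
Z = np.sin(X * 2) + rng.standard_normal(n) * 0.2
Y = Z**3 + rng.standard_normal(n) * 0.2
# Column indices: 0 = X, 1 = Y, 2 = Z
data = np.vstack([X, Y, Z]).T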

# Initialize the test
# num_permutations can be increased for more precise p-values
rf_test = RandomForest(data, num_permutations=99, random_state=42)

# Unconditional test of X _||_ Y: X and Y are dependent, so expect a small p-value
p_unconditional = rf_test(0, 1)
print(f"P-value (unconditional) for X _||_ Y: {p_unconditional:.4f}")

# Conditional test of X _||_ Y | Z: X and Y are independent given Z, so expect a large p-value
p_conditional = rf_test(0, 1, [2])
print(f"P-value (conditional) for X _||_ Y | Z: {p_conditional:.4f}")
```

## API Reference

For a full list of parameters, see the API documentation: :class:`citk.tests.ml_based_tests.RandomForest`.

## References

*   Breiman, L. (2001). Random Forests. *Machine Learning, 45*(1), 5-32.
*   Fisher, A., Rudin, C., & Dominici, F. (2019). All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models. *Journal of Machine Learning Research, 20*(177), 1-81.
*   Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. *BMC Bioinformatics, 9*(1), 307.