Generalised Covariance Measure (GCM) Test

The Generalised Covariance Measure of Shah & Peters (2020) tests \(X \perp Y \mid Z\) by separately regressing \(X\) on \(Z\) and \(Y\) on \(Z\), then asking whether the resulting residuals are uncorrelated. Shah & Peters (2020) show that the test's validity rests on a comparatively weak requirement: the product of the two regressions' estimation errors must shrink faster than \(n^{-1/2}\). The choice of regression backend therefore determines the test's effective assumptions while leaving the asymptotic-normal calibration intact.

The citk implementation uses random forest regression by default (via the pycomets library) for both nuisance regressions.

Intuition. Under \(X \perp Y \mid Z\), the population residuals \(X - E[X \mid Z]\) and \(Y - E[Y \mid Z]\) have zero covariance; testing whether their sample covariance is zero yields a CI test that inherits flexibility from modern ML regressors while remaining calibrated by a standard-normal limit (Shah & Peters, 2020).

Mathematical Formulation

Let \(\hat{f}(z) \approx \mathbb{E}[X \mid Z = z]\) and \(\hat{g}(z) \approx \mathbb{E}[Y \mid Z = z]\) denote the nuisance regression estimates (Shah & Peters, 2020). The residuals are

\[r_{X,i} = X_i - \hat{f}(Z_i), \qquad r_{Y,i} = Y_i - \hat{g}(Z_i)\]

and the GCM test statistic is the studentised mean of their elementwise product (Shah & Peters, 2020):

\[T_{\mathrm{GCM}} = \frac{\sqrt{n}\, \overline{R}}{\hat{\sigma}_R}, \qquad R_i = r_{X,i} \cdot r_{Y,i}, \quad \overline{R} = \frac{1}{n}\sum_i R_i, \quad \hat{\sigma}_R^2 = \frac{1}{n}\sum_i R_i^2 - \overline{R}^2\]

Under the null \(X \perp Y \mid Z\), \(T_{\mathrm{GCM}} \xrightarrow{d} \mathcal{N}(0, 1)\), regardless of the regression method, provided the nuisance product rate \(\| \hat{f} - f^* \|_2 \cdot \| \hat{g} - g^* \|_2 = o_P(n^{-1/2})\) holds (Shah & Peters, 2020). Power depends directly on regression quality; poor residual estimation reduces the signal available for the covariance test (Shah & Peters, 2020).
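As an illustration of the formulas above (not the citk implementation), the statistic can be computed by hand, with scikit-learn random forests standing in for the nuisance regressions; fitting and predicting in-sample is a simplification for the sketch:

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 1))
X = np.sin(Z[:, 0]) + 0.3 * rng.normal(size=n)
Y = np.cos(Z[:, 0]) + 0.3 * rng.normal(size=n)  # X _||_ Y | Z by construction

# Nuisance regressions f-hat and g-hat (random forests as stand-in estimators)
r_X = X - RandomForestRegressor(random_state=0).fit(Z, X).predict(Z)
r_Y = Y - RandomForestRegressor(random_state=0).fit(Z, Y).predict(Z)

# Studentised mean of the elementwise residual products
R = r_X * r_Y
T = np.sqrt(n) * R.mean() / R.std()
p_value = 2 * norm.sf(abs(T))  # two-sided N(0, 1) calibration
```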

Assumptions

  • Consistent nuisance regression. The product rate condition \(\| \hat{f} - f^* \|_2 \cdot \| \hat{g} - g^* \|_2 = o_P(n^{-1/2})\) must hold for the asymptotic normality to be valid; flexible learners like random forests typically meet this in low-to-moderate \(\dim(Z)\) (Shah & Peters, 2020).

  • Variable types. Random forest nuisance regressions handle continuous, discrete, or mixed \(X\), \(Y\), and \(Z\) natively; no separate type declaration is required.

  • No uniformly powerful CI test exists. Shah & Peters (2020) prove that a valid CI test cannot have power against arbitrary alternatives: GCM’s validity is universal but its power is alternative-class-dependent.

  • Sample size. Studentised normal calibration requires an adequate sample for stable variance estimation (Shah & Peters, 2020).

  • Dtype validation is opt-in. Passing data outside the declared dtype produces undefined results; call Test.validate_data(data) to check. citk does not enforce supported_dtypes at construction.
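To see why the product-rate condition in the first assumption matters, consider a minimal numpy sketch (illustrative only, not the citk implementation): if the nuisance fits are too crude, leftover \(Z\)-signal in both residuals inflates the statistic and breaks calibration under the null.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 2000
Z = rng.normal(size=n)
X = Z + 0.5 * rng.normal(size=n)
Y = Z + 0.5 * rng.normal(size=n)  # X _||_ Y | Z, but both depend on Z

# Deliberately bad nuisance "regressions": intercept-only fits leave the
# Z-signal in both residuals, violating the product-rate condition.
r_X = X - X.mean()
r_Y = Y - Y.mean()

R = r_X * r_Y
T = np.sqrt(n) * R.mean() / R.std()
p_bad = 2 * norm.sf(abs(T))  # spuriously small despite the null holding
```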

v0.1.0 implementation notes

The pycomets backend, regressor (RandomForestRegressor), and its hyperparameters are not surfaced as constructor kwargs in v0.1.0. The wrapper accepts cache_path only; the underlying regression is fixed to pycomets defaults. Future minor versions may add explicit kwargs (e.g. regressor, n_estimators) — these will be additive, never removing today’s defaults, so v0.1.0 calls remain valid.

Empty conditioning set: when condition_set is empty (or None), the wrapper substitutes a single constant column \(Z = 0\) before passing to pycomets. This means GCM(data)(X, Y, []) tests \(X \perp Y\) marginally via residualisation against a constant, not via a separate no-conditioning path. The other 16 tests in citk pass an empty conditioning set through unchanged. The same behaviour applies to :doc:`/tests/wgcm_test` and :doc:`/tests/pcm_test`.
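To see what the constant-column substitution amounts to, a small numpy sketch (illustrative, not the pycomets internals): regressing on a constant fits only an intercept, so the residuals are the mean-centred variables and the GCM statistic reduces to a studentised sample covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=n)
Y = 2 * X + rng.normal(size=n)  # marginally dependent

# Residualising against a constant column fits only an intercept,
# so the "residuals" are just the mean-centred variables.
r_X = X - X.mean()
r_Y = Y - Y.mean()

R = r_X * r_Y
T = np.sqrt(n) * R.mean() / R.std()  # studentised sample covariance
# |T| is large here, so the marginal test rejects independence.
```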

Code Example

import numpy as np
from citk.tests import GCM

np.random.seed(0)  # for reproducible draws

# Non-linear chain: X -> Z -> Y
n = 400
X = np.random.randn(n)
Z = np.sin(X) + 0.2 * np.random.randn(n)
Y = Z + 0.2 * np.random.randn(n)  # linear in Z, so Cov(X, Y) != 0 and the marginal test below has power
data = np.vstack([X, Y, Z]).T

# Initialize the test (uses pycomets random forest regression by default)
gcm_test = GCM(data)

# Test for conditional independence of X and Y given Z
# Expected: p-value is large (cannot reject H0 of independence)
p_value_conditional = gcm_test(0, 1, [2])
print(f"P-value for X _||_ Y | Z: {p_value_conditional:.4f}")

# Test for unconditional independence of X and Y
# Expected: p-value is small (reject H0 of independence)
p_value_unconditional = gcm_test(0, 1)
print(f"P-value for X _||_ Y: {p_value_unconditional:.4f}")

API Reference

For a full list of parameters, see the API documentation: :class:`citk.tests.ml_based_tests.GCM`.

References

Shah, R. D., & Peters, J. (2020). The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3), 1514-1538.