Projected Covariance Measure (PCM) Test¶

The Projected Covariance Measure of Lundborg et al. (2024) is an assumption-lean test for the model-free null of conditional mean independence — that the conditional mean of \(Y\) given \((X, Z)\) does not depend on \(X\). It addresses the principal weakness of :doc:/tests/gcm_test (Shah & Peters, 2020): GCM has reduced power when \(X\) is involved in complex interactions or when the predictor of interest is weakly identified by the nuisance regressions. Lundborg et al. (2024) prove that a spline-regression instance of their procedure attains the minimax optimal rate in this nonparametric testing problem.

The citk implementation uses sample-splitting with random forest regression by default (via the pycomets library).

Intuition. Rather than testing zero residual covariance directly, PCM first uses one half of the data to estimate a projection of \(Y\) on \((X, Z)\) — typically the regression \(\hat{h}(X, Z) \approx \mathbb{E}[Y \mid X, Z]\) — and then on the other half tests the expected conditional covariance between this projection and \(Y\), after adjusting both for \(Z\) (Lundborg et al., 2024). The procedure inherits robust Type I error control from the orthogonality of the residualisation step and gains power from the data-driven projection (Lundborg et al., 2024).

Mathematical Formulation¶

PCM splits the data into two folds. On the first fold it learns a projection function \(\hat{h}(X, Z)\), typically \(\hat{h}(X, Z) \approx \mathbb{E}[Y \mid X, Z]\) (Lundborg et al., 2024). On the second fold it residualises both \(\hat{h}(X, Z)\) and \(Y\) against \(Z\) via nuisance regressions \(\hat{m}_h\) and \(\hat{m}_Y\):

\[\tilde{h}_i = \hat{h}(X_i, Z_i) - \hat{m}_h(Z_i), \qquad \tilde{Y}_i = Y_i - \hat{m}_Y(Z_i)\]

and forms the studentised covariance test statistic (Lundborg et al., 2024):

\[T_{\mathrm{PCM}} = \frac{\sqrt{n_2}\, \overline{R}}{\hat{\sigma}_R}, \qquad R_i = \tilde{h}_i \cdot \tilde{Y}_i\]

Under the null, \(T_{\mathrm{PCM}} \xrightarrow{d} \mathcal{N}(0, 1)\) under nuisance rate conditions analogous to GCM’s (Lundborg et al., 2024). Crucially, validity of the test does not require \(\hat{h}\) to be a good predictor of \(Y\) — only that the residualisation step is consistent — so PCM remains assumption-lean (Lundborg et al., 2024).

Assumptions¶

Conditional mean independence null. PCM tests \(\mathbb{E}[Y \mid X, Z] = \mathbb{E}[Y \mid Z]\) (i.e. the conditional mean of \(Y\) does not depend on \(X\) given \(Z\)), not full conditional independence (Lundborg et al., 2024).
Consistent residualisation. \(\hat{m}_h\) and \(\hat{m}_Y\) must converge fast enough for studentised normal calibration; flexible learners with sample splitting typically suffice (Lundborg et al., 2024).
Variable types. Random forest nuisance regressions handle continuous, discrete, or mixed \(X\), \(Y\), and \(Z\) natively; no separate type declaration is required (Shah & Peters, 2020; Lundborg et al., 2024).
Sample size. Both folds need to be large enough for the projection step on the first fold and the test statistic on the second fold (Lundborg et al., 2024).
Optimality. A spline-regression version achieves the minimax optimal rate for this nonparametric testing problem (Lundborg et al., 2024).
Dtype validation is opt-in. Passing data outside the declared dtype produces undefined results; call Test.validate_data(data) to check. citk does not enforce supported_dtypes at construction.

v0.1.0 implementation notes¶

The pycomets backend, regressor (RandomForestRegressor), the projection estimator, and the sample-splitting fold count are not surfaced as constructor kwargs in v0.1.0. Future minor versions may add explicit kwargs additively. Empty conditioning set is handled by substituting a constant column \(Z = 0\), as in :doc:/tests/gcm_test.

Code Example¶import numpy as np
from citk.tests import PCM

# Non-linear chain: X -> Z -> Y
n = 500
X = np.random.randn(n)
Z = np.sin(X) + 0.2 * np.random.randn(n)
Y = Z**2 + 0.2 * np.random.randn(n)
data = np.vstack([X, Y, Z]).T

# Initialize the test (uses pycomets random forest regression with sample splitting)
pcm_test = PCM(data)

# Test for conditional independence of X and Y given Z
# Expected: p-value is large (cannot reject H0 of independence)
p_value_conditional = pcm_test(0, 1, [2])
print(f"P-value for X _||_ Y | Z: {p_value_conditional:.4f}")

# Test for unconditional independence of X and Y
# Expected: p-value is small (reject H0 of independence)
p_value_unconditional = pcm_test(0, 1)
print(f"P-value for X _||_ Y: {p_value_unconditional:.4f}")

API Reference¶

For a full list of parameters, see the API documentation: :class:citk.tests.ml_based_tests.PCM.

References¶

Lundborg, A. R., Kim, I., Shah, R. D., & Samworth, R. J. (2024). The projected covariance measure for assumption-lean variable significance testing. The Annals of Statistics, to appear.

Shah, R. D., & Peters, J. (2020). The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3), 1514-1538.