Chi-Squared Test
The Chi-Squared (\(\chi^2\)) test is a classical statistical test for categorical (discrete) data. Developed by Karl Pearson at the turn of the 20th century, it was one of the first “goodness-of-fit” tests, designed to assess whether an observed set of frequencies differs from a theoretical distribution (Pearson, 1900). In the context of contingency tables, it is used to test for the conditional independence of two variables, X and Y, given a set of variables, Z.
Mathematical Formulation
The test compares the observed frequencies (O) in a contingency table with the frequencies that would be expected (E) if the null hypothesis of independence were true. The Pearson Chi-Squared statistic is calculated as:
where the sum is over all cells i in the contingency table. This formula can be seen as a second-order Taylor approximation to the log-likelihood ratio (G-test) statistic, to which it is asymptotically equivalent.
Under the null hypothesis, this statistic follows a Chi-Squared (\(\chi^2\)) distribution with degrees of freedom given by:
where \(|V|\) denotes the number of distinct categories for a variable V.
Assumptions
Categorical Data: The data must be categorical (discrete).
Independent Observations: The individual observations must be independent of each other.
Sufficient Sample Size: The sample size should be large enough that the expected frequency in each cell is not too small. A widely cited rule of thumb, often attributed to Cochran, suggests that the test may be inappropriate if more than 20% of the cells have an expected frequency below 5, or if any cell has an expected frequency below 1 (Cochran, 1954).
Code Example
import numpy as np
from citk.tests import ChiSq
# Generate discrete data representing a collider: X -> Y <- Z
n = 500
X = np.random.randint(0, 2, size=n)
Z = np.random.randint(0, 2, size=n)
Y = (X + Z + np.random.randint(0, 2, size=n)) % 2
data = np.vstack([X, Y, Z]).T
# Initialize the test
chisq_test = ChiSq(data)
# Test for unconditional independence (X and Z are independent)
p_value_unconditional = chisq_test(0, 2)
print(f"P-value for X _||_ Z: {p_value_unconditional:.4f}")
# Test for conditional dependence on the collider Y
p_value_conditional = chisq_test(0, 2, [1])
print(f"P-value for X _||_ Z | Y: {p_value_conditional:.4f}")
API Reference
For a full list of parameters, see the API documentation: :class:citk.tests.simple_tests.ChiSq
.
References
Cochran, W. G. (1954). Some methods for strengthening the common \(\chi^2\) tests. Biometrics, 10(4), 417-451.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50(302), 157-175.