G-Squared Test
The G-Squared test, also known as the G-test or more formally as the likelihood-ratio test for contingency tables, is a conditional independence test for categorical (discrete) data. It is a powerful alternative to the more traditional Pearson’s Chi-Squared test and is a standard method used in constraint-based causal discovery algorithms for discrete variables (Spirtes et al., 2000).
The theoretical foundation for the test is Wilks's theorem, which shows that the distribution of the log-likelihood ratio statistic asymptotically approaches a Chi-Square (\(\chi^2\)) distribution (Wilks, 1938). The G-test is often preferred by statisticians for its mathematical properties, such as the additivity of G-statistics across subdivisions of the data. It is also directly connected to information theory: the G-statistic equals \(2N\) times the empirical mutual information (in nats) between the variables, where \(N\) is the sample size (Cover & Thomas, 2006).
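This identity is easy to verify numerically. The sketch below (plain NumPy, not tied to any particular test library) builds a contingency table for two dependent discrete variables and checks that \(G = 2N \cdot I(X;Y)\):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
x = rng.integers(0, 3, size=N)
y = (x + rng.integers(0, 2, size=N)) % 3  # Y depends on X

# Contingency table of observed counts
obs = np.zeros((3, 3))
for xi, yi in zip(x, y):
    obs[xi, yi] += 1
n = obs.sum()

# Expected counts under independence: (row total * column total) / n
row = obs.sum(axis=1, keepdims=True)
col = obs.sum(axis=0, keepdims=True)
expected = row @ col / n

# G-statistic over the non-empty cells
mask = obs > 0
G = 2.0 * np.sum(obs[mask] * np.log(obs[mask] / expected[mask]))

# Empirical mutual information (in nats) from the joint distribution
p = obs / n                      # joint probabilities
pm = (row / n) @ (col / n)       # product of the marginals
mi = np.sum(p[mask] * np.log(p[mask] / pm[mask]))

# The two agree exactly: G = 2 * n * MI
print(f"G = {G:.3f}, 2*N*MI = {2 * n * mi:.3f}")
```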
Mathematical Formulation
The test statistic is calculated from the observed frequencies (O) and the expected frequencies (E) in a contingency table constructed from the data. The expected frequencies are calculated under the null hypothesis of independence. The formula for the G-statistic is:

\[
G = 2 \sum_{i} O_i \ln\!\left(\frac{O_i}{E_i}\right)
\]

where the sum is taken over all non-empty cells \(i\) in the contingency table. For a conditional independence test of \(X \perp Y \mid Z\), this calculation is performed for each stratum (i.e., for each specific value of the conditioning variable Z), and the resulting G-statistics are summed.
Under the null hypothesis, the total G-statistic is asymptotically distributed as a Chi-Square (\(\chi^2\)) random variable. The degrees of freedom (df) are calculated as:

\[
\mathrm{df} = (|X| - 1)\,(|Y| - 1) \prod_{Z_k \in Z} |Z_k|
\]

where \(|V|\) denotes the number of distinct categories for a variable \(V\), and the product runs over the variables in the conditioning set (it is taken to be 1 for an unconditional test).
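To make the two formulas concrete, here is a minimal, library-agnostic NumPy sketch of the stratified conditional test (the three-level variables and the data-generating process are illustrative choices):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 2000
z = rng.integers(0, 3, size=n)
x = (z + rng.integers(0, 2, size=n)) % 3
y = (z + rng.integers(0, 2, size=n)) % 3   # X _||_ Y | Z by construction

# Sum the G-statistic over the strata of Z
G_total = 0.0
for z_val in np.unique(z):
    sub_x, sub_y = x[z == z_val], y[z == z_val]
    obs = np.zeros((3, 3))
    for xi, yi in zip(sub_x, sub_y):
        obs[xi, yi] += 1
    expected = obs.sum(axis=1, keepdims=True) @ obs.sum(axis=0, keepdims=True) / obs.sum()
    mask = obs > 0
    G_total += 2.0 * np.sum(obs[mask] * np.log(obs[mask] / expected[mask]))

# df = (|X| - 1)(|Y| - 1)|Z|
df = (3 - 1) * (3 - 1) * 3
p_value = chi2.sf(G_total, df)
print(f"G = {G_total:.2f}, df = {df}, p = {p_value:.3f}")
```

Since X and Y are conditionally independent given Z here, a large p-value is the typical outcome.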
Assumptions
Categorical Data: The variables under consideration must be discrete (categorical).
Independent Samples: The observations are assumed to be drawn independently from the population.
Sufficient Sample Size: As an asymptotic test, its validity depends on the sample size being large enough. While the G-test is often considered more reliable than Pearson’s Chi-Squared test for smaller sample sizes (Sokal & Rohlf, 1981), caution is still advised. A common rule of thumb is that the test may be unreliable if more than 20% of the cells in the contingency table have an expected frequency of less than 5.
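A quick way to apply this rule of thumb is to inspect the table of expected frequencies before trusting the asymptotic p-value. The snippet below is a minimal sketch using NumPy marginals (the 4×4 table and the deliberately small sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 80                              # deliberately small sample
x = rng.integers(0, 4, size=n)
y = rng.integers(0, 4, size=n)

obs = np.zeros((4, 4))
for xi, yi in zip(x, y):
    obs[xi, yi] += 1

# Expected frequencies under independence
expected = obs.sum(axis=1, keepdims=True) @ obs.sum(axis=0, keepdims=True) / obs.sum()

frac_small = np.mean(expected < 5)
print(f"{frac_small:.0%} of cells have expected frequency < 5")
if frac_small > 0.20:
    print("Warning: the asymptotic G-test may be unreliable at this sample size")
```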
Code Example
import numpy as np
from citk.tests import GSq
# Generate discrete data for a chain: X -> Z -> Y
np.random.seed(0)  # fix the seed so the printed p-values are reproducible
n = 500
X = np.random.randint(0, 3, size=n)
Z = (X + np.random.randint(0, 2, size=n)) % 3
Y = (Z + np.random.randint(0, 2, size=n)) % 3
data = np.vstack([X, Y, Z]).T
# Initialize the test
g_sq_test = GSq(data)
# Test for unconditional independence of X and Y
# Expected: p-value is small (reject H0 of independence)
p_value_unconditional = g_sq_test(0, 1)
print(f"P-value (unconditional) for X _||_ Y: {p_value_unconditional:.4f}")
# Test for conditional independence of X and Y given Z
# Expected: p-value is large (cannot reject H0 of independence)
p_value_conditional = g_sq_test(0, 1, [2])
print(f"P-value (conditional) for X _||_ Y | Z: {p_value_conditional:.4f}")
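As a cross-check independent of citk, the unconditional test can be reproduced with SciPy, whose chi2_contingency function computes the G-statistic when passed lambda_="log-likelihood" (the data-generating process below mirrors the example above):

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 500
X = rng.integers(0, 3, size=n)
Z = (X + rng.integers(0, 2, size=n)) % 3
Y = (Z + rng.integers(0, 2, size=n)) % 3

# Build the X-Y contingency table
table = np.zeros((3, 3))
for xi, yi in zip(X, Y):
    table[xi, yi] += 1

# lambda_="log-likelihood" selects the G-statistic instead of Pearson's X^2
g_stat, p_value, dof, _ = chi2_contingency(table, lambda_="log-likelihood")
print(f"G = {g_stat:.2f}, df = {dof}, p = {p_value:.4g}")
```

Because Y depends on X through the chain, the unconditional p-value should be small, matching the citk result above.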
API Reference
For a full list of parameters, see the API documentation: :class:`citk.tests.simple_tests.GSq`.
References
Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience.
Sokal, R. R., & Rohlf, F. J. (1981). Biometry: The Principles and Practice of Statistics in Biological Research. W. H. Freeman.
Spirtes, P., Glymour, C. N., & Scheines, R. (2000). Causation, Prediction, and Search (2nd ed.). MIT Press.
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9(1), 60-62.