# G-Squared Test

The G-Squared test, also known as the G-test or more formally as the likelihood-ratio test for contingency tables, is a conditional independence test for categorical (discrete) data. It is a powerful alternative to the more traditional Pearson's Chi-Squared test and is a standard method used in constraint-based causal discovery algorithms for discrete variables (Spirtes et al., 2000).

The theoretical foundation for the test is Wilks's theorem, which shows that, under the null hypothesis, the distribution of the log-likelihood ratio statistic asymptotically approaches a Chi-Squared ($\chi^2$) distribution (Wilks, 1938). The G-test is often preferred by statisticians for its mathematical properties, such as additivity: G-statistics computed on separate strata can be summed into a single statistic. It is also directly related to information theory: for a sample of size $N$, $G = 2N \cdot \hat{I}(X; Y)$, where $\hat{I}$ is the mutual information of the empirical joint distribution, measured in nats (Cover & Thomas, 2006).
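The link to mutual information can be checked numerically. The following is a minimal sketch using only NumPy; the count table and the helper name `empirical_mutual_information` are illustrative assumptions and are not part of `citk`. It computes the G-statistic from observed and expected counts, then compares it with $2N$ times the plug-in mutual information (natural logarithm).

```python
import numpy as np


def empirical_mutual_information(table):
    """Plug-in mutual information (in nats) of the empirical joint distribution."""
    joint = np.asarray(table, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of the row variable
    py = joint.sum(axis=0, keepdims=True)  # marginal of the column variable
    mask = joint > 0  # 0 * ln(0) is treated as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))


# Hypothetical 3x3 table of observed counts for two discrete variables.
counts = np.array([[30, 10, 5], [8, 25, 12], [4, 9, 40]], dtype=float)
n = counts.sum()

# Expected counts under independence: (row total * column total) / N.
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
g = 2.0 * np.sum(counts * np.log(counts / expected))  # all cells are non-empty here

mi = empirical_mutual_information(counts)
print(f"G = {g:.4f}, 2 * N * MI = {2 * n * mi:.4f}")
```

The two printed values should agree up to floating-point error, because $O_i / E_i$ equals the ratio of the empirical joint probability to the product of the empirical marginals.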

## Mathematical Formulation

The test statistic is calculated from the observed frequencies (O) and the expected frequencies (E) in a contingency table constructed from the data. The expected frequencies are calculated under the null hypothesis of independence. The formula for the G-statistic is:

```{math}
G = 2 \sum_{i} O_i \ln\left(\frac{O_i}{E_i}\right)
```

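where the sum is taken over all non-empty cells $i$ in the contingency table; cells with $O_i = 0$ contribute nothing, following the convention $0 \ln 0 = 0$. For a conditional independence test of $X \perp Y \mid Z$, this calculation is performed within each stratum (i.e., for each specific value of the conditioning variable $Z$), with expected frequencies computed from that stratum's own marginal totals, and the resulting G-statistics are summed.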
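As a worked illustration of the formula, the sketch below computes expected frequencies and the G-statistic for a single, unconditional table that contains one empty cell, and computes Pearson's statistic on the same table for comparison. The counts are made up for this example, and the code is a plain NumPy sketch, not the `citk` implementation.

```python
import numpy as np

# Hypothetical 2x3 contingency table; note the empty cell in the first row.
observed = np.array([[18, 7, 0], [6, 14, 5]], dtype=float)
n = observed.sum()

# Expected counts under independence: (row total * column total) / N.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Sum only over non-empty cells, so the empty cell is skipped rather than
# producing log(0).
nonzero = observed > 0
g = 2.0 * np.sum(observed[nonzero] * np.log(observed[nonzero] / expected[nonzero]))

# Pearson's chi-squared statistic on the same table, for comparison.
pearson = np.sum((observed - expected) ** 2 / expected)

print(f"G = {g:.3f}, Pearson X^2 = {pearson:.3f}")
```

Both statistics are referred to the same $\chi^2$ distribution; they usually differ more on small or sparse tables like this one than on large, well-populated ones.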

Under the null hypothesis, the total G-statistic is asymptotically distributed as a Chi-Squared ($\chi^2$) random variable. The degrees of freedom (df) are calculated as:

```{math}
df = (|X| - 1)(|Y| - 1) \prod_{Z_k \in Z} |Z_k|
```

where $|V|$ denotes the number of distinct categories of a variable $V$ and the product runs over the variables in the conditioning set $Z$ (it equals 1 when $Z$ is empty).
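The stratified procedure and the degrees-of-freedom formula can be put together as follows. This is a minimal sketch assuming a single discrete conditioning variable and assuming SciPy is available for the $\chi^2$ survival function. The function name `conditional_g_test` is hypothetical; whether `citk` computes the statistic exactly this way is not stated here.

```python
import numpy as np
from scipy.stats import chi2


def conditional_g_test(x, y, z):
    """Sketch of a G-test for X _||_ Y | Z with one discrete conditioning variable."""
    x, y, z = (np.asarray(v) for v in (x, y, z))
    x_levels, x_idx = np.unique(x, return_inverse=True)
    y_levels, y_idx = np.unique(y, return_inverse=True)
    z_levels = np.unique(z)

    g_total = 0.0
    for level in z_levels:
        in_stratum = z == level
        # Contingency table of X by Y restricted to this stratum of Z.
        table = np.zeros((len(x_levels), len(y_levels)))
        np.add.at(table, (x_idx[in_stratum], y_idx[in_stratum]), 1)
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
        nonzero = table > 0  # skip empty cells
        g_total += 2.0 * np.sum(
            table[nonzero] * np.log(table[nonzero] / expected[nonzero])
        )

    # df = (|X| - 1)(|Y| - 1)|Z| for a single conditioning variable.
    dof = (len(x_levels) - 1) * (len(y_levels) - 1) * len(z_levels)
    return g_total, dof, chi2.sf(g_total, dof)


# Chain X -> Z -> Y, so X and Y are dependent but independent given Z.
rng = np.random.default_rng(0)
n = 2000
x = rng.integers(0, 3, size=n)
z = (x + rng.integers(0, 2, size=n)) % 3
y = (z + rng.integers(0, 2, size=n)) % 3

g_cond, dof_cond, p_cond = conditional_g_test(x, y, z)
# A constant conditioning variable yields one stratum, i.e. the unconditional test.
g_marg, dof_marg, p_marg = conditional_g_test(x, y, np.zeros(n, dtype=int))
print(f"conditional:   G = {g_cond:.2f}, df = {dof_cond}, p = {p_cond:.4f}")
print(f"unconditional: G = {g_marg:.2f}, df = {dof_marg}, p = {p_marg:.2e}")
```

The sketch applies the degrees-of-freedom formula as written, counting every stratum in full even when some rows or columns within a stratum are empty, as happens in this chain example.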

## Assumptions

- **Categorical Data**: The variables under consideration must be discrete (categorical).
- **Independent Samples**: The observations are assumed to be drawn independently from the population.
- **Sufficient Sample Size**: As an asymptotic test, its validity depends on the sample size being large enough. While the G-test is often considered more reliable than Pearson's Chi-Squared test for smaller sample sizes (Sokal & Rohlf, 1981), caution is still advised. A common rule of thumb is that the test may be unreliable if more than 20% of the cells in the contingency table have an expected frequency of less than 5.
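The rule of thumb for expected frequencies is easy to check before running the test. The helper below is a small sketch; the name `fraction_low_expected` and the example tables are assumptions made for illustration and are not part of `citk`.

```python
import numpy as np


def fraction_low_expected(table, threshold=5.0):
    """Fraction of cells whose expected count under independence is below threshold."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(np.mean(expected < threshold))


# Hypothetical sparse table: several expected counts fall below 5.
sparse = np.array([[12, 3, 1], [9, 4, 2]])
# Hypothetical well-populated table: all expected counts are large.
dense = np.array([[50, 40], [30, 60]])

for name, table in [("sparse", sparse), ("dense", dense)]:
    frac = fraction_low_expected(table)
    flag = "may be unreliable" if frac > 0.20 else "ok"
    print(f"{name}: {frac:.0%} of cells have expected count < 5 ({flag})")
```

For a conditional test, the same check would be applied to each stratum's table, since strata with few observations are the usual source of small expected counts.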

## Code Example

```python
import numpy as np
from citk.tests import GSq

# Generate discrete data for a chain: X -> Z -> Y
rng = np.random.default_rng(0)  # fixed seed so the example is reproducible
n = 500
X = rng.integers(0, 3, size=n)
Z = (X + rng.integers(0, 2, size=n)) % 3
Y = (Z + rng.integers(0, 2, size=n)) % 3
data = np.vstack([X, Y, Z]).T  # columns: 0 = X, 1 = Y, 2 = Z

# Initialize the test
g_sq_test = GSq(data)

# Test for unconditional independence of X and Y
# Expected: p-value is small (reject H0 of independence)
p_value_unconditional = g_sq_test(0, 1)
print(f"P-value (unconditional) for X _||_ Y: {p_value_unconditional:.4f}")

# Test for conditional independence of X and Y given Z
# Expected: p-value is large (cannot reject H0 of independence)
p_value_conditional = g_sq_test(0, 1, [2])
print(f"P-value (conditional) for X _||_ Y | Z: {p_value_conditional:.4f}")
```

## API Reference

For a full list of parameters, see the API documentation: {class}`citk.tests.simple_tests.GSq`.

## References

Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience.

Sokal, R. R., & Rohlf, F. J. (1981). Biometry: The Principles and Practice of Statistics in Biological Research. W. H. Freeman.

Spirtes, P., Glymour, C. N., & Scheines, R. (2000). Causation, Prediction, and Search. MIT Press.

Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9(1), 60–62.