Tutorial: a first PC run¶
This tutorial walks through one full invocation of the PC algorithm on observational data, from data simulation through CPDAG recovery to structural evaluation. The reader is assumed to be familiar with directed acyclic graphs, conditional independence, and the standard faithfulness / causal Markov assumptions; for the underlying theory see Explanation: PC theory.
Setup¶
We use the sister package dagsampler
to simulate data from a known DAG so that the recovered CPDAG can
be compared against ground truth. cbcd does not depend on
dagsampler; the two packages communicate through the structural
cbcd.CITest Protocol.
from dagsampler import CausalDataGenerator
from cbcd import pc
cfg = {
"simulation_params": {
"n_samples": 3000,
"seed_structure": 1,
"seed_data": 2,
"binary_proportion": 0.0,
},
"graph_params": {
"type": "custom",
"nodes": ["A", "B", "C", "D"],
"edges": [["A", "B"], ["A", "C"],
["B", "D"], ["C", "D"]],
},
}
result = CausalDataGenerator(cfg).simulate()
The DAG A → B, A → C, B → D, C → D has a single v-structure
(B → D ← C). Under faithfulness the only orientation that PC can
recover from observational data alone is the collider into D;
the upper edges (A — B, A — C) belong to the same Markov
equivalence class as their reversals and are returned undirected
in the CPDAG.
Recovery¶
The default invocation uses cbcd’s bundled Fisher–Z partial correlation test:
cpdag = pc(result["data"], alpha=0.05)
Under the cbcd.CITest Protocol any conforming object may be
substituted; for example,
oracle = result_gen.as_ci_oracle() # dagsampler.DSeparationOracle
cpdag_oracle = pc(result["data"], ci_test=oracle, alpha=0.05)
returns the CPDAG that PC would recover under a perfect d-separation
oracle. Because PC is sound and complete under faithfulness with a
perfect oracle, cpdag_oracle is the true CPDAG of the generating
DAG; it provides a principled benchmark against which the empirical
recovery cpdag can be scored. See
Explanation: PC theory for the soundness
and completeness statements.
Reading the output¶
cpdag is an instance of cbcd.CPDAG, internally backed by an
int8 endpoint matrix following the convention in
D5 (docs/design/api_v0.py):
value |
meaning |
|---|---|
|
no edge |
|
TAIL — the |
|
ARROW — the |
A directed edge \(i \to j\) is encoded as endpoints[i, j] == ARROW
and endpoints[j, i] == TAIL; an undirected edge as both ends
TAIL. For the diamond fixture above, cpdag.endpoints will show
B → D and C → D oriented (the v-structure) with A — B and
A — C left undirected.
print(cpdag.endpoints)
# array([[0, 1, 1, 0],
# [1, 0, 0, 2],
# [1, 0, 0, 2],
# [0, 1, 1, 0]], dtype=int8)
Choosing α¶
The significance level \(\alpha\) controls the type-I error rate of each individual CI test. Under independent CI tests, the family-wise type-I rate of the skeleton phase grows with the number of performed tests; in practice cbcd does not apply a multiple-testing correction, and the user is expected to choose \(\alpha\) in light of sample size and graph density. The default \(\alpha = 0.05\) is common in the literature (Spirtes et al., 2000). For large \(n\), tighter values (e.g. \(\alpha = 0.01\)) reduce spurious adjacencies; for small \(n\), looser values may be required to retain power.
A more thorough treatment of this tradeoff appears in How-to: choosing a CI test.
Evaluation against the oracle¶
Endpoint-by-endpoint structural distance between the empirical and
oracle CPDAGs is reported by the bnm package via its
Structural Hamming Distance
function. Like the dagsampler ↔ cbcd connection, cbcd ↔ bnm passes
through a structural Protocol (bnm.GraphLike) — neither package
imports the other.
import bnm
shd = bnm.shd(cpdag_oracle, cpdag)
f1 = bnm.f1(cpdag_oracle, cpdag)
For the diamond fixture at \(n = 3000\) with default Fisher–Z and
\(\alpha = 0.05\), this yields \(\mathrm{SHD} = 2\) and \(F_1 = 0.50\).
The two-edge gap is in the upper edges (A — B, A — C): the
oracle leaves them undirected, while Fisher–Z at this sample size
spuriously orients them in one direction. This kind of orientation
drift on undirected CPDAG edges is documented and expected; it is
not a regression in PC’s logic but a power limitation of the
finite-sample CI test.
What is next¶
A different CI test for non-Gaussian or discrete data — How-to: choosing a CI test.
The full algorithmic theory and assumptions — Explanation: PC theory.
Time-series data —
cbcd.pcmci, treated under the same Diátaxis structure in the time-series tutorial.
References¶
Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search (2nd ed.). MIT Press.