Model Formulations¶
This page describes the mathematical structure implemented by the simulator and the valid combinations of node types, structural equations, and noise models.
Notation¶
| Symbol | Meaning |
|---|---|
| \(G = (V, E)\) | Directed acyclic graph with node set \(V\) and edge set \(E\). |
| \(j \in V\) | A node (variable) in the graph. |
| \(\mathrm{Pa}(j)\) | Set of parent nodes of \(j\) in \(G\). |
| \(X_j\) | Random variable associated with node \(j\). |
| \(X_{\mathrm{Pa}(j)}\) | The vector of parent values for node \(j\). |
| \(K\) | Cardinality of a categorical variable (number of classes). |
| \(\mathcal{D}_j\) | Marginal distribution of an exogenous continuous node \(j\) (Gaussian, Student-t, Gamma, or Exponential). |
| \(p_j\) | Success probability of an exogenous Bernoulli node \(j\). |
| \(\pi_{j,k}\) | Class probability for category \(k\) of an exogenous categorical node \(j\); satisfies \(\sum_k \pi_{j,k} = 1\). |
| \(f_j(\cdot)\) | Structural function mapping parents of \(j\) to its mean signal. |
| \(\epsilon_j\) | Noise term for node \(j\) (additive, multiplicative, or heteroskedastic). |
| \(w_{jp}\) | Structural weight from parent \(p\) to child \(j\). |
| \(d_{jp}\) | Polynomial degree applied to parent \(p\) in the structural form for child \(j\). |
| \(w_j\) | Single weight used by the interaction and sigmoid structural forms. |
| \(\mu_s\) | Mean assigned to categorical-parent stratum \(s\) in the stratum_means structural form. |
| \(s(\mathbf{x}_{\mathrm{Pa}(j)})\) | Stratum index determined by the categorical parent values. |
| \(L, H\) | Lower / upper bounds for random structural weight sampling (`random_weight_low`, `random_weight_high`). |
| \(m\) | Near-zero exclusion radius for random weights (`random_weight_min_abs`). |
| \(\sigma_j(\cdot)\) | Heteroskedastic noise scale as a function of parents. |
| \(z\) | Standard normal draw, \(z \sim \mathcal{N}(0, 1)\). |
| \(\eta_j\) | Latent signal for an endogenous binary node before the logistic link. |
| \(\sigma(t)\) | Logistic sigmoid, \(\sigma(t) = 1 / (1 + e^{-t})\). |
| \(\ell_{jk}\) | Logit for class \(k\) of an endogenous categorical node \(j\). |
| \(b_{jk}\) | Intercept for class \(k\) in the logistic categorical model. |
| \(g_{jpk}(X_p)\) | Contribution of parent \(p\) to logit \(\ell_{jk}\). |
| \(\tau_{j1}, \dots, \tau_{j(K-1)}\) | Cut-points used by the threshold categorical model for node \(j\). |
| \(\perp\!\!\!\perp\) | Conditional independence (used in the CI oracle section). |
The simulator draws from two independent random streams: one seeds the
data-generating process (DAG topology, structural weights, intercepts,
thresholds, stratum means) and the other seeds the per-sample draws
(exogenous values, noise, Bernoulli/categorical sampling). They are configured
via seed_structure and seed_data respectively, or jointly via a single
seed (see the Seeding section in Configuration Examples).
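As a sketch of this two-stream design using `numpy` directly (the variable names below are illustrative, not the simulator's internals):

```python
import numpy as np

# Two independent RNG streams: one for DGP parameters, one for per-sample draws.
rng_structure = np.random.default_rng(7)    # plays the role of seed_structure
rng_data = np.random.default_rng(42)        # plays the role of seed_data

weights = rng_structure.uniform(0.5, 2.0, size=3)   # structural weights (DGP)
noise = rng_data.normal(0.0, 1.0, size=(100, 3))    # per-sample noise draws

# Re-creating only the data stream reproduces the samples under the same DGP.
noise_again = np.random.default_rng(42).normal(0.0, 1.0, size=(100, 3))
```

Because the streams are independent, fixing `seed_structure` while varying `seed_data` yields fresh samples from an identical data-generating process.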
Graph Model¶
The simulator generates a DAG \(G = (V, E)\) using one of:
- `custom`: user-defined node and edge sets
- `random`: random acyclic edges over ordered nodes
Node Types¶
Supported node types:
Continuous
Binary (values in \(\{0, 1\}\))
Categorical (values in \(\{0, \dots, K-1\}\), configurable cardinality \(K\))
Exogenous Nodes (\(\mathrm{Pa}(j)=\varnothing\))¶
Continuous exogenous node:

\[X_j \sim \mathcal{D}_j\]

where \(\mathcal{D}_j\) is one of Gaussian, Student-t, Gamma, or Exponential. Intuition: draw each value of \(X_j\) independently from the chosen marginal distribution.
Binary exogenous node:

\[X_j \sim \mathrm{Bernoulli}(p_j)\]

Intuition: a coin flip that returns 1 with probability \(p_j\) and 0 otherwise.
Categorical exogenous node:

\[P(X_j = k) = \pi_{j,k}, \quad k \in \{0, \dots, K-1\}\]

Intuition: a weighted dice roll that returns class \(k\) with probability \(\pi_{j,k}\).
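For concreteness, the three exogenous draws can be sketched with `numpy` (parameter values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Continuous exogenous: i.i.d. draws from the chosen marginal (Gaussian here).
x_cont = rng.normal(loc=0.0, scale=1.0, size=n)

# Binary exogenous: Bernoulli(p_j).
p_j = 0.3
x_bin = rng.binomial(1, p_j, size=n)

# Categorical exogenous: weighted draw over K = 3 classes with probs pi_jk.
pi_j = np.array([0.2, 0.5, 0.3])
x_cat = rng.choice(len(pi_j), size=n, p=pi_j)
```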
Endogenous Continuous Nodes¶
General form:

\[X_j = f_j(X_{\mathrm{Pa}(j)}) + \epsilon_j\]

(shown with additive noise; multiplicative and heteroskedastic variants are given under the noise models below). Intuition: the value of \(X_j\) is a deterministic function of its parents plus an independent noise draw.
Supported structural forms \(f_j\):
Linear:

\[f_j(X_{\mathrm{Pa}(j)}) = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p\]

Intuition: a weighted sum of the parent values.
Polynomial:

\[f_j(X_{\mathrm{Pa}(j)}) = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p^{d_{jp}}\]

Intuition: a weighted sum where each parent is first raised to its own fixed power.
Interaction:

\[f_j(X_{\mathrm{Pa}(j)}) = w_j \prod_{p \in \mathrm{Pa}(j)} X_p\]

Intuition: the product of all parent values, scaled by a single weight.
Sigmoid (tanh):

\[f_j(X_{\mathrm{Pa}(j)}) = w_j \tanh\Big(\sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p\Big)\]

Intuition: a smooth saturating nonlinearity: the weighted parent sum is squashed by tanh and rescaled by an output weight \(w_j\).
Cosine:

\[f_j(X_{\mathrm{Pa}(j)}) = \cos\Big(\sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p\Big)\]

Sine:

\[f_j(X_{\mathrm{Pa}(j)}) = \sin\Big(\sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p\Big)\]

Intuition: the parent values are first combined linearly, then passed through a periodic nonlinearity. Useful for stress-testing kernel-based CI tests on oscillatory dependencies.
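The metric structural forms above can be written compactly; a minimal `numpy` sketch (weights and degrees are illustrative):

```python
import numpy as np

# Each function maps a parent matrix X (rows = samples, columns = parents)
# to the mean signal f_j. Weights and degrees are illustrative.
def linear(X, w):             # weighted sum of parents
    return X @ w

def polynomial(X, w, d):      # each parent raised to its own degree first
    return (w * X ** d).sum(axis=1)

def interaction(X, w_j):      # product of all parents, one scaling weight
    return w_j * X.prod(axis=1)

def sigmoid_tanh(X, w, w_j):  # tanh-squashed weighted sum, rescaled by w_j
    return w_j * np.tanh(X @ w)

def sine(X, w):               # periodic transform of the weighted sum
    return np.sin(X @ w)

X = np.array([[1.0, 2.0], [0.5, -1.0]])
w = np.array([1.0, 0.5])
```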
Stratum-specific means (categorical parents to continuous child):

\[f_j(X_{\mathrm{Pa}(j)}) = \mu_{s(\mathbf{x}_{\mathrm{Pa}(j)})}\]

where \(s(\cdot)\) indexes the categorical parent stratum. Intuition: look up a pre-assigned mean for the combination of categorical parent values observed at this row.
When stratum_means is used with mixed parents (at least one categorical
parent plus one or more metric parents), the structural function combines a
stratum mean with a linear contribution from the metric parents:

\[f_j(X_{\mathrm{Pa}(j)}) = \mu_{s(\mathbf{x}_{\text{cat}})} + \sum_{p \in \text{metric}} w_{jp} X_p\]

The metric weights can be set explicitly via `functional_form.metric_weights`
(a dict per parent or a single number applied to all metric parents), or
sampled from the random-weight distribution if omitted.
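A minimal sketch of the mixed stratum-means combination (the stratum means and the metric weight are illustrative values):

```python
import numpy as np

# One categorical parent with 3 classes selects the stratum mean;
# one metric parent adds a linear term. All parameters illustrative.
mu = np.array([-1.0, 0.0, 2.0])   # mu_s for strata s = 0, 1, 2
w_metric = 0.5                    # weight on the metric parent

x_cat = np.array([0, 2, 1, 2])            # categorical parent values
x_met = np.array([1.0, -2.0, 0.0, 4.0])   # metric parent values

f_j = mu[x_cat] + w_metric * x_met        # stratum mean + linear metric term
```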
Random structural weights¶
When weights are omitted for linear, polynomial, or interaction,
the simulator samples weights from a configurable interval:

\[w_{jp} \sim \mathrm{Uniform}(L, H)\]

where \(L\) = `random_weight_low` and \(H\) = `random_weight_high`.
Intuition: when you don’t pin a weight, it’s drawn uniformly between
\(L\) and \(H\).
If `random_weight_min_abs` = \(m > 0\), values in \((-m, m)\) are excluded
and weights are sampled from:

\[w_{jp} \sim \mathrm{Uniform}\big([L, H] \setminus (-m, m)\big)\]
This guarantees a minimum signal strength on every edge, giving you direct control over how strongly each parent influences its child. Intuition: every edge contributes at least \(m\) worth of signal, so no parent ends up silently muted by a near-zero random draw.
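One way to realize this exclusion is rejection sampling; a sketch with illustrative defaults for \(L\), \(H\), and \(m\) (the simulator's actual sampling strategy may differ):

```python
import numpy as np

def sample_weight(rng, L=-2.0, H=2.0, m=0.5):
    """Uniform on [L, H] with the near-zero band (-m, m) excluded
    (rejection sampling; one illustrative implementation)."""
    while True:
        w = rng.uniform(L, H)
        if abs(w) >= m:
            return w

rng = np.random.default_rng(0)
ws = np.array([sample_weight(rng) for _ in range(1_000)])
```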
Noise models:
Additive:

\[X_j = f_j(X_{\mathrm{Pa}(j)}) + \epsilon_j\]

Intuition: the noise is added on top of the structural signal.
Additive noise distributions accepted under `noise_model.dist`:
- `gaussian` (parameter `std`)
- `student_t` (parameters `df`, `scale`)
- `gamma` (parameters `shape`, `scale`; centered to zero mean)
- `exponential` (parameter `scale`; centered to zero mean)
- `laplace` (parameter `scale`; zero-centered)
- `cauchy` (parameter `scale`; zero-centered, heavy-tailed)
- `uniform` (parameter `scale`; symmetric on \([-\text{scale}, \text{scale}]\))
Multiplicative:

\[X_j = f_j(X_{\mathrm{Pa}(j)}) \cdot \epsilon_j'\]

Intuition: the noise scales the structural signal, so the spread grows with the magnitude of \(f_j\).
Multiplicative noise also supports gaussian, student_t, gamma,
and exponential distributions for \(\epsilon_j'\). Gamma and
exponential factors are normalized to mean 1 so the structural signal is not
biased; all factors are clipped to a small positive minimum for numerical
safety.
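The mean-1 normalization can be sketched for a gamma factor (shape/scale values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
shape, scale = 2.0, 1.0

# A raw gamma factor has mean shape * scale; divide it out so the factor
# has mean 1 and does not bias the structural signal.
eps = rng.gamma(shape, scale, size=100_000) / (shape * scale)
eps = np.clip(eps, 1e-8, None)   # small positive floor for numerical safety

f = np.full_like(eps, 3.0)       # constant structural signal, for illustration
x = f * eps                      # multiplicative noise model
```

Because the factor has mean 1, the noisy output `x` keeps the structural mean of 3 on average.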
Heteroskedastic:

\[X_j = f_j(X_{\mathrm{Pa}(j)}) + \sigma_j(X_{\mathrm{Pa}(j)}) \cdot z\]

with registered \(\sigma_j(\cdot)\) choices:
- `abs_first_parent` (default when `func` is omitted)
- `abs_parent_plus_const`
- `mean_abs_plus_const`

Intuition: additive Gaussian noise whose standard deviation depends on the parent values.
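The registered scale functions might look like the following sketch (the names come from the list above; the additive constant `c` is an assumption, not a documented default):

```python
import numpy as np

def abs_first_parent(X):                 # sigma_j = |first parent|
    return np.abs(X[:, 0])

def abs_parent_plus_const(X, c=0.5):     # sigma_j = |first parent| + c
    return np.abs(X[:, 0]) + c

def mean_abs_plus_const(X, c=0.5):       # sigma_j = mean_p |x_p| + c
    return np.abs(X).mean(axis=1) + c

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
z = rng.normal(size=1_000)

# Heteroskedastic node: f_j(parents) + sigma_j(parents) * z,
# with a linear f_j for illustration.
x_j = X.sum(axis=1) + abs_first_parent(X) * z
```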
Post-nonlinear transform¶
Any continuous endogenous node may apply a final element-wise nonlinearity \(g\) to its output after the structural function and noise have been combined:

\[X_j = g\big(f_j(X_{\mathrm{Pa}(j)}) + \epsilon_j\big)\]

(shown for additive noise), where \(g\) is selected by `post_transform.name` from the registry:
| Name | Function |
|---|---|
| | \(\tanh(x)\) |
| | \(\sin(x)\) |
| | \(\cos(x)\) |
| | \(\exp(-\lvert x\rvert)\) |
| | \(\sqrt{\lvert x\rvert}\) |
| | \(\max(0, x)\) |
| | \(\mathrm{sign}(x)\) |
Intuition: the structural function and noise model determine the signal;
post_transform warps that signal afterwards. This is how the literature
typically realizes “post-nonlinear” DGPs (e.g., \(Y = \tanh(\text{linear}(X) + \epsilon)\)).
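The example from the text, sketched end-to-end (weights and noise scale are illustrative):

```python
import numpy as np

# Post-nonlinear DGP: Y = tanh(linear(X) + eps). The transform is applied
# last, after the structural function and the noise are combined.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 2))
w = np.array([1.0, -0.5])
eps = rng.normal(0.0, 0.1, size=1_000)

Y = np.tanh(X @ w + eps)   # post_transform g = tanh
```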
Endogenous Binary Nodes¶
Binary children use a logistic link on the latent signal:

\[\eta_j = f_j(X_{\mathrm{Pa}(j)}) + \epsilon_j\]

Intuition: build a continuous latent score from the parents and a noise term.

\[P(X_j = 1 \mid X_{\mathrm{Pa}(j)}) = \sigma(\eta_j)\]

Intuition: squash the latent score into a probability between 0 and 1.

\[X_j \sim \mathrm{Bernoulli}(\sigma(\eta_j))\]

Intuition: flip a biased coin with that probability to decide whether \(X_j\) is 0 or 1.
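The three steps can be sketched in `numpy` (weights and noise scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))
w = np.array([2.0, -1.0])

eta = X @ w + rng.normal(0.0, 0.5, size=len(X))  # latent signal eta_j
p = 1.0 / (1.0 + np.exp(-eta))                   # logistic link sigma(eta_j)
x_j = rng.binomial(1, p)                         # Bernoulli draw per row
```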
Endogenous Categorical Nodes¶
Two models are supported.
Logistic (multinomial softmax)

\[\ell_{jk} = b_{jk} + \sum_{p \in \mathrm{Pa}(j)} g_{jpk}(X_p)\]

Intuition: compute one logit per class as an intercept plus parent contributions.

\[P(X_j = k \mid X_{\mathrm{Pa}(j)}) = \frac{\exp(\ell_{jk})}{\sum_{k'=0}^{K-1} \exp(\ell_{jk'})}\]

Intuition: convert the logits into class probabilities via softmax, then sample a class from that distribution.
where \(g_{jpk}\) depends on parent type:
- continuous/binary parent: linear contribution per class. `weights[parent]` is a length-\(K\) vector, one coefficient per child class.
- categorical parent: class-specific lookup via a parent-category weight matrix of shape \((K_{\text{parent}}, K)\): one row per parent class, one column per child class.
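A sketch of the softmax model with one continuous parent and \(K = 3\) (intercepts and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
x_p = rng.normal(size=2_000)        # one continuous parent
b = np.array([0.0, 0.5, -0.5])      # intercepts b_jk
w = np.array([1.0, -1.0, 0.0])      # length-K weight vector for this parent

logits = b + np.outer(x_p, w)       # ell_jk = b_jk + w_k * x_p
shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
probs = np.exp(shifted)
probs /= probs.sum(axis=1, keepdims=True)              # softmax per row

x_j = np.array([rng.choice(K, p=pr) for pr in probs])  # sample a class
```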
Threshold (continuous-to-categorical)

\[\eta_j = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p\]

Intuition: form a continuous score from a weighted sum of parents.

\[X_j = k \quad \text{if } \tau_{jk} < \eta_j \le \tau_{j(k+1)}, \qquad \tau_{j0} = -\infty,\ \tau_{jK} = +\infty\]

Intuition: assign a class based on which bin the score falls into, defined by the cut-points \(\tau_{j1}, \dots, \tau_{j(K-1)}\).
If thresholds are not provided, defaults are set from a theoretical Gaussian quantile grid, not from realized sample quantiles. By default:
- `threshold_loc = 0.0`
- `threshold_scale` is sampled from `Uniform(0.5, 2.0)`
You can override both explicitly in config.
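A theoretical Gaussian quantile grid can be sketched with the standard library's `NormalDist`; here `loc` and `scale` play the role of `threshold_loc`/`threshold_scale` (values illustrative):

```python
import numpy as np
from statistics import NormalDist

K = 3
loc, scale = 0.0, 1.0   # threshold_loc / an illustrative threshold_scale

# K - 1 cut-points at the k/K quantiles of N(loc, scale**2),
# independent of any realized sample.
taus = [NormalDist(loc, scale).inv_cdf(k / K) for k in range(1, K)]

rng = np.random.default_rng(0)
score = rng.normal(size=30_000)   # stands in for the weighted parent sum
x_j = np.digitize(score, taus)    # bin index in {0, ..., K-1}
```

When the score is itself standard normal, this grid yields roughly equal class frequencies.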
Compatibility Matrix¶
| Child type | Parent types | Structural model | Noise / link |
|---|---|---|---|
| Continuous | Continuous, binary, categorical, or mixed | Linear, polynomial, interaction, sigmoid, cosine, sine, or stratum means | Additive, multiplicative, or heteroskedastic noise; optional post-transform |
| Binary | Continuous, binary, categorical, or mixed | Latent signal \(\eta_j\) from parents | Latent signal + noise, then logistic link and Bernoulli draw |
| Categorical | Continuous, binary, categorical, or mixed | Per-class logits (logistic) or weighted parent sum (threshold) | Softmax sampling (logistic) or threshold digitization |
For random structural weights, additional controls are:
random_weight_low, random_weight_high, and random_weight_min_abs.
The same random_weight_min_abs exclusion is applied to auto-sampled
categorical logistic weights as well.
Forced uniform marginals¶
Set simulation_params.force_uniform_marginals = true to override the
default randomized marginals on exogenous nodes:
- Exogenous binary (no explicit `p`): the simulator uses \(p = 0.5\) and generates an exact balanced 0/1 split rather than sampling \(X_j \sim \mathrm{Bernoulli}(0.5)\), eliminating small-sample fluctuations.
- Exogenous categorical (no explicit `probs`): the simulator uses uniform \(\pi_{j,k} = 1/K\) and enforces equal counts per class (with a small remainder distributed at random).
- Exogenous continuous: unchanged; distributional parameters are still sampled or read from the config.
If p (binary) or probs (categorical) is explicitly provided, the flag
is ignored for that node and your config wins.
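The exact balanced split for a binary node can be sketched as follows (with an odd \(n\) to show the remainder; how the simulator breaks that tie is not specified here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_001

# Exact balanced 0/1 split: deterministic counts, random order.
# With odd n, one extra 0 remains.
x = np.zeros(n, dtype=int)
x[: n // 2] = 1
rng.shuffle(x)
```

Shuffling randomizes the order without touching the counts, so the marginal is exactly balanced in every sample.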
Random node-type assignment¶
When graph_params.type = "random" and a node’s type is not pinned in
node_params, the simulator samples a type per node according to:
- `simulation_params.binary_proportion` (default `0.4`)
- `simulation_params.categorical_proportion` (default `0.0`)
- the remainder becomes continuous
Categorical parents in metric forms¶
Using categorical parents with linear, polynomial, or interaction
is blocked by default (categorical_parent_metric_form_policy = "error"),
because treating category codes as metric values can distort the intended DGP.
Set categorical_parent_metric_form_policy = "stratum_means" to auto-redirect
such cases to stratum_means.
For mixed parents (categorical + continuous/binary), redirected `stratum_means`
uses:

\[f_j(X_{\mathrm{Pa}(j)}) = \mu_{s(\mathbf{x}_{\text{cat}})} + \sum_{p \in \text{metric}} w_{jp} X_p\]

where categorical parents select the stratum mean and metric parents contribute an additive linear term.
Stratum means reproducibility¶
For stratum_means with multiple categorical parents, all strata are
pre-enumerated and assigned means upfront, ensuring stable DGP parameters even
for rare/unseen strata in a particular sample.
CI Oracle (Ground Truth)¶
If `simulation_params.store_ci_oracle = true`, the simulator stores conditional
independence truth values from DAG d-separation:

\[X \perp\!\!\!\perp Y \mid S \quad\Longleftrightarrow\quad S \text{ d-separates } X \text{ and } Y \text{ in } G\]

for conditioning sets \(S\) up to size `ci_oracle_max_cond_set`.
Intuition: the oracle records, for every triple \((X, Y, S)\), whether
the DAG structure forces \(X\) and \(Y\) to be independent given
\(S\) — useful as ground truth for evaluating CI tests.