Model Formulations

This page describes the mathematical structure implemented by the simulator and the valid combinations of node types, structural equations, and noise models.

Notation

| Symbol | Meaning |
|---|---|
| \(G = (V, E)\) | Directed acyclic graph with node set \(V\) and edge set \(E\). |
| \(j \in V\) | A node (variable) in the graph. |
| \(\mathrm{Pa}(j)\) | Set of parent nodes of \(j\) in \(G\). |
| \(X_j\) | Random variable associated with node \(j\). |
| \(X_{\mathrm{Pa}(j)}\) | The vector of parent values for node \(j\). |
| \(K\) | Cardinality of a categorical variable (number of classes). |
| \(\mathcal{D}_j\) | Marginal distribution of an exogenous continuous node \(j\) (Gaussian, Student-t, Gamma, or Exponential). |
| \(p_j\) | Success probability of an exogenous Bernoulli node \(j\). |
| \(\pi_{j,k}\) | Class probability for category \(k\) of an exogenous categorical node \(j\); satisfies \(\sum_k \pi_{j,k} = 1\). |
| \(f_j(\cdot)\) | Structural function mapping parents of \(j\) to its mean signal. |
| \(\epsilon_j\) | Noise term for node \(j\) (additive, multiplicative, or heteroskedastic). |
| \(w_{jp}\) | Structural weight from parent \(p\) to child \(j\). |
| \(d_{jp}\) | Polynomial degree applied to parent \(p\) in the structural form for child \(j\). |
| \(w_j\) | Single interaction weight in the interaction form. |
| \(\mu_s\) | Mean assigned to categorical-parent stratum \(s\) in the stratum_means form. |
| \(s(\mathbf{x}_{\mathrm{Pa}(j)})\) | Stratum index determined by the categorical parent values. |
| \(L, H\) | Lower / upper bounds for random structural weight sampling (random_weight_low, random_weight_high). |
| \(m\) | Near-zero exclusion radius for random weights (random_weight_min_abs). |
| \(\sigma_j(\cdot)\) | Heteroskedastic noise scale as a function of parents. |
| \(z\) | Standard normal draw, \(z \sim \mathcal{N}(0, 1)\). |
| \(\eta_j\) | Latent signal for an endogenous binary node before the logistic link. |
| \(\sigma(t)\) | Logistic sigmoid, \(\sigma(t) = 1 / (1 + e^{-t})\). |
| \(\ell_{jk}\) | Logit for class \(k\) of an endogenous categorical node \(j\). |
| \(b_{jk}\) | Intercept for class \(k\) in the logistic categorical model. |
| \(g_{jpk}(X_p)\) | Contribution of parent \(p\) to logit \(\ell_{jk}\). |
| \(\tau_{j1}, \dots, \tau_{j(K-1)}\) | Cut-points used by the threshold categorical model for node \(j\). |
| \(\perp\!\!\!\perp\) | Conditional independence (used in the CI oracle section). |

The simulator draws from two independent random streams: one seeds the data-generating process (DAG topology, structural weights, intercepts, thresholds, stratum means) and the other seeds the per-sample draws (exogenous values, noise, Bernoulli/categorical sampling). They are configured via seed_structure and seed_data respectively, or jointly via a single seed (see the Seeding section in Configuration Examples).
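The two-stream scheme can be sketched with numpy generators (a minimal illustration of the idea, not the simulator's actual API; the seed values are arbitrary):

```python
import numpy as np

# One stream for the data-generating process (analogous to seed_structure),
# one for per-sample draws (analogous to seed_data).
rng_structure = np.random.default_rng(seed=1)
rng_data = np.random.default_rng(seed=2)

# A structural weight is drawn once from the structure stream ...
w = rng_structure.uniform(0.5, 2.0)

# ... while per-sample noise comes from the data stream, so redrawing the
# data with a different data seed leaves the weight unchanged.
noise = rng_data.normal(0.0, 1.0, size=5)
```

Because the streams are independent, fixing the structure seed while varying the data seed yields repeated samples from the same DGP.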

Graph Model

The simulator generates a DAG \(G = (V, E)\) using one of:

  • custom: user-defined node and edge sets

  • random: random acyclic edges over ordered nodes

Node Types

Supported node types:

  • Continuous

  • Binary (values in \(\{0, 1\}\))

  • Categorical (values in \(\{0, \dots, K-1\}\), configurable cardinality \(K\))

Exogenous Nodes (\(\mathrm{Pa}(j)=\varnothing\))

Continuous exogenous node:

\[X_j \sim \mathcal{D}_j\]

where \(\mathcal{D}_j\) is one of Gaussian, Student-t, Gamma, or Exponential. Intuition: draw each value of \(X_j\) independently from the chosen marginal distribution.

Binary exogenous node:

\[X_j \sim \mathrm{Bernoulli}(p_j)\]

Intuition: a coin flip that returns 1 with probability \(p_j\) and 0 otherwise.

Categorical exogenous node:

\[X_j \sim \mathrm{Categorical}(\pi_{j,0}, \dots, \pi_{j,K-1}), \quad \sum_k \pi_{j,k}=1\]

Intuition: a weighted dice roll that returns class \(k\) with probability \(\pi_{j,k}\).
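The three exogenous cases can be sketched in numpy (illustrative parameters, not the simulator's API):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Continuous exogenous node: independent draws from a chosen marginal D_j.
x_cont = rng.standard_normal(n)                  # Gaussian example

# Binary exogenous node: Bernoulli(p_j).
p_j = 0.3
x_bin = (rng.random(n) < p_j).astype(int)

# Categorical exogenous node: weighted draw over K classes.
pi_j = np.array([0.2, 0.5, 0.3])                 # class probabilities, sum to 1
x_cat = rng.choice(len(pi_j), size=n, p=pi_j)
```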

Endogenous Continuous Nodes

General form:

\[X_j = f_j(X_{\mathrm{Pa}(j)}) + \epsilon_j\]

Intuition: the value of \(X_j\) is a deterministic function of its parents plus an independent noise draw.

Supported structural forms \(f_j\):

Linear:

\[f_j = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p\]

Intuition: a weighted sum of the parent values.

Polynomial:

\[f_j = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p^{d_{jp}}\]

Intuition: a weighted sum where each parent is first raised to its own fixed power.

Interaction:

\[f_j = w_j \prod_{p \in \mathrm{Pa}(j)} X_p\]

Intuition: the product of all parent values, scaled by a single weight.

Sigmoid (tanh):

\[f_j = w_j \cdot \tanh\!\left( \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p \right)\]

Intuition: a smooth saturating nonlinearity — the weighted parent sum is squashed by tanh and rescaled by an output weight \(w_j\).

Cosine:

\[f_j = \cos\!\left( \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p \right)\]

Sine:

\[f_j = \sin\!\left( \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p \right)\]

Intuition: the parent values are first combined linearly, then passed through a periodic nonlinearity. Useful for stress-testing kernel-based CI tests on oscillatory dependencies.
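The structural forms above are simple vectorized expressions over the parent columns. A sketch with two parents and illustrative weights and degrees (assumed values, not defaults):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((n, 2))     # two parent columns
w = np.array([0.8, -1.2])           # per-parent weights w_{jp}
d = np.array([1, 2])                # polynomial degrees d_{jp}

f_linear = X @ w                    # sum_p w_{jp} X_p
f_poly = (X ** d) @ w               # sum_p w_{jp} X_p^{d_{jp}}
f_inter = 0.5 * X.prod(axis=1)      # w_j * prod_p X_p, with w_j = 0.5
f_tanh = 1.5 * np.tanh(X @ w)       # sigmoid (tanh) form, output weight 1.5
f_cos = np.cos(X @ w)               # cosine form
```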

Stratum-specific means (categorical parents to continuous child):

\[f_j = \mu_{s(\mathbf{x}_{\mathrm{Pa}(j)})}\]

where \(s(\cdot)\) indexes the categorical parent stratum. Intuition: look up a pre-assigned mean for the combination of categorical parent values observed at this row.

When stratum_means is used with mixed parents (at least one categorical parent plus one or more metric parents), the structural function combines a stratum mean with a linear contribution from the metric parents:

\[f_j = \mu_{s(\mathbf{x}_{\mathrm{cat}})} + \sum_{p \in \text{metric parents}} w_{jp} X_p\]

The metric weights can be set explicitly via functional_form.metric_weights (a dict per parent or a single number applied to all metric parents), or sampled from the random-weight distribution if omitted.
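The mixed-parent stratum_means form reduces to a table lookup plus a linear term. A sketch with one categorical and one metric parent (stratum means and the metric weight are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
x_cat = rng.integers(0, 3, size=n)      # categorical parent, K = 3
x_metric = rng.standard_normal(n)       # metric parent

mu = np.array([-1.0, 0.0, 2.0])         # pre-assigned stratum means mu_s
w_metric = 0.7                          # hypothetical metric weight

# f_j = mu_{s(x_cat)} + w * x_metric
f = mu[x_cat] + w_metric * x_metric
```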

Random structural weights

When weights are omitted for linear, polynomial, or interaction, the simulator samples weights from a configurable interval:

\[w \sim \mathrm{Uniform}(L, H)\]

where \(L\) is random_weight_low and \(H\) is random_weight_high. Intuition: when you don’t pin a weight, it’s drawn uniformly between \(L\) and \(H\).

If random_weight_min_abs = m > 0, values in \((-m, m)\) are excluded and weights are sampled from:

\[[L, -m] \cup [m, H]\]

(assuming \(L \le -m\) and \(m \le H\)). This guarantees a minimum signal strength on every edge: rather than letting random sampling produce effectively-zero coefficients, each parent contributes at least \(m\) worth of signal, so no edge is silently muted by the random draw.
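One simple way to realize the excluded-interval draw is rejection sampling; a sketch (the function name is illustrative, not the simulator's API):

```python
import numpy as np

def sample_weight(rng, low, high, min_abs=0.0):
    """Draw w ~ Uniform(low, high) with the interval (-min_abs, min_abs)
    excluded, via rejection sampling. Assumes low < -min_abs < min_abs < high."""
    while True:
        w = rng.uniform(low, high)
        if abs(w) >= min_abs:
            return w

rng = np.random.default_rng(0)
ws = [sample_weight(rng, -2.0, 2.0, min_abs=0.5) for _ in range(100)]
```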

Noise models:

Additive:

\[X_j = f_j + \epsilon_j\]

Intuition: the noise is added on top of the structural signal.

Additive noise distributions accepted under noise_model.dist:

  • gaussian (parameter std)

  • student_t (parameters df, scale)

  • gamma (parameters shape, scale; centered to zero mean)

  • exponential (parameter scale; centered to zero mean)

  • laplace (parameter scale; zero-centered)

  • cauchy (parameter scale; zero-centered, heavy-tailed)

  • uniform (parameter scale; symmetric on \([-\text{scale}, \text{scale}]\))

Multiplicative:

\[X_j = f_j \cdot (1 + \epsilon_j')\]

Intuition: the noise scales the structural signal, so the spread grows with the magnitude of \(f_j\).

Multiplicative noise also supports gaussian, student_t, gamma, and exponential distributions for \(\epsilon_j'\). Gamma and exponential factors are normalized to mean 1 so the structural signal is not biased; all factors are clipped to a small positive minimum for numerical safety.

Heteroskedastic:

\[X_j = f_j + \sigma_j(X_{\mathrm{Pa}(j)}) z, \quad z \sim \mathcal{N}(0,1)\]

Intuition: additive Gaussian noise whose standard deviation depends on the parent values.

with registered \(\sigma_j(\cdot)\) choices:

  • abs_first_parent (default when func is omitted)

  • abs_parent_plus_const

  • mean_abs_plus_const
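Side by side, the three noise models differ only in how the noise enters. A numpy sketch with illustrative scales (the heteroskedastic line uses the abs_first_parent choice, \(\sigma_j = |X_{\text{first parent}}|\)):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.array([0.5, -2.0, 4.0])           # structural signal f_j
x_parent = np.array([0.1, -3.0, 2.0])    # first parent's values

# Additive: noise added on top of the signal.
x_add = f + rng.normal(0.0, 1.0, size=f.shape)

# Multiplicative: noise scales the signal, spread grows with |f_j|.
x_mult = f * (1.0 + rng.normal(0.0, 0.2, size=f.shape))

# Heteroskedastic: Gaussian noise whose scale depends on the parents.
x_het = f + np.abs(x_parent) * rng.standard_normal(f.shape)
```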

Post-nonlinear transform

Any continuous endogenous node may apply a final element-wise nonlinearity to its output after the structural function and noise have been combined:

\[X_j \leftarrow g(X_j)\]

where \(g\) is selected by post_transform.name from the registry:

| Name | Function |
|---|---|
| tanh | \(\tanh(x)\) |
| sin | \(\sin(x)\) |
| cos | \(\cos(x)\) |
| exp_neg_abs | \(\exp(-\lvert x\rvert)\) |
| sqrt_abs | \(\sqrt{\lvert x\rvert}\) |
| relu | \(\max(0, x)\) |
| sign | \(\mathrm{sign}(x)\) |

Intuition: the structural function and noise model determine the signal; post_transform warps that signal afterwards. This is how the literature typically realizes “post-nonlinear” DGPs (e.g., \(Y = \tanh(\text{linear}(X) + \epsilon)\)).
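A registry of this kind is straightforward to mirror in numpy (a sketch of the table above, not the simulator's internal registry):

```python
import numpy as np

# Element-wise post-transforms corresponding to the table above.
POST_TRANSFORMS = {
    "tanh": np.tanh,
    "sin": np.sin,
    "cos": np.cos,
    "exp_neg_abs": lambda x: np.exp(-np.abs(x)),
    "sqrt_abs": lambda x: np.sqrt(np.abs(x)),
    "relu": lambda x: np.maximum(0.0, x),
    "sign": np.sign,
}

x = np.array([-2.0, 0.0, 3.0])       # signal after structure + noise
y = POST_TRANSFORMS["relu"](x)       # final element-wise warp
```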

Endogenous Binary Nodes

Binary children use a logistic link on the latent signal:

\[\eta_j = f_j(X_{\mathrm{Pa}(j)}) + \epsilon_j\]

Intuition: build a continuous latent score from the parents and a noise term.

\[\Pr(X_j=1 \mid X_{\mathrm{Pa}(j)}) = \sigma(\eta_j), \quad \sigma(t)=\frac{1}{1+e^{-t}}\]

Intuition: squash the latent score into a probability between 0 and 1.

\[X_j \sim \mathrm{Bernoulli}\!\left(\sigma(\eta_j)\right)\]

Intuition: flip a biased coin with that probability to decide whether \(X_j\) is 0 or 1.
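The three steps (latent score, logistic squash, Bernoulli draw) in a numpy sketch with one continuous parent and illustrative weight and noise scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x_parent = rng.standard_normal(n)

# Latent signal: f_j + eps_j (linear form, weight 1.5 assumed).
eta = 1.5 * x_parent + rng.normal(0.0, 0.5, size=n)

# Logistic link: squash into (0, 1).
prob = 1.0 / (1.0 + np.exp(-eta))

# Bernoulli draw with that probability.
x_j = (rng.random(n) < prob).astype(int)
```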

Endogenous Categorical Nodes

Two models are supported.

  1. Logistic (multinomial softmax)

\[\ell_{jk} = b_{jk} + \sum_{p \in \mathrm{Pa}(j)} g_{jpk}(X_p)\]

Intuition: compute one logit per class as an intercept plus parent contributions.

\[\Pr(X_j=k \mid X_{\mathrm{Pa}(j)}) = \frac{\exp(\ell_{jk})}{\sum_{m=0}^{K-1} \exp(\ell_{jm})}\]

Intuition: convert the logits into class probabilities via softmax, then sample a class from that distribution.

where \(g_{jpk}\) depends on parent type:

  • continuous/binary parent: linear contribution per class — weights[parent] is a length-\(K\) vector, one coefficient per child class.

  • categorical parent: class-specific lookup via a parent-category weight matrix of shape \((K_{\text{parent}}, K)\) — one row per parent class, one column per child class.
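For a single continuous parent, the logistic categorical model reduces to per-class logits plus a softmax. A sketch with assumed intercepts and a length-\(K\) weight vector:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 5, 3
x_parent = rng.standard_normal(n)          # one continuous parent

b = np.array([0.0, 0.5, -0.5])             # intercepts b_{jk}
w = np.array([1.0, -1.0, 0.3])             # length-K weights for this parent

# One logit per class: intercept plus linear parent contribution.
logits = b + np.outer(x_parent, w)         # shape (n, K)

# Softmax (shifted for numerical stability), then sample a class per row.
z = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
x_j = np.array([rng.choice(K, p=p) for p in probs])
```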

  2. Threshold (continuous-to-categorical)

\[s_j = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p\]

Intuition: form a continuous score from a weighted sum of parents.

\[X_j = \mathrm{digitize}(s_j; \tau_{j1}, \dots, \tau_{j(K-1)})\]

Intuition: assign a class based on which bin the score falls into, defined by the cut-points \(\tau_{j1}, \dots, \tau_{j(K-1)}\).

If thresholds are not provided, defaults are set from a theoretical Gaussian quantile grid, not from realized sample quantiles. By default:

  • threshold_loc = 0.0

  • threshold_scale is sampled from Uniform(0.5, 2.0)

You can override both explicitly in config.
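The threshold model maps directly onto np.digitize. A sketch for \(K = 3\) with an assumed weight and explicit cut-points (the values \(\pm 0.43\) approximate the standard-normal tertile quantiles mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x_parent = rng.standard_normal(n)

# Continuous score from a weighted parent sum (single parent, weight 0.9).
s = 0.9 * x_parent

# Cut-points tau_{j1}, tau_{j2}; approximate N(0, 1) tertile boundaries.
tau = np.array([-0.43, 0.43])

# Class index = which bin the score falls into: 0, 1, or 2.
x_j = np.digitize(s, tau)
```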

Compatibility Matrix

Supported combinations

| Child type | Parent types | Structural model | Noise / link |
|---|---|---|---|
| Continuous | Continuous, binary, categorical, or mixed | linear, polynomial, interaction, sigmoid, cos, sin, stratum_means (+ optional post_transform) | additive, multiplicative, heteroskedastic |
| Binary | Continuous, binary, categorical, or mixed | linear, polynomial, interaction, sigmoid, cos, sin, stratum_means | Latent signal + noise, then logistic link and Bernoulli draw |
| Categorical | Continuous, binary, categorical, or mixed | categorical_model = logistic or categorical_model = threshold | Softmax sampling (logistic) or threshold digitization |

For random structural weights, additional controls are: random_weight_low, random_weight_high, and random_weight_min_abs. The same random_weight_min_abs exclusion is applied to auto-sampled categorical logistic weights as well.

Forced uniform marginals

Set simulation_params.force_uniform_marginals = true to override the default randomized marginals on exogenous nodes:

  • Exogenous binary (no explicit p): the simulator uses \(p = 0.5\) and generates an exact balanced 0/1 split rather than sampling \(X_j \sim \mathrm{Bernoulli}(0.5)\), eliminating small-sample fluctuations.

  • Exogenous categorical (no explicit probs): the simulator uses uniform \(\pi_{j,k} = 1/K\) and enforces equal counts per class (with a small remainder distributed at random).

  • Exogenous continuous: unchanged — distributional parameters are still sampled or read from the config.

If p (binary) or probs (categorical) is explicitly provided, the flag is ignored for that node and your config wins.
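Exact balancing can be sketched by constructing the counts deterministically and shuffling (an illustration of the behavior described above, not the simulator's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

# Binary: exact n//2 ones, shuffled, instead of Bernoulli(0.5) sampling.
x_bin = np.zeros(n, dtype=int)
x_bin[: n // 2] = 1
rng.shuffle(x_bin)

# Categorical: equal counts per class, small remainder placed at random.
K = 3
base, rem = divmod(n, K)
x_cat = np.concatenate(
    [np.full(base, k) for k in range(K)]
    + [rng.choice(K, size=rem, replace=False)]
)
rng.shuffle(x_cat)
```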

Random node-type assignment

When graph_params.type = "random" and a node’s type is not pinned in node_params, the simulator samples a type per node according to:

  • simulation_params.binary_proportion (default 0.4)

  • simulation_params.categorical_proportion (default 0.0)

  • the remainder becomes continuous
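The per-node type draw amounts to a weighted choice over the three types; a sketch using the default proportions from the list above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes = 100

binary_prop = 0.4        # simulation_params.binary_proportion default
categorical_prop = 0.0   # simulation_params.categorical_proportion default

# Remainder of the probability mass goes to continuous.
types = rng.choice(
    ["binary", "categorical", "continuous"],
    size=n_nodes,
    p=[binary_prop, categorical_prop, 1.0 - binary_prop - categorical_prop],
)
```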

Categorical parents in metric forms

Using categorical parents with linear, polynomial, or interaction is blocked by default (categorical_parent_metric_form_policy = "error"), because treating category codes as metric values can distort the intended DGP.

Set categorical_parent_metric_form_policy = "stratum_means" to auto-redirect such cases to stratum_means.

For mixed parents (categorical + continuous/binary), redirected stratum_means uses:

\[f_j = \mu_{\text{cat-stratum}} + \sum_{p \in \text{metric parents}} w_p X_p\]

where categorical parents select the stratum mean and metric parents contribute an additive linear term.

Stratum means reproducibility

For stratum_means with multiple categorical parents, all strata are pre-enumerated and assigned means upfront, ensuring stable DGP parameters even for rare/unseen strata in a particular sample.

CI Oracle (Ground Truth)

If simulation_params.store_ci_oracle = true, the simulator stores conditional independence truth values from DAG d-separation:

\[X \perp\!\!\!\perp Y \mid S \iff S \text{ is a d-separator of } X \text{ and } Y \text{ in } G\]

for conditioning sets up to ci_oracle_max_cond_set. Intuition: the oracle records, for every triple \((X, Y, S)\), whether the DAG structure forces \(X\) and \(Y\) to be independent given \(S\) — useful as ground truth for evaluating CI tests.