SCM formulation

This page describes the mathematical structure that dagsampler instantiates and the valid combinations of node types, structural equations, and noise models.

Notation

| Symbol | Meaning |
| --- | --- |
| \(G = (V, E)\) | Directed acyclic graph with node set \(V\) and edge set \(E\). |
| \(j \in V\) | A node (variable) in the graph. |
| \(\mathrm{Pa}(j)\) | Set of parent nodes of \(j\) in \(G\). |
| \(X_j\) | Random variable associated with node \(j\). |
| \(X_{\mathrm{Pa}(j)}\) | The vector of parent values for node \(j\). |
| \(K\) | Cardinality of a categorical variable (number of classes). |
| \(\mathcal{D}_j\) | Marginal distribution of an exogenous continuous node \(j\) (Gaussian, Student-t, gamma, or exponential). |
| \(p_j\) | Success probability of an exogenous Bernoulli node \(j\). |
| \(\pi_{j,k}\) | Class probability for category \(k\) of an exogenous categorical node \(j\); satisfies \(\sum_k \pi_{j,k} = 1\). |
| \(f_j(\cdot)\) | Structural function mapping parents of \(j\) to its mean signal. |
| \(\epsilon_j\) | Noise term for node \(j\) (additive, multiplicative, or heteroskedastic). |
| \(\sigma_j(\cdot)\) | Heteroskedastic noise scale as a function of parents. |
| \(z\) | Standard normal draw, \(z \sim \mathcal{N}(0, 1)\). |
| \(\eta_j\) | Latent signal for an endogenous binary node before the logistic link. |
| \(\sigma(t)\) | Logistic sigmoid, \(\sigma(t) = 1 / (1 + e^{-t})\). |
| \(\ell_{jk}\) | Logit for class \(k\) of an endogenous categorical node \(j\). |
| \(b_{jk}\) | Intercept for class \(k\) in the logistic categorical model. |
| \(g_{jpk}(X_p)\) | Contribution of parent \(p\) to logit \(\ell_{jk}\). |
| \(\tau_{j1}, \dots, \tau_{j(K-1)}\) | Cut-points used by the threshold categorical model for node \(j\). |
| \(\perp\!\!\!\perp\) | Conditional independence (used in the CI oracle section). |

The simulator draws from two independent random streams. One drives the data-generating process (DAG topology, structural weights, intercepts, thresholds, stratum means); the other drives the per-sample draws (exogenous values, noise, Bernoulli / categorical sampling). They are configured via seed_structure and seed_data respectively, or jointly via a single seed (see the seeding section in the configuration cookbook).
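The separation of the two streams can be illustrated with a minimal sketch. It uses Python's standard `random` module rather than dagsampler itself; the functions `make_streams` and `simulate` are hypothetical names introduced only for this illustration, and the single-edge model is an assumption chosen to keep the example short.

```python
import random

def make_streams(seed_structure, seed_data):
    """Return two independent generators: one for the DGP, one for the samples."""
    return random.Random(seed_structure), random.Random(seed_data)

def simulate(seed_structure, seed_data, n=5):
    rng_structure, rng_data = make_streams(seed_structure, seed_data)
    # Structure stream: draws a DGP parameter (here a single edge weight).
    weight = rng_structure.uniform(-1.0, 1.0)
    # Data stream: draws the per-sample exogenous values and noise.
    rows = []
    for _ in range(n):
        x = rng_data.gauss(0.0, 1.0)
        y = weight * x + rng_data.gauss(0.0, 0.1)
        rows.append((x, y))
    return weight, rows

# Same structure seed, different data seeds: same DGP, fresh samples.
w_a, rows_a = simulate(seed_structure=1, seed_data=10)
w_b, rows_b = simulate(seed_structure=1, seed_data=11)
```

Holding the structure seed fixed while varying the data seed yields repeated datasets from one fixed DGP, which is the usual setup for replication studies.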

Graph model

The simulator generates a DAG \(G = (V, E)\) using one of:

  • custom — user-defined node and edge sets.

  • random — random acyclic edges over ordered nodes.
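As a rough sketch of the random mode, edges can be sampled only from earlier to later nodes in a fixed ordering, which rules out cycles by construction. The function name `random_dag` and the parameter `edge_prob` are assumptions for illustration; the document does not state how dagsampler parameterises edge density.

```python
import random

def random_dag(nodes, edge_prob, rng):
    """Sample edges parent -> child only when parent precedes child in `nodes`."""
    edges = []
    for i, parent in enumerate(nodes):
        for child in nodes[i + 1:]:
            # Each forward pair becomes an edge independently (assumed scheme).
            if rng.random() < edge_prob:
                edges.append((parent, child))
    return edges

rng = random.Random(0)
nodes = ["X0", "X1", "X2", "X3"]
edges = random_dag(nodes, edge_prob=0.5, rng=rng)
```

Because every edge points forward in the ordering, the ordering itself is a valid topological order of the result.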

Node types

Supported node types:

  • Continuous.

  • Binary (values in \(\{0, 1\}\)).

  • Categorical (values in \(\{0, \dots, K-1\}\), configurable cardinality \(K\)).

Exogenous nodes (\(\mathrm{Pa}(j) = \varnothing\))

Continuous exogenous node:

\[ X_j \sim \mathcal{D}_j \]

where \(\mathcal{D}_j\) is one of Gaussian, Student-t, gamma, or exponential.

Binary exogenous node:

\[ X_j \sim \mathrm{Bernoulli}(p_j). \]

Categorical exogenous node:

\[ X_j \sim \mathrm{Categorical}(\pi_{j,0}, \dots, \pi_{j,K-1}), \quad \sum_k \pi_{j,k} = 1. \]

Endogenous continuous nodes

General form:

\[ X_j = f_j(X_{\mathrm{Pa}(j)}) + \epsilon_j. \]

Supported structural forms \(f_j\) are listed below.

Linear.

\[ f_j = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p. \]

Polynomial.

\[ f_j = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p^{d_{jp}}. \]

Interaction.

\[ f_j = w_j \prod_{p \in \mathrm{Pa}(j)} X_p. \]

Sigmoid (tanh).

\[ f_j = w_j \cdot \tanh\!\left( \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p \right). \]

A smooth saturating nonlinearity — the weighted parent sum is squashed by tanh and rescaled by an output weight \(w_j\).

Cosine.

\[ f_j = \cos\!\left( \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p \right). \]

Sine.

\[ f_j = \sin\!\left( \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p \right). \]

The parent values are first combined linearly, then passed through a periodic nonlinearity. Useful for stress-testing kernel-based CI tests on oscillatory dependencies.
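The structural forms above translate directly into small functions. The following is a sketch for a single sample, written with plain Python lists; the function names are chosen for this illustration and are not taken from dagsampler.

```python
import math

def linear(x, w):
    """sum_p w_jp * X_p"""
    return sum(wp * xp for wp, xp in zip(w, x))

def polynomial(x, w, d):
    """sum_p w_jp * X_p ** d_jp"""
    return sum(wp * xp ** dp for wp, xp, dp in zip(w, x, d))

def interaction(x, w_out):
    """w_j * prod_p X_p"""
    return w_out * math.prod(x)

def sigmoid_tanh(x, w, w_out):
    """w_j * tanh(sum_p w_jp * X_p)"""
    return w_out * math.tanh(linear(x, w))

def cosine(x, w):
    return math.cos(linear(x, w))

def sine(x, w):
    return math.sin(linear(x, w))

parents = [1.0, 2.0]
weights = [0.5, -0.25]          # weighted parent sum is 0.5 - 0.5 = 0.0
f_linear = linear(parents, weights)
f_poly = polynomial(parents, weights, d=[2, 3])
f_inter = interaction(parents, w_out=3.0)
```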

Stratum-specific means (categorical parents to continuous child).

\[ f_j = \mu_{s(\mathbf{x}_{\mathrm{Pa}(j)})}, \]

where \(s(\cdot)\) indexes the categorical parent stratum.

When stratum_means is used with mixed parents (at least one categorical parent plus one or more metric parents), the structural function combines a stratum mean with a linear contribution from the metric parents:

\[ f_j = \mu_{s(\mathbf{x}_{\mathrm{cat}})} + \sum_{p \in \text{metric parents}} w_{jp} X_p. \]

The metric weights can be set explicitly via functional_form.metric_weights (a dict per parent or a single number applied to all metric parents), or sampled from the random-weight distribution if omitted.
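A minimal sketch of the mixed-parent case, assuming the stratum means are held in a dictionary keyed by the tuple of categorical parent values. The function `stratum_mean_signal` and the parent names `X_age` and `X_dose` are hypothetical.

```python
def stratum_mean_signal(cat_values, metric_values, stratum_means, metric_weights):
    """f_j = mu_{s(x_cat)} + sum over metric parents of w_jp * X_p."""
    mu = stratum_means[tuple(cat_values)]
    return mu + sum(metric_weights[p] * v for p, v in metric_values.items())

# One categorical parent with three classes, two metric parents.
stratum_means = {(0,): -1.0, (1,): 0.5, (2,): 2.0}
metric_weights = {"X_age": 0.25, "X_dose": -0.5}

f = stratum_mean_signal(
    cat_values=[2],
    metric_values={"X_age": 4.0, "X_dose": 2.0},
    stratum_means=stratum_means,
    metric_weights=metric_weights,
)
# 2.0 (stratum mean) + 0.25 * 4.0 - 0.5 * 2.0
```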

Random structural weights

When weights are omitted for linear, polynomial, or interaction, the simulator samples weights from a configurable interval:

\[ w \sim \mathrm{Uniform}(L, H), \]

where \(L =\) random_weight_low and \(H =\) random_weight_high.

If random_weight_min_abs = m > 0, values in \((-m, m)\) are excluded and weights are sampled from:

\[ [L, -m] \cup [m, H]. \]

This guarantees a minimum signal strength on every edge, giving direct control over how strongly each parent influences its child rather than letting random sampling produce effectively-zero coefficients.
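One way to realise the exclusion is rejection sampling: draw from \(\mathrm{Uniform}(L, H)\) and retry while \(|w| < m\), which yields a uniform draw over the allowed region. This is a sketch under that assumption; the document does not say which sampling scheme dagsampler uses, and `sample_weight` is a hypothetical name.

```python
import random

def sample_weight(low, high, min_abs, rng):
    """Draw w ~ Uniform(low, high), rejecting draws with |w| < min_abs."""
    if min_abs > 0 and low >= -min_abs and high <= min_abs:
        # Nothing of positive length survives the exclusion.
        raise ValueError("interval [low, high] lies inside the excluded band")
    while True:
        w = rng.uniform(low, high)
        if abs(w) >= min_abs:
            return w

rng = random.Random(0)
weights = [sample_weight(-2.0, 2.0, 0.5, rng) for _ in range(1000)]
```

With \(L = -2\), \(H = 2\), and \(m = 0.5\), every sampled weight lies in \([-2, -0.5] \cup [0.5, 2]\).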

Noise models

Additive.

\[ X_j = f_j + \epsilon_j. \]

Additive noise distributions accepted under noise_model.dist:

  • gaussian (parameter std)

  • student_t (parameters df, scale)

  • gamma (parameters shape, scale; centred to zero mean)

  • exponential (parameter scale; centred to zero mean)

  • laplace (parameter scale; zero-centred)

  • cauchy (parameter scale; zero-centred, heavy-tailed)

  • uniform (parameter scale; symmetric on \([-\text{scale}, \text{scale}]\))

Multiplicative.

\[ X_j = f_j \cdot (1 + \epsilon_j'). \]

The noise scales the structural signal, so the spread grows with the magnitude of \(f_j\). Multiplicative noise supports the gaussian, student_t, gamma, and exponential distributions for \(\epsilon_j'\). For gamma and exponential, the factor \(1 + \epsilon_j'\) is normalised to mean 1 so the structural signal is not biased; all factors are clipped to a small positive minimum for numerical safety.

Heteroskedastic.

\[ X_j = f_j + \sigma_j(X_{\mathrm{Pa}(j)})\, z, \quad z \sim \mathcal{N}(0, 1). \]

Additive Gaussian noise whose standard deviation depends on the parent values. Registered \(\sigma_j(\cdot)\) choices:

  • abs_first_parent (default when func is omitted)

  • abs_parent_plus_const

  • mean_abs_plus_const
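The three noise models can be sketched for a single sample as follows. Several details are assumptions made for illustration: the clipping floor `min_factor`, and the reading of abs_first_parent as the absolute value of the first parent. The function names are hypothetical.

```python
import random

def additive_gamma(f, shape, scale, rng):
    """Additive gamma noise, centred by subtracting its mean shape * scale."""
    eps = rng.gammavariate(shape, scale) - shape * scale
    return f + eps

def multiplicative_gaussian(f, std, rng, min_factor=1e-6):
    """X = f * (1 + eps'), with the factor clipped to a small positive minimum."""
    factor = max(1.0 + rng.gauss(0.0, std), min_factor)  # min_factor is assumed
    return f * factor

def abs_first_parent(parents):
    # Assumed reading of the registered name: |value of the first parent|.
    return abs(parents[0])

def heteroskedastic(f, parents, sigma_fn, rng):
    """X = f + sigma(parents) * z with z ~ N(0, 1)."""
    return f + sigma_fn(parents) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
x_add = additive_gamma(1.0, shape=2.0, scale=1.5, rng=rng)
x_mul = multiplicative_gaussian(1.0, std=0.2, rng=rng)
x_het = heteroskedastic(1.0, parents=[3.0, -1.0], sigma_fn=abs_first_parent, rng=rng)
```

In the heteroskedastic call, the noise standard deviation is 3.0 because the first parent has value 3.0; a first parent of 0 would give a noiseless output.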

Post-nonlinear transform

Any continuous endogenous node may apply a final element-wise nonlinearity to its output after the structural function and noise have been combined:

\[ X_j \leftarrow g(X_j), \]

where \(g\) is selected by post_transform.name from the registry:

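| Name | Function |
| --- | --- |
| tanh | \(\tanh(x)\) |
| sin | \(\sin(x)\) |
| cos | \(\cos(x)\) |
| exp_neg_abs | \(\exp(-\lvert x \rvert)\) |
| sqrt_abs | \(\sqrt{\lvert x \rvert}\) |
| relu | \(\max(0, x)\) |
| sign | \(\mathrm{sign}(x)\) |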

The structural function and noise model determine the signal; post_transform warps that signal afterwards. This is how the literature typically realises post-nonlinear DGPs (Zhang & Hyvärinen, 2009): e.g. \(Y = \tanh(\text{linear}(X) + \epsilon)\).
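A post-nonlinear node can be sketched as a lookup in a name-to-function registry applied after signal and noise are summed. The registry below mirrors the listed names; reading exp_neg_abs as \(\exp(-|x|)\) and sqrt_abs as \(\sqrt{|x|}\) is inferred from the names, and `POST_TRANSFORMS` and `post_nonlinear` are hypothetical identifiers.

```python
import math

POST_TRANSFORMS = {
    "tanh": math.tanh,
    "sin": math.sin,
    "cos": math.cos,
    "exp_neg_abs": lambda x: math.exp(-abs(x)),   # inferred from the name
    "sqrt_abs": lambda x: math.sqrt(abs(x)),      # inferred from the name
    "relu": lambda x: max(0.0, x),
    "sign": lambda x: (x > 0) - (x < 0),
}

def post_nonlinear(parent_values, weights, noise, name):
    """Y = g(linear(X) + eps): the transform is applied after noise is added."""
    signal = sum(w * x for w, x in zip(weights, parent_values)) + noise
    return POST_TRANSFORMS[name](signal)

y = post_nonlinear([1.0, 2.0], [0.5, 0.25], noise=0.0, name="tanh")
```

Applying \(g\) after the noise is what distinguishes this from the sigmoid structural form, where tanh acts on the noiseless parent sum and noise is added afterwards.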

Endogenous binary nodes

Binary children use a logistic link on the latent signal:

\[ \eta_j = f_j(X_{\mathrm{Pa}(j)}) + \epsilon_j, \]
\[ \Pr(X_j = 1 \mid X_{\mathrm{Pa}(j)}) = \sigma(\eta_j), \quad \sigma(t) = \frac{1}{1 + e^{-t}}, \]
\[ X_j \sim \mathrm{Bernoulli}\!\left(\sigma(\eta_j)\right). \]
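The three equations above amount to the following steps for one sample. This is a sketch with hypothetical function names; the branch in `logistic` is a common numerical-stability device and not something the document specifies.

```python
import math
import random

def logistic(t):
    """Sigmoid 1 / (1 + exp(-t)), evaluated without overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def sample_binary(f_value, noise, rng):
    eta = f_value + noise            # latent signal eta_j
    p = logistic(eta)                # Pr(X_j = 1 | parents)
    return int(rng.random() < p), p  # Bernoulli draw and its probability

rng = random.Random(0)
x_j, p_j = sample_binary(f_value=0.8, noise=-0.3, rng=rng)
```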

Endogenous categorical nodes

Two models are supported.

Logistic (multinomial softmax)

\[ \ell_{jk} = b_{jk} + \sum_{p \in \mathrm{Pa}(j)} g_{jpk}(X_p), \]
\[ \Pr(X_j = k \mid X_{\mathrm{Pa}(j)}) = \frac{\exp(\ell_{jk})}{\sum_{m=0}^{K-1} \exp(\ell_{jm})}. \]

The parent contribution \(g_{jpk}\) depends on the parent type:

  • continuous / binary parent: linear contribution per class — weights[parent] is a length-\(K\) vector, one coefficient per child class.

  • categorical parent: class-specific lookup via a parent-category weight matrix of shape \((K_{\text{parent}}, K)\) — one row per parent class, one column per child class.
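The logit construction for both parent types can be sketched as below. The argument layout of `class_probabilities` is an assumption for illustration and does not reflect dagsampler's config schema; only the shapes (length-\(K\) vectors and \((K_{\text{parent}}, K)\) matrices) come from the description above.

```python
import math
import random

def softmax(logits):
    top = max(logits)                       # subtract the max for stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(intercepts, metric_parents, categorical_parents):
    """
    intercepts:          length-K list of b_jk.
    metric_parents:      list of (value, weights), weights of length K.
    categorical_parents: list of (category, matrix), matrix of shape (K_parent, K).
    """
    logits = list(intercepts)
    for value, weights in metric_parents:
        for k, w in enumerate(weights):
            logits[k] += w * value          # linear contribution per class
    for category, matrix in categorical_parents:
        for k, w in enumerate(matrix[category]):
            logits[k] += w                  # row lookup by parent class
    return softmax(logits)

probs = class_probabilities(
    intercepts=[0.0, 0.0, 0.0],
    metric_parents=[(2.0, [0.5, 0.0, -0.5])],
    categorical_parents=[(1, [[0.0, 0.0, 0.0], [-1.0, 0.0, 1.0]])],
)
x_j = random.Random(0).choices(range(3), weights=probs)[0]
```

In this example the two parent contributions cancel, so all three logits are 0 and the class probabilities are uniform.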

Threshold (continuous-to-categorical)

\[ s_j = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p, \]
\[ X_j = \mathrm{digitize}\!\left( s_j;\ \tau_{j1}, \dots, \tau_{j(K-1)} \right). \]

If thresholds are not provided, the cut-points default to a theoretical Gaussian quantile grid rather than to quantiles of the realised sample. The grid parameters default to:

  • threshold_loc = 0.0

  • threshold_scale is sampled from \(\mathrm{Uniform}(0.5, 2.0)\)

Both can be overridden in config.
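A sketch of the threshold model, assuming the default grid places the \(K-1\) cut-points at the \(1/K, \dots, (K-1)/K\) quantiles of \(\mathcal{N}(\text{threshold\_loc}, \text{threshold\_scale}^2)\). The exact grid used by dagsampler is not stated here, so the equal-mass spacing is an assumption, as are the function names.

```python
from bisect import bisect_right
from statistics import NormalDist

def gaussian_quantile_thresholds(k, loc=0.0, scale=1.0):
    """K - 1 cut-points at the i/K quantiles of N(loc, scale^2) (assumed grid)."""
    dist = NormalDist(mu=loc, sigma=scale)
    return [dist.inv_cdf(i / k) for i in range(1, k)]

def digitize(score, thresholds):
    """Class index of `score`; thresholds must be sorted in ascending order."""
    return bisect_right(thresholds, score)

taus = gaussian_quantile_thresholds(k=4, loc=0.0, scale=1.0)
classes = [digitize(s, taus) for s in (-2.0, -0.1, 0.1, 2.0)]
```

Because the grid is theoretical, the realised class frequencies are balanced only when the score \(s_j\) is itself close to the assumed Gaussian.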

Compatibility matrix

| Child type | Parent types | Structural model | Noise / link |
| --- | --- | --- | --- |
| Continuous | Any | linear, polynomial, interaction, sigmoid, cos, sin, stratum_means (+ optional post_transform) | additive, multiplicative, heteroskedastic |
| Binary | Any | linear, polynomial, interaction, sigmoid, cos, sin, stratum_means | Latent signal + noise, then logistic link and Bernoulli draw |
| Categorical | Any | categorical_model = logistic or categorical_model = threshold | Softmax sampling (logistic) or threshold digitisation |

For random structural weights, additional controls are random_weight_low, random_weight_high, and random_weight_min_abs. The same random_weight_min_abs exclusion is applied to auto-sampled categorical logistic weights as well.

Forced uniform marginals

Setting simulation_params.force_uniform_marginals = true overrides the default randomised marginals on exogenous nodes:

  • Exogenous binary (no explicit p): the simulator uses \(p = 0.5\) and generates an exact balanced 0/1 split rather than sampling \(X_j \sim \mathrm{Bernoulli}(0.5)\), eliminating small-sample fluctuations.

  • Exogenous categorical (no explicit probs): the simulator uses uniform \(\pi_{j,k} = 1/K\) and enforces equal counts per class (with a small remainder distributed at random).

  • Exogenous continuous: unchanged — distributional parameters are still sampled or read from the config.

If p (binary) or probs (categorical) is explicitly provided, the flag is ignored for that node and the config wins.
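An exactly balanced sample can be built by constructing the labels deterministically and shuffling them. The sketch below treats the binary case as \(K = 2\); `balanced_categorical` is a hypothetical name, and assigning the remainder to distinct randomly chosen classes is one plausible reading of "distributed at random".

```python
import random

def balanced_categorical(n, k, rng):
    """Return n labels in {0, ..., k-1} whose class counts differ by at most one."""
    base, remainder = divmod(n, k)
    labels = [c for c in range(k) for _ in range(base)]
    # Leftover samples go to distinct classes chosen at random (assumed rule).
    labels.extend(rng.sample(range(k), remainder))
    rng.shuffle(labels)
    return labels

rng = random.Random(0)
binary = balanced_categorical(n=10, k=2, rng=rng)        # exact 5 / 5 split
categorical = balanced_categorical(n=11, k=3, rng=rng)   # counts 4, 4, 3 in some order
```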

Random node-type assignment

When graph_params.type = "random" and a node’s type is not pinned in node_params, the simulator samples a type per node according to:

  • simulation_params.binary_proportion (default \(0.4\)).

  • simulation_params.categorical_proportion (default \(0.0\)).

  • The remainder becomes continuous.
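A sketch of the assignment, assuming each unpinned node's type is drawn independently with the configured proportions as probabilities. Whether dagsampler instead fixes exact counts is not stated here; `sample_node_types` and its `pinned` argument are hypothetical.

```python
import random

def sample_node_types(nodes, binary_proportion, categorical_proportion, rng, pinned=None):
    """Assign a type to each node; pinned nodes keep their configured type."""
    if binary_proportion + categorical_proportion > 1.0:
        raise ValueError("proportions must sum to at most 1")
    pinned = pinned or {}
    types = {}
    for node in nodes:
        if node in pinned:
            types[node] = pinned[node]
            continue
        u = rng.random()
        if u < binary_proportion:
            types[node] = "binary"
        elif u < binary_proportion + categorical_proportion:
            types[node] = "categorical"
        else:
            types[node] = "continuous"   # the remainder
    return types

types = sample_node_types(
    ["A", "B", "C", "D"],
    binary_proportion=0.4,
    categorical_proportion=0.0,
    rng=random.Random(0),
    pinned={"A": "categorical"},
)
```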

Categorical parents in metric forms

Using categorical parents with linear, polynomial, or interaction is blocked by default (categorical_parent_metric_form_policy = "error"), because treating category codes as metric values can distort the intended DGP.

Setting categorical_parent_metric_form_policy = "stratum_means" auto-redirects such cases to stratum_means. For mixed parents (categorical + continuous / binary), the redirected stratum_means uses

\[ f_j = \mu_{\text{cat-stratum}} + \sum_{p \in \text{metric parents}} w_p X_p, \]

where categorical parents select the stratum mean and metric parents contribute an additive linear term.

Stratum-means reproducibility

For stratum_means with multiple categorical parents, all strata are pre-enumerated and assigned means upfront, ensuring stable DGP parameters even for rare or unseen strata in a particular sample.
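Pre-enumeration can be sketched with a Cartesian product over the parents' category ranges. The uniform distribution for the means and the bounds `low` and `high` are assumptions for illustration, as is the name `enumerate_stratum_means`.

```python
import itertools
import random

def enumerate_stratum_means(cardinalities, rng, low=-2.0, high=2.0):
    """Assign a mean to every combination of categorical parent values upfront."""
    strata = itertools.product(*(range(k) for k in cardinalities))
    # Means are drawn in a fixed enumeration order, independent of any sample.
    return {s: rng.uniform(low, high) for s in strata}

# Two categorical parents with 2 and 3 classes: 6 strata in total.
means_a = enumerate_stratum_means([2, 3], random.Random(7))
means_b = enumerate_stratum_means([2, 3], random.Random(7))
```

Because the means never depend on which strata happen to occur in the data, the same structure seed gives the same means for every sample size.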

CI oracle (ground truth)

If simulation_params.store_ci_oracle = true, the simulator stores conditional independence truth values from DAG d-separation:

\[ X \perp\!\!\!\perp Y \mid S \;\;\iff\;\; S \text{ is a d-separator of } X \text{ and } Y \text{ in } G, \]

for conditioning sets up to ci_oracle_max_cond_set. The oracle records, for every triple \((X, Y, S)\), whether the DAG structure forces \(X\) and \(Y\) to be independent given \(S\) — useful as ground truth for evaluating CI tests.
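For reference, d-separation can be checked with the classical moralised ancestral graph criterion: restrict the DAG to the ancestors of \(X\), \(Y\), and \(S\), moralise it, delete \(S\), and test whether \(X\) and \(Y\) are still connected. The following is a self-contained sketch of that textbook procedure, not dagsampler's implementation; it assumes \(X\), \(Y\), and \(S\) are disjoint.

```python
from itertools import combinations

def d_separated(edges, x, y, cond):
    """True if `cond` d-separates x and y in the DAG given by (parent, child) edges."""
    cond = set(cond)
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
        parents.setdefault(p, set())
    # 1. Keep x, y, the conditioning set, and all of their ancestors.
    keep, stack = set(), [x, y, *cond]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    # 2. Moralise: link each node to its parents and co-parents to each other.
    adj = {n: set() for n in keep}
    for child in keep:
        pa = parents.get(child, set())
        for p in pa:
            adj[child].add(p)
            adj[p].add(child)
        for a, b in combinations(pa, 2):
            adj[a].add(b)
            adj[b].add(a)
    # 3. Search from x in the undirected graph, never passing through cond.
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen or node in cond:
            continue
        seen.add(node)
        stack.extend(adj[node])
    return True

chain = [("A", "B"), ("B", "C")]
collider = [("A", "C"), ("B", "C"), ("C", "D")]
```

On the chain, conditioning on the middle node blocks the path; on the collider, conditioning on the common child or its descendant opens it.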

The lazy alternative — CausalDataGenerator.as_ci_oracle() — returns a DSeparationOracle satisfying the cbcd.CITest Protocol, suitable for direct use inside constraint-based algorithms; see How-to: working with the CI oracle.

References

  • Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.

  • Zhang, K., & Hyvärinen, A. (2009). On the identifiability of the post-nonlinear causal model. In Proceedings of UAI ’09, 647–655.