Model Formulations ================== This page describes the mathematical structure implemented by the simulator and the valid combinations of node types, structural equations, and noise models. Notation -------- .. list-table:: :header-rows: 1 :widths: 22 78 * - Symbol - Meaning * - :math:`G = (V, E)` - Directed acyclic graph with node set :math:`V` and edge set :math:`E`. * - :math:`j \in V` - A node (variable) in the graph. * - :math:`\mathrm{Pa}(j)` - Set of parent nodes of :math:`j` in :math:`G`. * - :math:`X_j` - Random variable associated with node :math:`j`. * - :math:`X_{\mathrm{Pa}(j)}` - The vector of parent values for node :math:`j`. * - :math:`K` - Cardinality of a categorical variable (number of classes). * - :math:`\mathcal{D}_j` - Marginal distribution of an exogenous continuous node :math:`j` (Gaussian, Student-t, Gamma, or Exponential). * - :math:`p_j` - Success probability of an exogenous Bernoulli node :math:`j`. * - :math:`\pi_{j,k}` - Class probability for category :math:`k` of an exogenous categorical node :math:`j`; satisfies :math:`\sum_k \pi_{j,k} = 1`. * - :math:`f_j(\cdot)` - Structural function mapping parents of :math:`j` to its mean signal. * - :math:`\epsilon_j` - Noise term for node :math:`j` (additive, multiplicative, or heteroskedastic). * - :math:`w_{jp}` - Structural weight from parent :math:`p` to child :math:`j`. * - :math:`d_{jp}` - Polynomial degree applied to parent :math:`p` in the structural form for child :math:`j`. * - :math:`w_j` - Single interaction weight in the ``interaction`` form. * - :math:`\mu_s` - Mean assigned to categorical-parent stratum :math:`s` in the ``stratum_means`` form. * - :math:`s(\mathbf{x}_{\mathrm{Pa}(j)})` - Stratum index determined by the categorical parent values. * - :math:`L, H` - Lower / upper bounds for random structural weight sampling (``random_weight_low``, ``random_weight_high``). * - :math:`m` - Near-zero exclusion radius for random weights (``random_weight_min_abs``). * - :math:`\sigma_j(\cdot)` - Heteroskedastic noise scale as a function of parents. * - :math:`z` - Standard normal draw, :math:`z \sim \mathcal{N}(0, 1)`. * - :math:`\eta_j` - Latent signal for an endogenous binary node before the logistic link. * - :math:`\sigma(t)` - Logistic sigmoid, :math:`\sigma(t) = 1 / (1 + e^{-t})`. * - :math:`\ell_{jk}` - Logit for class :math:`k` of an endogenous categorical node :math:`j`. * - :math:`b_{jk}` - Intercept for class :math:`k` in the logistic categorical model. * - :math:`g_{jpk}(X_p)` - Contribution of parent :math:`p` to logit :math:`\ell_{jk}`. * - :math:`\tau_{j1}, \dots, \tau_{j(K-1)}` - Cut-points used by the threshold categorical model for node :math:`j`. * - :math:`\perp\!\!\!\perp` - Conditional independence (used in the CI oracle section). The simulator draws from two independent random streams: one seeds the **data-generating process** (DAG topology, structural weights, intercepts, thresholds, stratum means) and the other seeds the **per-sample draws** (exogenous values, noise, Bernoulli/categorical sampling). They are configured via ``seed_structure`` and ``seed_data`` respectively, or jointly via a single ``seed`` (see the Seeding section in :doc:`config_examples`). Graph Model ----------- The simulator generates a DAG :math:`G = (V, E)` using one of: * ``custom``: user-defined node and edge sets * ``random``: random acyclic edges over ordered nodes Node Types ---------- Supported node types: * Continuous * Binary (values in :math:`\{0, 1\}`) * Categorical (values in :math:`\{0, \dots, K-1\}`, configurable cardinality :math:`K`) Exogenous Nodes (:math:`\mathrm{Pa}(j)=\varnothing`) ---------------------------------------------------- Continuous exogenous node: .. math:: X_j \sim \mathcal{D}_j where :math:`\mathcal{D}_j` is one of Gaussian, Student-t, Gamma, or Exponential. *Intuition:* draw each value of :math:`X_j` independently from the chosen marginal distribution. Binary exogenous node: .. math:: X_j \sim \mathrm{Bernoulli}(p_j) *Intuition:* a coin flip that returns 1 with probability :math:`p_j` and 0 otherwise. Categorical exogenous node: .. math:: X_j \sim \mathrm{Categorical}(\pi_{j,0}, \dots, \pi_{j,K-1}), \quad \sum_k \pi_{j,k}=1 *Intuition:* a weighted dice roll that returns class :math:`k` with probability :math:`\pi_{j,k}`. Endogenous Continuous Nodes --------------------------- General form: .. math:: X_j = f_j(X_{\mathrm{Pa}(j)}) + \epsilon_j *Intuition:* the value of :math:`X_j` is a deterministic function of its parents plus an independent noise draw. Supported structural forms :math:`f_j`: Linear: .. math:: f_j = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p *Intuition:* a weighted sum of the parent values. Polynomial: .. math:: f_j = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p^{d_{jp}} *Intuition:* a weighted sum where each parent is first raised to its own fixed power. Interaction: .. math:: f_j = w_j \prod_{p \in \mathrm{Pa}(j)} X_p *Intuition:* the product of all parent values, scaled by a single weight. Sigmoid (tanh): .. math:: f_j = w_j \cdot \tanh\!\left( \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p \right) *Intuition:* a smooth saturating nonlinearity — the weighted parent sum is squashed by ``tanh`` and rescaled by an output weight :math:`w_j`. Cosine: .. math:: f_j = \cos\!\left( \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p \right) Sine: .. math:: f_j = \sin\!\left( \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p \right) *Intuition:* the parent values are first combined linearly, then passed through a periodic nonlinearity. Useful for stress-testing kernel-based CI tests on oscillatory dependencies. Stratum-specific means (categorical parents to continuous child): .. math:: f_j = \mu_{s(\mathbf{x}_{\mathrm{Pa}(j)})} where :math:`s(\cdot)` indexes the categorical parent stratum. *Intuition:* look up a pre-assigned mean for the combination of categorical parent values observed at this row. When ``stratum_means`` is used with **mixed parents** (at least one categorical parent plus one or more metric parents), the structural function combines a stratum mean with a linear contribution from the metric parents: .. math:: f_j = \mu_{s(\mathbf{x}_{\mathrm{cat}})} + \sum_{p \in \text{metric parents}} w_{jp} X_p The metric weights can be set explicitly via ``functional_form.metric_weights`` (a dict per parent or a single number applied to all metric parents), or sampled from the random-weight distribution if omitted. Random structural weights ------------------------- When ``weights`` are omitted for ``linear``, ``polynomial``, or ``interaction``, the simulator samples weights from a configurable interval: .. math:: w \sim \mathrm{Uniform}(L, H) where ``L=random_weight_low`` and ``H=random_weight_high``. *Intuition:* when you don't pin a weight, it's drawn uniformly between :math:`L` and :math:`H`. If ``random_weight_min_abs = m > 0``, values in :math:`(-m, m)` are excluded and weights are sampled from: .. math:: [L, -m] \cup [m, H] This guarantees a minimum signal strength on every edge, giving you direct control over how strongly each parent influences its child rather than letting random sampling produce effectively-zero coefficients. *Intuition:* every edge contributes at least :math:`m` worth of signal, so no parent ends up silently muted by the random draw. Noise models: Additive: .. math:: X_j = f_j + \epsilon_j *Intuition:* the noise is added on top of the structural signal. Additive noise distributions accepted under ``noise_model.dist``: * ``gaussian`` (parameter ``std``) * ``student_t`` (parameters ``df``, ``scale``) * ``gamma`` (parameters ``shape``, ``scale``; centered to zero mean) * ``exponential`` (parameter ``scale``; centered to zero mean) * ``laplace`` (parameter ``scale``; zero-centered) * ``cauchy`` (parameter ``scale``; zero-centered, heavy-tailed) * ``uniform`` (parameter ``scale``; symmetric on :math:`[-\text{scale}, \text{scale}]`) Multiplicative: .. math:: X_j = f_j \cdot (1 + \epsilon_j') *Intuition:* the noise scales the structural signal, so the spread grows with the magnitude of :math:`f_j`. Multiplicative noise also supports ``gaussian``, ``student_t``, ``gamma``, and ``exponential`` distributions for :math:`\epsilon_j'`. Gamma and exponential factors are normalized to mean 1 so the structural signal is not biased; all factors are clipped to a small positive minimum for numerical safety. Heteroskedastic: .. math:: X_j = f_j + \sigma_j(X_{\mathrm{Pa}(j)}) z, \quad z \sim \mathcal{N}(0,1) *Intuition:* additive Gaussian noise whose standard deviation depends on the parent values. with registered :math:`\sigma_j(\cdot)` choices: * ``abs_first_parent`` (default when ``func`` is omitted) * ``abs_parent_plus_const`` * ``mean_abs_plus_const`` Post-nonlinear transform ------------------------ Any continuous endogenous node may apply a final element-wise nonlinearity to its output after the structural function and noise have been combined: .. math:: X_j \leftarrow g(X_j) where :math:`g` is selected by ``post_transform.name`` from the registry: .. list-table:: :header-rows: 1 :widths: 22 78 * - Name - Function * - ``tanh`` - :math:`\tanh(x)` * - ``sin`` - :math:`\sin(x)` * - ``cos`` - :math:`\cos(x)` * - ``exp_neg_abs`` - :math:`\exp(-|x|)` * - ``sqrt_abs`` - :math:`\sqrt{|x|}` * - ``relu`` - :math:`\max(0, x)` * - ``sign`` - :math:`\mathrm{sign}(x)` *Intuition:* the structural function and noise model determine the *signal*; ``post_transform`` warps that signal afterwards. This is how the literature typically realizes "post-nonlinear" DGPs (e.g., :math:`Y = \tanh(\text{linear}(X) + \epsilon)`). Endogenous Binary Nodes ----------------------- Binary children use a logistic link on the latent signal: .. math:: \eta_j = f_j(X_{\mathrm{Pa}(j)}) + \epsilon_j *Intuition:* build a continuous latent score from the parents and a noise term. .. math:: \Pr(X_j=1 \mid X_{\mathrm{Pa}(j)}) = \sigma(\eta_j), \quad \sigma(t)=\frac{1}{1+e^{-t}} *Intuition:* squash the latent score into a probability between 0 and 1. .. math:: X_j \sim \mathrm{Bernoulli}\!\left(\sigma(\eta_j)\right) *Intuition:* flip a biased coin with that probability to decide whether :math:`X_j` is 0 or 1. Endogenous Categorical Nodes ---------------------------- Two models are supported. 1. Logistic (multinomial softmax) .. math:: \ell_{jk} = b_{jk} + \sum_{p \in \mathrm{Pa}(j)} g_{jpk}(X_p) *Intuition:* compute one logit per class as an intercept plus parent contributions. .. math:: \Pr(X_j=k \mid X_{\mathrm{Pa}(j)}) = \frac{\exp(\ell_{jk})}{\sum_{m=0}^{K-1} \exp(\ell_{jm})} *Intuition:* convert the logits into class probabilities via softmax, then sample a class from that distribution. where :math:`g_{jpk}` depends on parent type: * continuous/binary parent: linear contribution per class — ``weights[parent]`` is a length-:math:`K` vector, one coefficient per child class. * categorical parent: class-specific lookup via a parent-category weight matrix of shape :math:`(K_{\text{parent}}, K)` — one row per parent class, one column per child class. 2. Threshold (continuous-to-categorical) .. math:: s_j = \sum_{p \in \mathrm{Pa}(j)} w_{jp} X_p *Intuition:* form a continuous score from a weighted sum of parents. .. math:: X_j = \mathrm{digitize}(s_j; \tau_{j1}, \dots, \tau_{j(K-1)}) *Intuition:* assign a class based on which bin the score falls into, defined by the cut-points :math:`\tau_{j1}, \dots, \tau_{j(K-1)}`. If thresholds are not provided, defaults are set from a theoretical Gaussian quantile grid, not from realized sample quantiles. By default: * ``threshold_loc = 0.0`` * ``threshold_scale`` is sampled from ``Uniform(0.5, 2.0)`` You can override both explicitly in config. Compatibility Matrix -------------------- .. list-table:: Supported combinations :header-rows: 1 :widths: 16 24 28 32 * - Child type - Parent types - Structural model - Noise / link * - Continuous - Continuous, binary, categorical, or mixed - ``linear``, ``polynomial``, ``interaction``, ``sigmoid``, ``cos``, ``sin``, ``stratum_means`` (+ optional ``post_transform``) - ``additive``, ``multiplicative``, ``heteroskedastic`` * - Binary - Continuous, binary, categorical, or mixed - ``linear``, ``polynomial``, ``interaction``, ``sigmoid``, ``cos``, ``sin``, ``stratum_means`` - Latent signal + noise, then logistic link and Bernoulli draw * - Categorical - Continuous, binary, categorical, or mixed - ``categorical_model = logistic`` or ``categorical_model = threshold`` - Softmax sampling (logistic) or threshold digitization For random structural weights, additional controls are: ``random_weight_low``, ``random_weight_high``, and ``random_weight_min_abs``. The same ``random_weight_min_abs`` exclusion is applied to auto-sampled categorical logistic weights as well. Forced uniform marginals ------------------------ Set ``simulation_params.force_uniform_marginals = true`` to override the default randomized marginals on exogenous nodes: * **Exogenous binary** (no explicit ``p``): the simulator uses :math:`p = 0.5` *and* generates an exact balanced 0/1 split rather than sampling :math:`X_j \sim \mathrm{Bernoulli}(0.5)`, eliminating small-sample fluctuations. * **Exogenous categorical** (no explicit ``probs``): the simulator uses uniform :math:`\pi_{j,k} = 1/K` *and* enforces equal counts per class (with a small remainder distributed at random). * **Exogenous continuous**: unchanged — distributional parameters are still sampled or read from the config. If ``p`` (binary) or ``probs`` (categorical) is explicitly provided, the flag is ignored for that node and your config wins. Random node-type assignment --------------------------- When ``graph_params.type = "random"`` and a node's ``type`` is not pinned in ``node_params``, the simulator samples a type per node according to: * ``simulation_params.binary_proportion`` (default ``0.4``) * ``simulation_params.categorical_proportion`` (default ``0.0``) * the remainder becomes continuous Categorical parents in metric forms ----------------------------------- Using categorical parents with ``linear``, ``polynomial``, or ``interaction`` is blocked by default (``categorical_parent_metric_form_policy = "error"``), because treating category codes as metric values can distort the intended DGP. Set ``categorical_parent_metric_form_policy = "stratum_means"`` to auto-redirect such cases to ``stratum_means``. For mixed parents (categorical + continuous/binary), redirected ``stratum_means`` uses: .. math:: f_j = \mu_{\text{cat-stratum}} + \sum_{p \in \text{metric parents}} w_p X_p where categorical parents select the stratum mean and metric parents contribute an additive linear term. Stratum means reproducibility ----------------------------- For ``stratum_means`` with multiple categorical parents, all strata are pre-enumerated and assigned means upfront, ensuring stable DGP parameters even for rare/unseen strata in a particular sample. CI Oracle (Ground Truth) ------------------------ If ``simulation_params.store_ci_oracle = true``, the simulator stores conditional independence truth values from DAG d-separation: .. math:: X \perp\!\!\!\perp Y \mid S \iff S \text{ is a d-separator of } X \text{ and } Y \text{ in } G for conditioning sets up to ``ci_oracle_max_cond_set``. *Intuition:* the oracle records, for every triple :math:`(X, Y, S)`, whether the DAG structure forces :math:`X` and :math:`Y` to be independent given :math:`S` — useful as ground truth for evaluating CI tests.