Configuration Examples ====================== This page shows practical config templates for all major simulator options. Seeding ------- The simulator uses two independent random streams: * ``rng_structure`` controls the **data-generating process** itself — random DAG topology, sampled structural weights, intercepts, thresholds, and stratum means. * ``rng_data`` controls the **per-sample draws** given that DGP — exogenous variable values, noise draws, and Bernoulli/categorical sampling. You can seed them in two ways: * **Single** ``seed`` (convenience): ``seed_structure`` is set to ``seed`` and ``seed_data`` is derived as ``seed + 1`` so the two streams stay independent while remaining fully reproducible. Use this for one-off examples and quickstarts. * **Explicit** ``seed_structure`` and ``seed_data``: seed each stream independently. This is the recommended form for benchmarks because it lets you decouple structure from data — for example, hold ``seed_structure`` fixed and vary ``seed_data`` to measure how a CI test behaves on different finite samples from the *same* DGP. The minimal custom-DAG example below uses the single-seed form; the random-DAG example uses the explicit pair. Minimal custom DAG ------------------ .. code-block:: json { "simulation_params": { "n_samples": 200, "seed": 42 }, "graph_params": { "type": "custom", "nodes": ["X", "Y", "Z1"], "edges": [["X", "Z1"], ["Y", "Z1"]] } } Random DAG ---------- .. code-block:: json { "simulation_params": { "n_samples": 300, "seed_structure": 123, "seed_data": 124 }, "graph_params": { "type": "random", "n_nodes": 6, "edge_prob": 0.35 } } When ``type = "random"`` and node types are not pinned in ``node_params``, the simulator samples a type per node using ``binary_proportion`` (default ``0.4``) and ``categorical_proportion`` (default ``0.0``); the remainder become continuous. Override either to control the type mix: .. code-block:: json { "simulation_params": { "n_samples": 300, "seed_structure": 123, "seed_data": 124, "binary_proportion": 0.2, "categorical_proportion": 0.3 }, "graph_params": { "type": "random", "n_nodes": 6, "edge_prob": 0.35 } } Random weights with near-zero exclusion (signal-strength control) ----------------------------------------------------------------- .. code-block:: json { "simulation_params": { "n_samples": 500, "seed_structure": 201, "seed_data": 202, "random_weight_low": -1.5, "random_weight_high": 1.5, "random_weight_min_abs": 0.1 }, "graph_params": { "type": "custom", "nodes": ["X1", "X2", "X3", "Y"], "edges": [["X1", "Y"], ["X2", "Y"], ["X3", "Y"]] }, "node_params": { "X1": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "X2": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "X3": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "Y": { "type": "continuous", "functional_form": { "name": "linear" }, "noise_model": { "name": "additive", "dist": "gaussian", "std": 0.2 } } } } In this setup, omitted linear weights are sampled from ``[-1.5, -0.1] U [0.1, 1.5]``, guaranteeing every edge contributes a minimum amount of signal rather than being effectively muted by a near-zero draw. Categorical parent with metric form policy override --------------------------------------------------- By default, categorical parents with ``linear``/``polynomial``/``interaction`` raise an error. To auto-redirect to ``stratum_means`` (including mixed-parent cases), use: .. code-block:: json { "simulation_params": { "n_samples": 300, "seed": 303, "categorical_parent_metric_form_policy": "stratum_means" }, "graph_params": { "type": "custom", "nodes": ["C", "Y"], "edges": [["C", "Y"]] }, "node_params": { "C": { "type": "categorical", "cardinality": 4 }, "Y": { "type": "continuous", "functional_form": { "name": "linear" }, "noise_model": { "name": "additive", "dist": "gaussian", "std": 0.2 } } } } Exogenous node distributions ---------------------------- .. code-block:: json { "simulation_params": { "n_samples": 500, "seed": 1 }, "graph_params": { "type": "custom", "nodes": ["G", "T", "Ga", "E", "B", "C"], "edges": [] }, "node_params": { "G": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0.0, "std": 1.0 } }, "T": { "type": "continuous", "distribution": { "name": "student_t", "df": 4 } }, "Ga": { "type": "continuous", "distribution": { "name": "gamma", "shape": 2.0, "scale": 1.0 } }, "E": { "type": "continuous", "distribution": { "name": "exponential", "scale": 1.2 } }, "B": { "type": "binary", "distribution": { "name": "bernoulli", "p": 0.35 } }, "C": { "type": "categorical", "cardinality": 5, "distribution": { "probs": [0.1, 0.2, 0.3, 0.2, 0.2] } } } } Continuous child with linear / polynomial / interaction ------------------------------------------------------- .. code-block:: json { "simulation_params": { "n_samples": 300, "seed": 10 }, "graph_params": { "type": "custom", "nodes": ["X1", "X2", "Y_lin", "Y_poly", "Y_int"], "edges": [["X1", "Y_lin"], ["X2", "Y_lin"], ["X1", "Y_poly"], ["X2", "Y_poly"], ["X1", "Y_int"], ["X2", "Y_int"]] }, "node_params": { "X1": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "X2": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "Y_lin": { "type": "continuous", "functional_form": { "name": "linear", "weights": { "X1": 1.2, "X2": -0.7 } }, "noise_model": { "name": "additive", "dist": "gaussian", "std": 0.5 } }, "Y_poly": { "type": "continuous", "functional_form": { "name": "polynomial", "weights": { "X1": 1.0, "X2": 0.6 }, "degrees": { "X1": 3, "X2": 2 } }, "noise_model": { "name": "additive", "dist": "student_t", "df": 5, "scale": 0.3 } }, "Y_int": { "type": "continuous", "functional_form": { "name": "interaction", "weights": { "interaction": 0.8 } }, "noise_model": { "name": "multiplicative", "dist": "gaussian", "std": 0.2 } } } } Continuous child with sigmoid / cos / sin ----------------------------------------- .. code-block:: json { "simulation_params": { "n_samples": 300, "seed": 11 }, "graph_params": { "type": "custom", "nodes": ["X1", "X2", "Y_sig", "Y_cos", "Y_sin"], "edges": [["X1", "Y_sig"], ["X2", "Y_sig"], ["X1", "Y_cos"], ["X2", "Y_cos"], ["X1", "Y_sin"], ["X2", "Y_sin"]] }, "node_params": { "X1": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "X2": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "Y_sig": { "type": "continuous", "functional_form": { "name": "sigmoid", "weights": { "X1": 1.0, "X2": -0.5 }, "output_weight": 1.5 }, "noise_model": { "name": "additive", "dist": "gaussian", "std": 0.3 } }, "Y_cos": { "type": "continuous", "functional_form": { "name": "cos", "weights": { "X1": 1.0, "X2": 0.5 } }, "noise_model": { "name": "additive", "dist": "gaussian", "std": 0.2 } }, "Y_sin": { "type": "continuous", "functional_form": { "name": "sin", "weights": { "X1": 0.8, "X2": 1.1 } }, "noise_model": { "name": "additive", "dist": "gaussian", "std": 0.2 } } } } For ``sigmoid``, ``output_weight`` (the post-tanh scaling :math:`w_j`) and the per-parent ``weights`` are sampled from the random-weight distribution if omitted. Post-nonlinear transform ------------------------ .. code-block:: json { "simulation_params": { "n_samples": 300, "seed": 12 }, "graph_params": { "type": "custom", "nodes": ["X", "Y"], "edges": [["X", "Y"]] }, "node_params": { "X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "Y": { "type": "continuous", "functional_form": { "name": "linear", "weights": { "X": 1.5 } }, "noise_model": { "name": "additive", "dist": "gaussian", "std": 0.4 }, "post_transform": { "name": "tanh" } } } } Replace ``"tanh"`` with any of ``sin``, ``cos``, ``exp_neg_abs``, ``sqrt_abs``, ``relu``, ``sign``. The transform is applied element-wise after the structural function and noise have been combined. Noise model variants -------------------- .. code-block:: json { "simulation_params": { "n_samples": 250, "seed": 22 }, "graph_params": { "type": "custom", "nodes": ["X", "Y_add", "Y_mult", "Y_hetero"], "edges": [["X", "Y_add"], ["X", "Y_mult"], ["X", "Y_hetero"]] }, "node_params": { "X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "Y_add": { "type": "continuous", "functional_form": { "name": "linear", "weights": { "X": 1.0 } }, "noise_model": { "name": "additive", "dist": "gamma", "shape": 2.0, "scale": 0.6 } }, "Y_mult": { "type": "continuous", "functional_form": { "name": "linear", "weights": { "X": 1.0 } }, "noise_model": { "name": "multiplicative", "dist": "exponential", "scale": 1.0 } }, "Y_hetero": { "type": "continuous", "functional_form": { "name": "linear", "weights": { "X": 1.0 } }, "noise_model": { "name": "heteroskedastic", "func": "abs_parent_plus_const" } } } } Heavy-tailed and uniform additive noise --------------------------------------- In addition to ``gaussian``, ``student_t``, ``gamma``, and ``exponential``, the additive noise model accepts ``laplace``, ``cauchy``, and ``uniform``. All three are zero-centered and parameterized by ``scale``: .. code-block:: json { "simulation_params": { "n_samples": 400, "seed": 23 }, "graph_params": { "type": "custom", "nodes": ["X", "Y_lap", "Y_cau", "Y_uni"], "edges": [["X", "Y_lap"], ["X", "Y_cau"], ["X", "Y_uni"]] }, "node_params": { "X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "Y_lap": { "type": "continuous", "functional_form": { "name": "linear", "weights": { "X": 1.0 } }, "noise_model": { "name": "additive", "dist": "laplace", "scale": 0.7 } }, "Y_cau": { "type": "continuous", "functional_form": { "name": "linear", "weights": { "X": 1.0 } }, "noise_model": { "name": "additive", "dist": "cauchy", "scale": 0.3 } }, "Y_uni": { "type": "continuous", "functional_form": { "name": "linear", "weights": { "X": 1.0 } }, "noise_model": { "name": "additive", "dist": "uniform", "scale": 1.0 } } } } Multiplicative noise also accepts ``student_t``, ``gamma``, and ``exponential`` in addition to ``gaussian``; gamma and exponential factors are normalized to mean 1 to avoid biasing the structural signal. Forced uniform marginals ------------------------ Set ``force_uniform_marginals`` to make exogenous binary nodes draw an exact 50/50 split and exogenous categorical nodes use exactly equal counts per class (when their ``p`` / ``probs`` are not explicitly set): .. code-block:: json { "simulation_params": { "n_samples": 200, "seed": 24, "force_uniform_marginals": true }, "graph_params": { "type": "custom", "nodes": ["B", "C", "Y"], "edges": [["B", "Y"], ["C", "Y"]] }, "node_params": { "B": { "type": "binary" }, "C": { "type": "categorical", "cardinality": 4 }, "Y": { "type": "continuous", "functional_form": { "name": "stratum_means" }, "noise_model": { "name": "additive", "dist": "gaussian", "std": 0.3 } } } } This is convenient for constructing balanced benchmark scenarios without worrying about small-sample fluctuations in the exogenous strata. Binary child ------------ .. code-block:: json { "simulation_params": { "n_samples": 300, "seed": 33 }, "graph_params": { "type": "custom", "nodes": ["X", "Z", "B"], "edges": [["X", "B"], ["Z", "B"]] }, "node_params": { "X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "Z": { "type": "binary", "distribution": { "name": "bernoulli", "p": 0.4 } }, "B": { "type": "binary", "functional_form": { "name": "linear", "weights": { "X": 1.3, "Z": 0.9 } }, "noise_model": { "name": "additive", "dist": "gaussian", "std": 0.5 } } } } Categorical child (logistic softmax) ------------------------------------ .. code-block:: json { "simulation_params": { "n_samples": 400, "seed_structure": 40, "seed_data": 41 }, "graph_params": { "type": "custom", "nodes": ["X", "B", "C"], "edges": [["X", "C"], ["B", "C"]] }, "node_params": { "X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "B": { "type": "binary", "distribution": { "name": "bernoulli", "p": 0.5 } }, "C": { "type": "categorical", "cardinality": 3, "categorical_model": { "name": "logistic", "intercepts": [0.0, 0.0, 0.0], "weights": { "X": [0.9, -0.2, -0.7], "B": [-0.4, 0.8, -0.3] } } } } } Continuous to categorical (threshold) ------------------------------------- .. code-block:: json { "simulation_params": { "n_samples": 350, "seed": 50 }, "graph_params": { "type": "custom", "nodes": ["X", "C"], "edges": [["X", "C"]] }, "node_params": { "X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "C": { "type": "categorical", "cardinality": 5, "categorical_model": { "name": "threshold", "weights": { "X": 1.0 }, "thresholds": [-1.0, -0.2, 0.4, 1.1] } } } } To use fixed theoretical threshold placement: .. code-block:: json { "node_params": { "C": { "type": "categorical", "cardinality": 5, "categorical_model": { "name": "threshold", "threshold_loc": 0.0, "threshold_scale": 1.0 } } } } Categorical to continuous (stratum-specific means) -------------------------------------------------- .. code-block:: json { "simulation_params": { "n_samples": 300, "seed": 60 }, "graph_params": { "type": "custom", "nodes": ["C1", "C2", "Y"], "edges": [["C1", "Y"], ["C2", "Y"]] }, "node_params": { "C1": { "type": "categorical", "cardinality": 3 }, "C2": { "type": "categorical", "cardinality": 2 }, "Y": { "type": "continuous", "functional_form": { "name": "stratum_means", "default_mean": 0.0, "strata_means": { "C1=0|C2=0": -1.5, "C1=1|C2=0": 0.2, "C1=2|C2=1": 1.8 } }, "noise_model": { "name": "additive", "dist": "gaussian", "std": 0.15 } } } } Mixed parents under stratum_means --------------------------------- When ``stratum_means`` has both categorical and metric parents, you can supply ``metric_weights`` (a per-parent dict or a single number) for the metric contribution. Omit it to have weights sampled from the random-weight distribution. .. code-block:: json { "simulation_params": { "n_samples": 300, "seed": 61, "categorical_parent_metric_form_policy": "stratum_means" }, "graph_params": { "type": "custom", "nodes": ["C", "X", "Y"], "edges": [["C", "Y"], ["X", "Y"]] }, "node_params": { "C": { "type": "categorical", "cardinality": 3 }, "X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } }, "Y": { "type": "continuous", "functional_form": { "name": "stratum_means", "strata_means": { "C=0": -1.0, "C=1": 0.0, "C=2": 1.5 }, "metric_weights": { "X": 0.8 } }, "noise_model": { "name": "additive", "dist": "gaussian", "std": 0.2 } } } } CI oracle output ---------------- .. code-block:: json { "simulation_params": { "n_samples": 250, "seed": 70, "store_ci_oracle": true, "ci_oracle_max_cond_set": 2 }, "graph_params": { "type": "custom", "nodes": ["X", "Y", "Z"], "edges": [["X", "Z"], ["Y", "Z"]] } } When ``store_ci_oracle`` is enabled, ``simulate()`` also returns a ``ci_oracle`` list with entries of the form: .. code-block:: json { "x": "X", "y": "Y", "conditioning_set": ["Z"], "is_independent": false } The oracle iterates over every ordered pair :math:`(X, Y)` and every conditioning subset :math:`S` of size :math:`\le` ``ci_oracle_max_cond_set`` (default ``2``); both independent and dependent triples are recorded. simulate() return value ----------------------- ``CausalDataGenerator(config).simulate()`` returns a dict with the following keys: * ``data``: a ``pandas.DataFrame`` of shape ``(n_samples, n_nodes)`` containing the simulated values. * ``dag``: a ``networkx.DiGraph`` representing the realized DAG. * ``parametrization``: a deep copy of the input config with every randomly-sampled value (weights, intercepts, thresholds, stratum means, noise parameters, marginals, derived ``seed_structure`` / ``seed_data``, inferred node types) filled in. Suitable for round-tripping to JSON to reproduce the exact DGP. * ``ci_oracle`` (only present when ``store_ci_oracle = true``): the list of oracle entries described above.