Configuration Examples
This page shows practical config templates for all major simulator options.
Seeding
The simulator uses two independent random streams:
rng_structure controls the data-generating process (DGP) itself: random DAG topology, sampled structural weights, intercepts, thresholds, and stratum means.
rng_data controls the per-sample draws given that DGP: exogenous variable values, noise draws, and Bernoulli/categorical sampling.
You can seed them in two ways:
Single seed (convenience): seed_structure is set to seed and seed_data is derived as seed + 1, so the two streams stay independent while remaining fully reproducible. Use this for one-off examples and quickstarts.
Explicit seed_structure and seed_data: seed each stream independently. This is the recommended form for benchmarks because it lets you decouple structure from data. For example, hold seed_structure fixed and vary seed_data to measure how a CI test behaves on different finite samples from the same DGP.
The minimal custom-DAG example below uses the single-seed form; the random-DAG example uses the explicit pair.
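The two seeding forms can be sketched in a few lines of Python. This is an illustration only, not the simulator's code: the helper names resolve_seeds and make_streams are hypothetical, the standard-library random.Random stands in for whatever generator the simulator uses, and the behaviour when both forms are mixed in one config is not covered here.

```python
import random


def resolve_seeds(simulation_params):
    """Return (seed_structure, seed_data) for the two documented seeding forms.

    Hypothetical helper: only the single-seed and explicit-pair forms are handled.
    """
    if "seed_structure" in simulation_params and "seed_data" in simulation_params:
        # Explicit form: each stream gets its own seed.
        return simulation_params["seed_structure"], simulation_params["seed_data"]
    if "seed" in simulation_params:
        # Single-seed form: seed_data is derived as seed + 1.
        seed = simulation_params["seed"]
        return seed, seed + 1
    raise ValueError("expected 'seed' or both 'seed_structure' and 'seed_data'")


def make_streams(simulation_params):
    """Build two separate generators, one per stream (stand-ins for rng_structure / rng_data)."""
    seed_structure, seed_data = resolve_seeds(simulation_params)
    return random.Random(seed_structure), random.Random(seed_data)


rng_structure, rng_data = make_streams({"seed": 42})
print(resolve_seeds({"seed": 42}))
print(resolve_seeds({"seed_structure": 123, "seed_data": 124}))
```

Holding seed_structure fixed while looping over several seed_data values gives repeated finite samples from one DGP, which is the benchmark pattern described above.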
Minimal custom DAG
{
"simulation_params": {
"n_samples": 200,
"seed": 42
},
"graph_params": {
"type": "custom",
"nodes": ["X", "Y", "Z1"],
"edges": [["X", "Z1"], ["Y", "Z1"]]
}
}
Random DAG
{
"simulation_params": {
"n_samples": 300,
"seed_structure": 123,
"seed_data": 124
},
"graph_params": {
"type": "random",
"n_nodes": 6,
"edge_prob": 0.35
}
}
When type = "random" and node types are not pinned in node_params, the
simulator samples a type per node using binary_proportion (default 0.4)
and categorical_proportion (default 0.0); the remainder become
continuous. Override either to control the type mix:
{
"simulation_params": {
"n_samples": 300,
"seed_structure": 123,
"seed_data": 124,
"binary_proportion": 0.2,
"categorical_proportion": 0.3
},
"graph_params": { "type": "random", "n_nodes": 6, "edge_prob": 0.35 }
}
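To make the effect of the two proportions concrete, here is a minimal sketch of per-node type sampling. It assumes each node's type is drawn independently with the given probabilities. Whether the simulator draws this way or allocates exact counts is not stated on this page, and sample_node_types is a hypothetical name.

```python
import random


def sample_node_types(n_nodes, binary_proportion=0.4, categorical_proportion=0.0, seed=0):
    """Draw one type per node; the leftover probability mass becomes 'continuous'.

    Assumption: independent per-node draws (not confirmed by the documentation).
    """
    rng = random.Random(seed)
    types = []
    for _ in range(n_nodes):
        u = rng.random()  # uniform on [0, 1)
        if u < binary_proportion:
            types.append("binary")
        elif u < binary_proportion + categorical_proportion:
            types.append("categorical")
        else:
            types.append("continuous")
    return types


# Mirrors the override above: 20% binary, 30% categorical, remaining 50% continuous.
print(sample_node_types(6, binary_proportion=0.2, categorical_proportion=0.3, seed=123))
```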
Random weights with near-zero exclusion (signal-strength control)
{
"simulation_params": {
"n_samples": 500,
"seed_structure": 201,
"seed_data": 202,
"random_weight_low": -1.5,
"random_weight_high": 1.5,
"random_weight_min_abs": 0.1
},
"graph_params": {
"type": "custom",
"nodes": ["X1", "X2", "X3", "Y"],
"edges": [["X1", "Y"], ["X2", "Y"], ["X3", "Y"]]
},
"node_params": {
"X1": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"X2": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"X3": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"Y": {
"type": "continuous",
"functional_form": { "name": "linear" },
"noise_model": { "name": "additive", "dist": "gaussian", "std": 0.2 }
}
}
}
In this setup, omitted linear weights are sampled from
\([-1.5, -0.1] \cup [0.1, 1.5]\), guaranteeing every edge contributes a minimum
amount of signal rather than being effectively muted by a near-zero draw.
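One simple way to draw from such a two-sided range is rejection sampling: draw uniformly from [random_weight_low, random_weight_high] and retry while the draw is too close to zero. The sketch below assumes that approach for illustration; the simulator may implement the exclusion differently, and sample_weight is a hypothetical name.

```python
import random


def sample_weight(rng, low=-1.5, high=1.5, min_abs=0.1):
    """Uniform draw from [low, high] with the band (-min_abs, min_abs) excluded.

    Illustrative rejection sampler, not the simulator's implementation.
    """
    if max(abs(low), abs(high)) <= min_abs:
        raise ValueError("the range leaves no room for weights with |w| >= min_abs")
    while True:
        w = rng.uniform(low, high)
        if abs(w) >= min_abs:
            return w


rng = random.Random(201)
weights = {parent: sample_weight(rng) for parent in ["X1", "X2", "X3"]}
print(weights)
```

Raising random_weight_min_abs strengthens the weakest possible edge, which is the signal-strength control this section refers to.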
Categorical parent with metric form policy override
By default, a child node that uses the linear, polynomial, or interaction
form with a categorical parent raises an error. To redirect such nodes
automatically to stratum_means (including mixed-parent cases), set
categorical_parent_metric_form_policy:
{
"simulation_params": {
"n_samples": 300,
"seed": 303,
"categorical_parent_metric_form_policy": "stratum_means"
},
"graph_params": {
"type": "custom",
"nodes": ["C", "Y"],
"edges": [["C", "Y"]]
},
"node_params": {
"C": { "type": "categorical", "cardinality": 4 },
"Y": {
"type": "continuous",
"functional_form": { "name": "linear" },
"noise_model": { "name": "additive", "dist": "gaussian", "std": 0.2 }
}
}
}
Exogenous node distributions
{
"simulation_params": {
"n_samples": 500,
"seed": 1
},
"graph_params": {
"type": "custom",
"nodes": ["G", "T", "Ga", "E", "B", "C"],
"edges": []
},
"node_params": {
"G": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0.0, "std": 1.0 } },
"T": { "type": "continuous", "distribution": { "name": "student_t", "df": 4 } },
"Ga": { "type": "continuous", "distribution": { "name": "gamma", "shape": 2.0, "scale": 1.0 } },
"E": { "type": "continuous", "distribution": { "name": "exponential", "scale": 1.2 } },
"B": { "type": "binary", "distribution": { "name": "bernoulli", "p": 0.35 } },
"C": {
"type": "categorical",
"cardinality": 5,
"distribution": { "probs": [0.1, 0.2, 0.3, 0.2, 0.2] }
}
}
}
Continuous child with linear / polynomial / interaction
{
"simulation_params": { "n_samples": 300, "seed": 10 },
"graph_params": {
"type": "custom",
"nodes": ["X1", "X2", "Y_lin", "Y_poly", "Y_int"],
"edges": [["X1", "Y_lin"], ["X2", "Y_lin"], ["X1", "Y_poly"], ["X2", "Y_poly"], ["X1", "Y_int"], ["X2", "Y_int"]]
},
"node_params": {
"X1": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"X2": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"Y_lin": {
"type": "continuous",
"functional_form": { "name": "linear", "weights": { "X1": 1.2, "X2": -0.7 } },
"noise_model": { "name": "additive", "dist": "gaussian", "std": 0.5 }
},
"Y_poly": {
"type": "continuous",
"functional_form": { "name": "polynomial", "weights": { "X1": 1.0, "X2": 0.6 }, "degrees": { "X1": 3, "X2": 2 } },
"noise_model": { "name": "additive", "dist": "student_t", "df": 5, "scale": 0.3 }
},
"Y_int": {
"type": "continuous",
"functional_form": { "name": "interaction", "weights": { "interaction": 0.8 } },
"noise_model": { "name": "multiplicative", "dist": "gaussian", "std": 0.2 }
}
}
}
Continuous child with sigmoid / cos / sin
{
"simulation_params": { "n_samples": 300, "seed": 11 },
"graph_params": {
"type": "custom",
"nodes": ["X1", "X2", "Y_sig", "Y_cos", "Y_sin"],
"edges": [["X1", "Y_sig"], ["X2", "Y_sig"], ["X1", "Y_cos"], ["X2", "Y_cos"], ["X1", "Y_sin"], ["X2", "Y_sin"]]
},
"node_params": {
"X1": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"X2": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"Y_sig": {
"type": "continuous",
"functional_form": { "name": "sigmoid", "weights": { "X1": 1.0, "X2": -0.5 }, "output_weight": 1.5 },
"noise_model": { "name": "additive", "dist": "gaussian", "std": 0.3 }
},
"Y_cos": {
"type": "continuous",
"functional_form": { "name": "cos", "weights": { "X1": 1.0, "X2": 0.5 } },
"noise_model": { "name": "additive", "dist": "gaussian", "std": 0.2 }
},
"Y_sin": {
"type": "continuous",
"functional_form": { "name": "sin", "weights": { "X1": 0.8, "X2": 1.1 } },
"noise_model": { "name": "additive", "dist": "gaussian", "std": 0.2 }
}
}
}
For sigmoid, output_weight (the post-tanh scaling \(w_j\)) and the
per-parent weights are sampled from the random-weight distribution if
omitted.
Post-nonlinear transform
{
"simulation_params": { "n_samples": 300, "seed": 12 },
"graph_params": {
"type": "custom",
"nodes": ["X", "Y"],
"edges": [["X", "Y"]]
},
"node_params": {
"X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"Y": {
"type": "continuous",
"functional_form": { "name": "linear", "weights": { "X": 1.5 } },
"noise_model": { "name": "additive", "dist": "gaussian", "std": 0.4 },
"post_transform": { "name": "tanh" }
}
}
}
Replace "tanh" with any of sin, cos, exp_neg_abs, sqrt_abs,
relu, sign. The transform is applied element-wise after the structural
function and noise have been combined.
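The order of operations can be sketched as follows. The transform definitions are inferred from their names (for example, exp_neg_abs as exp(-|v|) and sqrt_abs as sqrt(|v|)) and are assumptions rather than the simulator's confirmed formulas. POST_TRANSFORMS and apply_post_transform are hypothetical names.

```python
import math

# Assumed element-wise definitions, inferred from the transform names.
POST_TRANSFORMS = {
    "tanh": math.tanh,
    "sin": math.sin,
    "cos": math.cos,
    "exp_neg_abs": lambda v: math.exp(-abs(v)),
    "sqrt_abs": lambda v: math.sqrt(abs(v)),
    "relu": lambda v: max(0.0, v),
    "sign": lambda v: (v > 0) - (v < 0),
}


def apply_post_transform(name, values):
    """Apply the named transform to each value independently."""
    f = POST_TRANSFORMS[name]
    return [f(v) for v in values]


# Structural function and noise are combined first, then the transform is applied.
x = [-2.0, 0.0, 2.0]
noise = [0.1, -0.1, 0.0]
y_raw = [1.5 * xi + ei for xi, ei in zip(x, noise)]  # linear form with weight 1.5
y = apply_post_transform("tanh", y_raw)
print(y)
```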
Noise model variants
{
"simulation_params": { "n_samples": 250, "seed": 22 },
"graph_params": {
"type": "custom",
"nodes": ["X", "Y_add", "Y_mult", "Y_hetero"],
"edges": [["X", "Y_add"], ["X", "Y_mult"], ["X", "Y_hetero"]]
},
"node_params": {
"X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"Y_add": {
"type": "continuous",
"functional_form": { "name": "linear", "weights": { "X": 1.0 } },
"noise_model": { "name": "additive", "dist": "gamma", "shape": 2.0, "scale": 0.6 }
},
"Y_mult": {
"type": "continuous",
"functional_form": { "name": "linear", "weights": { "X": 1.0 } },
"noise_model": { "name": "multiplicative", "dist": "exponential", "scale": 1.0 }
},
"Y_hetero": {
"type": "continuous",
"functional_form": { "name": "linear", "weights": { "X": 1.0 } },
"noise_model": { "name": "heteroskedastic", "func": "abs_parent_plus_const" }
}
}
}
Heavy-tailed and uniform additive noise
In addition to gaussian, student_t, gamma, and exponential, the
additive noise model accepts laplace, cauchy, and uniform. All
three are zero-centered and parameterized by scale:
{
"simulation_params": { "n_samples": 400, "seed": 23 },
"graph_params": {
"type": "custom",
"nodes": ["X", "Y_lap", "Y_cau", "Y_uni"],
"edges": [["X", "Y_lap"], ["X", "Y_cau"], ["X", "Y_uni"]]
},
"node_params": {
"X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"Y_lap": {
"type": "continuous",
"functional_form": { "name": "linear", "weights": { "X": 1.0 } },
"noise_model": { "name": "additive", "dist": "laplace", "scale": 0.7 }
},
"Y_cau": {
"type": "continuous",
"functional_form": { "name": "linear", "weights": { "X": 1.0 } },
"noise_model": { "name": "additive", "dist": "cauchy", "scale": 0.3 }
},
"Y_uni": {
"type": "continuous",
"functional_form": { "name": "linear", "weights": { "X": 1.0 } },
"noise_model": { "name": "additive", "dist": "uniform", "scale": 1.0 }
}
}
}
Multiplicative noise also accepts student_t, gamma, and exponential
in addition to gaussian; gamma and exponential factors are normalized to
mean 1 to avoid biasing the structural signal.
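The mean-1 normalization follows from the distribution means: a gamma variable with shape \(k\) and scale \(\theta\) has mean \(k\theta\), and an exponential variable with scale \(\theta\) has mean \(\theta\), so dividing by the mean gives a factor with expectation 1. A minimal sketch using the standard library (the function names are hypothetical, and how the simulator applies the factor to the signal is not shown here):

```python
import random


def gamma_factor(rng, shape, scale):
    """Gamma draw divided by its mean (shape * scale), so the factor averages 1."""
    # random.gammavariate(alpha, beta): alpha is the shape, beta is the scale.
    return rng.gammavariate(shape, scale) / (shape * scale)


def exponential_factor(rng, scale):
    """Exponential draw divided by its mean (scale), so the factor averages 1."""
    # random.expovariate takes the rate, i.e. 1 / scale.
    return rng.expovariate(1.0 / scale) / scale


rng = random.Random(22)
n = 20000
gamma_mean = sum(gamma_factor(rng, 2.0, 0.6) for _ in range(n)) / n
expo_mean = sum(exponential_factor(rng, 1.0) for _ in range(n)) / n
print(gamma_mean, expo_mean)  # both sample means should be close to 1
```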
Forced uniform marginals
Set force_uniform_marginals to make exogenous binary nodes draw an exact
50/50 split and exogenous categorical nodes use exactly equal counts per class
(when their p / probs are not explicitly set):
{
"simulation_params": {
"n_samples": 200,
"seed": 24,
"force_uniform_marginals": true
},
"graph_params": {
"type": "custom",
"nodes": ["B", "C", "Y"],
"edges": [["B", "Y"], ["C", "Y"]]
},
"node_params": {
"B": { "type": "binary" },
"C": { "type": "categorical", "cardinality": 4 },
"Y": {
"type": "continuous",
"functional_form": { "name": "stratum_means" },
"noise_model": { "name": "additive", "dist": "gaussian", "std": 0.3 }
}
}
}
This is convenient for constructing balanced benchmark scenarios without worrying about small-sample fluctuations in the exogenous strata.
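The difference from ordinary sampling is that class counts are fixed rather than random. One way to get exact counts is to build the label vector deterministically and then shuffle it, as sketched below. The function name balanced_labels is hypothetical, and the handling of a remainder when n_samples is not divisible by the number of classes is an assumption.

```python
import random
from collections import Counter


def balanced_labels(n_samples, n_classes, rng):
    """Return labels 0..n_classes-1 with (near-)equal counts, in random order.

    Assumption: any remainder is spread over the first classes, one extra each.
    """
    base, extra = divmod(n_samples, n_classes)
    labels = []
    for k in range(n_classes):
        labels.extend([k] * (base + (1 if k < extra else 0)))
    rng.shuffle(labels)  # randomize the order without changing the counts
    return labels


rng = random.Random(24)
b = balanced_labels(200, 2, rng)  # binary node: exact 50/50 split
c = balanced_labels(200, 4, rng)  # categorical node with cardinality 4
print(Counter(b), Counter(c))
```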
Binary child
{
"simulation_params": { "n_samples": 300, "seed": 33 },
"graph_params": {
"type": "custom",
"nodes": ["X", "Z", "B"],
"edges": [["X", "B"], ["Z", "B"]]
},
"node_params": {
"X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"Z": { "type": "binary", "distribution": { "name": "bernoulli", "p": 0.4 } },
"B": {
"type": "binary",
"functional_form": { "name": "linear", "weights": { "X": 1.3, "Z": 0.9 } },
"noise_model": { "name": "additive", "dist": "gaussian", "std": 0.5 }
}
}
}
Categorical child (logistic softmax)
{
"simulation_params": { "n_samples": 400, "seed_structure": 40, "seed_data": 41 },
"graph_params": {
"type": "custom",
"nodes": ["X", "B", "C"],
"edges": [["X", "C"], ["B", "C"]]
},
"node_params": {
"X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"B": { "type": "binary", "distribution": { "name": "bernoulli", "p": 0.5 } },
"C": {
"type": "categorical",
"cardinality": 3,
"categorical_model": {
"name": "logistic",
"intercepts": [0.0, 0.0, 0.0],
"weights": {
"X": [0.9, -0.2, -0.7],
"B": [-0.4, 0.8, -0.3]
}
}
}
}
}
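To see how the per-class intercepts and weights translate into class probabilities, the sketch below applies a standard softmax to the values from the config above. It assumes the logit of class \(k\) is the intercept plus the weighted sum of parent values, and softmax_probs is a hypothetical name rather than part of the simulator's API.

```python
import math


def softmax_probs(intercepts, weights, parents):
    """Class probabilities from per-class intercepts and per-parent weight vectors."""
    logits = list(intercepts)
    for name, per_class in weights.items():
        for k, w in enumerate(per_class):
            logits[k] += w * parents[name]
    # Subtract the maximum logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


intercepts = [0.0, 0.0, 0.0]
weights = {"X": [0.9, -0.2, -0.7], "B": [-0.4, 0.8, -0.3]}

# With X = 1 and B = 0 the logits are [0.9, -0.2, -0.7], so class 0 is the most likely.
probs = softmax_probs(intercepts, weights, {"X": 1.0, "B": 0})
print(probs)
```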
Continuous to categorical (threshold)
{
"simulation_params": { "n_samples": 350, "seed": 50 },
"graph_params": {
"type": "custom",
"nodes": ["X", "C"],
"edges": [["X", "C"]]
},
"node_params": {
"X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"C": {
"type": "categorical",
"cardinality": 5,
"categorical_model": {
"name": "threshold",
"weights": { "X": 1.0 },
"thresholds": [-1.0, -0.2, 0.4, 1.1]
}
}
}
}
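With cardinality 5 there are four cut points, and the category is the index of the interval the latent value falls into. A minimal sketch of that mapping, assuming the latent value is the weighted parent value and that a value exactly on a threshold goes to the upper category (both are assumptions, and threshold_category is a hypothetical name):

```python
from bisect import bisect_right


def threshold_category(latent, thresholds):
    """Map a latent value to a category index using sorted thresholds.

    k thresholds split the real line into k + 1 intervals, numbered 0..k.
    """
    return bisect_right(thresholds, latent)


thresholds = [-1.0, -0.2, 0.4, 1.1]
weight_x = 1.0

for x in [-2.0, -0.5, 0.0, 0.9, 2.0]:
    print(x, threshold_category(weight_x * x, thresholds))
```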
To use fixed theoretical threshold placement:
{
"node_params": {
"C": {
"type": "categorical",
"cardinality": 5,
"categorical_model": {
"name": "threshold",
"threshold_loc": 0.0,
"threshold_scale": 1.0
}
}
}
}
Categorical to continuous (stratum-specific means)
{
"simulation_params": { "n_samples": 300, "seed": 60 },
"graph_params": {
"type": "custom",
"nodes": ["C1", "C2", "Y"],
"edges": [["C1", "Y"], ["C2", "Y"]]
},
"node_params": {
"C1": { "type": "categorical", "cardinality": 3 },
"C2": { "type": "categorical", "cardinality": 2 },
"Y": {
"type": "continuous",
"functional_form": {
"name": "stratum_means",
"default_mean": 0.0,
"strata_means": {
"C1=0|C2=0": -1.5,
"C1=1|C2=0": 0.2,
"C1=2|C2=1": 1.8
}
},
"noise_model": { "name": "additive", "dist": "gaussian", "std": 0.15 }
}
}
}
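The strata_means keys join the parent assignments with "|", in the form "C1=0|C2=0". The lookup can be sketched as below. The fallback to default_mean for strata that are not listed is inferred from the parameter name and is an assumption, as are the helper names.

```python
def stratum_key(parent_values, parent_order):
    """Build a key such as 'C1=0|C2=0' from a dict of parent values."""
    return "|".join(f"{name}={parent_values[name]}" for name in parent_order)


def stratum_mean(parent_values, parent_order, strata_means, default_mean=0.0):
    """Look up the mean for one stratum; unlisted strata use default_mean (assumed)."""
    return strata_means.get(stratum_key(parent_values, parent_order), default_mean)


strata_means = {"C1=0|C2=0": -1.5, "C1=1|C2=0": 0.2, "C1=2|C2=1": 1.8}
order = ["C1", "C2"]

print(stratum_mean({"C1": 0, "C2": 0}, order, strata_means))  # listed stratum
print(stratum_mean({"C1": 2, "C2": 0}, order, strata_means))  # not listed, falls back
```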
Mixed parents under stratum_means
When stratum_means has both categorical and metric parents, you can supply
metric_weights (a per-parent dict or a single number) for the metric
contribution. Omit it to have weights sampled from the random-weight
distribution.
{
"simulation_params": {
"n_samples": 300,
"seed": 61,
"categorical_parent_metric_form_policy": "stratum_means"
},
"graph_params": {
"type": "custom",
"nodes": ["C", "X", "Y"],
"edges": [["C", "Y"], ["X", "Y"]]
},
"node_params": {
"C": { "type": "categorical", "cardinality": 3 },
"X": { "type": "continuous", "distribution": { "name": "gaussian", "mean": 0, "std": 1 } },
"Y": {
"type": "continuous",
"functional_form": {
"name": "stratum_means",
"strata_means": { "C=0": -1.0, "C=1": 0.0, "C=2": 1.5 },
"metric_weights": { "X": 0.8 }
},
"noise_model": { "name": "additive", "dist": "gaussian", "std": 0.2 }
}
}
}
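As a rough illustration of the mixed-parent case, the sketch below assumes the metric contribution is added to the stratum mean as a weighted sum. That additive combination is an assumption made for this example and is not stated on this page. mixed_mean is a hypothetical name.

```python
def mixed_mean(cat_key, metric_values, strata_means, metric_weights):
    """Stratum mean plus a weighted sum of the metric parents (assumed additive)."""
    if isinstance(metric_weights, (int, float)):
        # A single number is applied to every metric parent.
        metric_weights = {name: metric_weights for name in metric_values}
    metric_part = sum(metric_weights[name] * value for name, value in metric_values.items())
    return strata_means[cat_key] + metric_part


strata_means = {"C=0": -1.0, "C=1": 0.0, "C=2": 1.5}

# Per-parent dict and single-number forms give the same result with one metric parent.
print(mixed_mean("C=2", {"X": 0.5}, strata_means, {"X": 0.8}))
print(mixed_mean("C=2", {"X": 0.5}, strata_means, 0.8))
```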
CI oracle output
{
"simulation_params": {
"n_samples": 250,
"seed": 70,
"store_ci_oracle": true,
"ci_oracle_max_cond_set": 2
},
"graph_params": {
"type": "custom",
"nodes": ["X", "Y", "Z"],
"edges": [["X", "Z"], ["Y", "Z"]]
}
}
When store_ci_oracle is enabled, simulate() also returns a ci_oracle
list with entries of the form:
{
"x": "X",
"y": "Y",
"conditioning_set": ["Z"],
"is_independent": false
}
The oracle iterates over every ordered pair \((X, Y)\) and every
conditioning subset \(S\) of size \(\le\) ci_oracle_max_cond_set
(default 2); both independent and dependent triples are recorded.
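The size of the oracle output grows quickly with the number of nodes, so it can help to count the queries up front. The sketch below only enumerates the (x, y, conditioning set) triples as described above. It does not compute the independence verdict, and enumerate_ci_queries is a hypothetical helper, not part of the simulator.

```python
from itertools import combinations, permutations


def enumerate_ci_queries(nodes, max_cond_set=2):
    """List every ordered pair (x, y) with every conditioning subset up to max_cond_set."""
    queries = []
    for x, y in permutations(nodes, 2):
        rest = [n for n in nodes if n not in (x, y)]
        for size in range(min(max_cond_set, len(rest)) + 1):
            for cond in combinations(rest, size):
                queries.append((x, y, list(cond)))
    return queries


queries = enumerate_ci_queries(["X", "Y", "Z"], max_cond_set=2)
print(len(queries))
for q in queries[:4]:
    print(q)
```

For the three-node collider config above, the enumeration contains the triple ("X", "Y", ["Z"]), matching the example entry shown earlier.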
simulate() return value
CausalDataGenerator(config).simulate() returns a dict with the following
keys:
data: a pandas.DataFrame of shape (n_samples, n_nodes) containing the simulated values.
dag: a networkx.DiGraph representing the realized DAG.
parametrization: a deep copy of the input config with every randomly sampled value (weights, intercepts, thresholds, stratum means, noise parameters, marginals, derived seed_structure / seed_data, inferred node types) filled in. Suitable for round-tripping to JSON to reproduce the exact DGP.
ci_oracle (only present when store_ci_oracle = true): the list of oracle entries described above.
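The JSON round trip for parametrization can be sketched with the standard library. The example uses a small stand-in dict in place of a real simulate() result, because the import path for CausalDataGenerator is not shown on this page. The helper names are hypothetical, and the sketch assumes the parametrization holds only JSON-serializable values.

```python
import json
import os
import tempfile


def dump_parametrization(result, path):
    """Write the fully resolved config from a simulate() result to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(result["parametrization"], fh, indent=2)


def load_parametrization(path):
    """Read a saved config back so it can be passed to the generator again."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Stand-in for the dict returned by simulate(); the values are illustrative only.
result = {
    "parametrization": {
        "simulation_params": {"n_samples": 200, "seed_structure": 42, "seed_data": 43},
        "graph_params": {"type": "custom", "nodes": ["X", "Y"], "edges": [["X", "Y"]]},
    }
}

path = os.path.join(tempfile.mkdtemp(), "dgp.json")
dump_parametrization(result, path)
restored = load_parametrization(path)
print(restored == result["parametrization"])
```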