How Civilization's First Exam scores model governance under uncertainty.
Objective Total reflects survival, stability, equity, adaptability, and welfare. Decision Integrity reflects counterfactual quality, judge consistency, and anti-gaming reliability.
Step 01
Scenario Execution
Run model adapters across stress scenarios and fixed seeds, capturing a trace at every tick.
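As a rough illustration of this step, the sketch below runs one adapter over a grid of scenarios and seeds and records a trace entry per tick. Everything in it is an assumption made for illustration: the names `Trace`, `run_episode`, `run_grid`, and `cautious_adapter`, the scenario labels, and the toy stress dynamics are hypothetical and are not taken from the benchmark itself.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Per-episode record; one dict is appended per tick (hypothetical schema)."""
    scenario: str
    seed: int
    ticks: list = field(default_factory=list)


def run_episode(adapter, scenario, seed, n_ticks=5):
    """Run one (scenario, seed) episode and capture every tick."""
    rng = random.Random(seed)  # a fixed seed makes the episode reproducible
    state = {"scenario": scenario, "stress": 0.0}
    trace = Trace(scenario, seed)
    for tick in range(n_ticks):
        shock = rng.random()            # toy stand-in for a stress event
        action = adapter(state, shock)  # the adapter is any callable policy
        state["stress"] = max(0.0, state["stress"] + shock - action)
        trace.ticks.append(
            {"tick": tick, "shock": shock, "action": action, "stress": state["stress"]}
        )
    return trace


def run_grid(adapter, scenarios, seeds, n_ticks=5):
    """Cross every scenario with every seed."""
    return [run_episode(adapter, s, seed, n_ticks) for s in scenarios for seed in seeds]


def cautious_adapter(state, shock):
    """Hypothetical adapter: absorbs half the shock plus a little accumulated stress."""
    return 0.5 * shock + 0.1 * state["stress"]


traces = run_grid(cautious_adapter, ["drought", "pandemic"], [0, 1, 2])
```

Because each episode draws from its own seeded generator, rerunning the same scenario and seed yields an identical trace, which is what makes later comparisons across models meaningful.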
Step 02
Counterfactual Checks
Compare chosen actions against sampled alternatives to measure decision quality deltas.
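One way to read a "decision quality delta" is the chosen action's outcome minus the average outcome of the sampled alternatives. The sketch below works under that reading. The `outcome` scoring function, the uniform sampling of alternatives, and the name `counterfactual_delta` are all assumptions for illustration; the benchmark's actual scoring may differ.

```python
import random
from statistics import mean


def outcome(stress, shock, action):
    """Hypothetical one-step score (higher is better).

    Penalizes stress left unabsorbed plus a small cost for acting.
    """
    residual = max(0.0, stress + shock - action)
    return -(residual + 0.2 * action)


def counterfactual_delta(stress, shock, chosen, n_alternatives=32, seed=0):
    """Chosen action's score minus the mean score of sampled alternatives."""
    rng = random.Random(seed)  # fixed seed keeps the sampled alternatives reproducible
    alternatives = [rng.uniform(0.0, 1.0) for _ in range(n_alternatives)]
    baseline = mean(outcome(stress, shock, a) for a in alternatives)
    return outcome(stress, shock, chosen) - baseline


# An action that exactly absorbs the shock beats the sampled baseline;
# doing nothing falls below it.
good = counterfactual_delta(stress=0.2, shock=0.5, chosen=0.7)
bad = counterfactual_delta(stress=0.2, shock=0.5, chosen=0.0)
```

A positive delta means the model did better than a randomly sampled alternative would have on average; a negative delta means it did worse.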
Step 03
Judge Reliability
Apply pairwise swap consistency and disagreement instability diagnostics.
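Pairwise swap consistency can be sketched as follows: show the judge each pair in both orders and count how often the same item wins. The judge interface (a callable returning `0` for the first item or `1` for the second) and both toy judges are hypothetical. The disagreement instability diagnostic is not covered here, since the text does not define it.

```python
def swap_consistency(judge, pairs):
    """Fraction of pairs whose winner is the same in both presentation orders."""
    if not pairs:
        raise ValueError("need at least one pair")
    consistent = 0
    for a, b in pairs:
        forward = a if judge(a, b) == 0 else b   # winner when a is shown first
        backward = b if judge(b, a) == 0 else a  # winner when b is shown first
        consistent += forward == backward
    return consistent / len(pairs)


def length_judge(first, second):
    """Toy judge that prefers the longer text regardless of position."""
    return 0 if len(first) >= len(second) else 1


def position_biased_judge(first, second):
    """Toy judge that always prefers whatever is shown first."""
    return 0


pairs = [
    ("ration water", "do nothing"),
    ("close borders", "fund clinics and testing"),
]
stable = swap_consistency(length_judge, pairs)
biased = swap_consistency(position_biased_judge, pairs)
```

A score near 1.0 suggests the judge's preference does not depend on presentation order; a score near 0.0 flags strong position bias.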
Step 04
Ranking + Governance
Publish ranked outcomes with anti-gaming policy and dispute/appeal workflow.