# Adversarial Audit — Engine + Dashboard

**Date:** 2026-05-03
**Engine version audited:** 0.1.0
**Auditor stance:** adversarial. Goal is to find what's wrong, not to certify what's right.

---

## TL;DR

The dashboard markets itself as "Predictive Intelligence Briefing." The engine that produces those probabilities has **never been scored against a single resolved outcome** (`actuals.yaml` is empty; `engine_history.json` has 1 snapshot). Today's published probabilities are also not even *engine* output — `signals.yaml` sets `score_override` on all four condition scores, so the structural model is bypassed. Every downstream layer (psych modifiers, Bayesian updates, Monte Carlo, game-theoretic Nash search, Kelly-sized "alpha signals") then operates on top of those human-typed numbers, dressed in formal apparatus.

The output is internally consistent, well-organized, and visually persuasive. It is not, on any current evidence, *predictive*. Until calibration data accumulates, the headline framing is overpromised.

---

## TIER 1 — Validity gaps that defeat the predictive claim

### F1. The calibration spine has nothing to calibrate against
- `engine_history.json` → 1 entry (today).
- `actuals.yaml` → `resolved: []`.
- `compute_calibration` returns `brier_aggregate: None` for every emit.
- Every probability shown to a reader is unfalsified and currently unfalsifiable from inside the system.

**Implication:** the dashboard's confidence-band, ensemble weights, and "DIVERGENT/CONVERGENT" badges are decorative. There is no historical Brier score, no reliability-diagram fit, no out-of-sample validation. "Model says 19%" is one analyst's typed inputs propagated through chosen formulas — not a calibrated forecast.

### F2. The structural model is bypassed in production today
`signals.yaml` currently contains:
```yaml
condition_inputs:
  deal:            {score_override: 0.42, ...}
  us_exit:         {score_override: 0.64, ...}
  iran_acceptance: {score_override: 0.30, ...}
  escalation:      {score_override: 0.42, ...}
```
Every condition score returns its override (`compute.py:36`). The structural branch (`base = 0.20 + 0.25 if iran_proposal_active + ...`) does not run. The dashboard is rendering downstream consequences (psych deltas, ensemble synthesis, MC rollouts) of human-typed scores.

This is fine *as a workflow* — but the UI does not announce it. A reader sees "decisionEngine.dealAvailability.score: 0.42" and assumes the engine derived it. It did not.
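A minimal sketch of the short-circuit pattern described above; the function name, field names, and structural increments here are illustrative reconstructions, not the actual `compute.py` code:

```python
def condition_score(inputs: dict) -> float:
    """Return the analyst override when present; otherwise run the structural model.

    Illustrative reconstruction of the compute.py:36 pattern, not the actual code.
    """
    override = inputs.get("score_override")
    if override is not None:
        return override          # structural branch below never executes

    # Structural branch (shape quoted in the audit: base prior plus event-driven increments).
    score = 0.20
    if inputs.get("iran_proposal_active"):
        score += 0.25
    # ... further increments elided ...
    return min(score, 1.0)

# With today's signals.yaml, every call takes the first return:
print(condition_score({"score_override": 0.42, "iran_proposal_active": True}))  # 0.42
```

If the real code has this shape, the UI chip proposed in R2 only needs to check `score_override is not None` at emit time.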

### F3. Inputs are unscored guesses with no source attribution
Every value in `stakeholders`, `iran_regime_dynamics`, `us_dynamics`, `iran_deep_dynamics`, `world_dynamics`, `historical_ideology`, `historical_analogs`, `exotic_signals` is hand-typed.

- `khamenei.religious_zeal: 0.95` — no rubric, no citation.
- `regime_brittleness: 0.35` — no source.
- `historical_analogs.korean_war_armistice_1953: 0.6` — analyst's similarity intuition, fed back as analog weight, recovered as "top analog: Korean War." Circular.
- `qom_seminary_dissent: 0.15`, `mojtaba_succession_lock: 0.6`, `gcc_realignment: 0.7` — vibes.

There is no rubric file, no inter-rater reliability, no source-link convention. Two equally informed analysts would type wildly different numbers and the engine would emit wildly different probabilities.

### F4. Point estimates without uncertainty propagation
- All inputs are scalars, not distributions.
- The engine treats `0.95` and `0.85` as exact, multiplies them, reports outputs to 3 decimal places.
- "Confidence intervals" in `resolution_probability` are heuristic gaps (`spread = max(10, abs(model − market))`), not statistical CIs.
- A single unstable input (e.g., `religious_zeal` typed as 0.85 vs 0.95) shifts `iranAcceptance` delta by ~3pp through the multiplicative chain. No sensitivity analysis surfaces this.
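A one-at-a-time perturbation check would surface this cheaply. A minimal sketch, with a toy multiplicative chain standing in for the engine's actual `iranAcceptance` pathway (the formula and coefficients below are invented for illustration):

```python
def perturb(fn, inputs: dict, key: str, delta: float) -> float:
    """Output shift when a single scalar input moves from (value - delta) to (value + delta)."""
    lo = {**inputs, key: inputs[key] - delta}
    hi = {**inputs, key: inputs[key] + delta}
    return fn(hi) - fn(lo)

# Toy multiplicative chain standing in for the engine's iranAcceptance pathway (not the real formula).
def toy_iran_acceptance(x: dict) -> float:
    return x["base"] * (1 - 0.6 * x["religious_zeal"]) * x["regime_flexibility"]

inputs = {"base": 0.42, "religious_zeal": 0.95, "regime_flexibility": 0.55}
print(round(perturb(toy_iran_acceptance, inputs, "religious_zeal", 0.05), 4))
# -0.0139: a 0.10 swing in one typed input moves this toy score by ~1.4pp, with nothing surfacing it
```

Run per emit against the real scoring functions, this is the cheapest version of the sensitivity analysis R9 asks for.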

---

## TIER 2 — Math choices that systematically bias output

### F5. Multiplicative independence assumption is wrong-direction biased
`negotiated = dealAvailability × usExitPressure × iranAcceptance` (`compute.py:115`).

This treats the three preconditions as **independent**. In reality:
- US exit pressure ↑ correlates with deal availability ↑ (US accepts narrower terms when domestic pain is high).
- Iran acceptance ↑ correlates with deal availability ↑ (mediator activity surges precisely when Iran signals flexibility).

Multiplying positively-correlated probabilities as if independent **understates** the true joint probability. Effect: the model systematically suppresses the `deal` bucket. That is exactly the bucket the dashboard headlines.
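A two-event illustration of the bias direction (the marginals and the conditional here are invented for the illustration):

```python
# Two positively correlated preconditions, each with marginal probability 0.5.
p_a = 0.5
p_b = 0.5
p_b_given_a = 0.7                          # positive dependence: B is likelier once A holds

joint_as_independent = p_a * p_b           # 0.25 - what a naive product computes
joint_with_dependence = p_a * p_b_given_a  # 0.35 - the true joint under this dependence

print(joint_as_independent, joint_with_dependence)  # 0.25 0.35: independence understates by 10pp
```

With three positively correlated preconditions, as in `negotiated`, the understatement compounds further.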

### F6. Ensemble weights are asymmetric across buckets
`synthesized_outcome_probabilities` (`compute.py:810-826`):
- Structural: 0.45
- Historical: 0.30
- Market: 0.20 — but **only applies to the `deal` bucket** (`polymarket_deal_by_jun30_pct` is the only market signal in `signals.yaml`).
- For `escalation`/`protracted`/`intervention`: the market term is `None`, so the code drops it from the ensemble and renormalizes.

Result: the *effective* structural+historical weights are 0.60/0.40 for non-deal buckets and 0.45/0.30 for deal. No documented rationale. This skews the relative ranking of buckets in ways no reader can audit.
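A sketch of the renormalization asymmetry. The weights are the ones listed above; whether the full three-layer set is itself renormalized over its 0.95 sum is an assumption, and the helper is illustrative rather than the actual `synthesized_outcome_probabilities` code:

```python
BASE_WEIGHTS = {"structural": 0.45, "historical": 0.30, "market": 0.20}

def effective_weights(available_layers: set) -> dict:
    """Drop layers with no signal and renormalize the rest - the behavior the audit flags."""
    kept = {k: w for k, w in BASE_WEIGHTS.items() if k in available_layers}
    total = sum(kept.values())
    return {k: round(w / total, 3) for k, w in kept.items()}

print(effective_weights({"structural", "historical", "market"}))
# deal bucket:   {'structural': 0.474, 'historical': 0.316, 'market': 0.211}
print(effective_weights({"structural", "historical"}))
# other buckets: {'structural': 0.6, 'historical': 0.4}
```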

### F7. Same input triple- and quadruple-counted
`khamenei.religious_zeal` enters the output through:
1. `_religious_zeal_lock` in `psych_modifiers` (subtracts from iranAcceptance).
2. `iran_zeal` payoff coefficient in `game_theory_equilibrium`.
3. `deep_read` narrative threshold (>0.85 triggers "Khamenei lock" paragraph).
4. `ultraread` synthesis paragraph.
5. (Indirectly, again) the `structural_iran_pressure` term in `forward_projections`.

Each pathway adds its own contribution. A reader sees five different surfaces (modifier delta, Nash payoff, deep_read bullet, ultra_read paragraph, forward projection) all reinforcing the same single typed input. The signal is not multiplied by evidence; it is multiplied by displays.

### F8. Magic-number coefficients everywhere, none sourced
Sample from `compute.py` and `advanced.py`:
```
base = 0.20         # deal_availability prior — why 0.20 not 0.10 or 0.30?
* 0.25, * 0.30      # condition-input weights
* 0.85              # escalation = esc * (1 - deal) * 0.85
* 0.60              # ego_lock_penalty * 0.6
midterm_amplifier   # max(0.5, 1.0 - days/730)
* 1.5               # confidence = 1 - disagreement * 1.5
fracture > 0.20     # threshold for shifting protracted → deal
```
None are calibrated, none are sensitivity-tested, none cite a source. Sweeping any single one across [0.5×, 1.5×] of its current value would noticeably shift the published probabilities.

### F9. Threshold cliffs in forward projections
`forward_projections` (`compute.py:455-475`):
```python
"direction": "rising" if structural_iran_pressure > 0.2 else "falling"
"direction": "rising" if scores["escalationProximity"] > 0.5 else "flat"
"direction": "rising" if structural_iran_pressure + structural_us_pressure > 0.3 else "flat"
```
A value of 0.21 vs 0.19 flips direction completely. These should be smooth (sigmoid, or signed magnitude with a deadband), not step functions.
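One way to remove the cliff, anticipating R7: a signed direction score with a deadband, so 0.19 vs 0.21 no longer flips the label (the center, deadband, and scale values are illustrative):

```python
import math

def direction_signal(x: float, center: float = 0.2, deadband: float = 0.05, scale: float = 0.1) -> float:
    """Smooth signed direction in [-1, 1]: zero inside the deadband, ramping smoothly outside it."""
    excess = x - center
    if abs(excess) <= deadband:
        return 0.0
    shifted = excess - math.copysign(deadband, excess)
    return math.tanh(shifted / scale)

for v in (0.19, 0.21, 0.30, 0.45):
    print(v, round(direction_signal(v), 2))
# 0.19 0.0 | 0.21 0.0 | 0.3 0.46 | 0.45 0.96 - near-threshold noise no longer flips the label
```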

### F10. Crystallization-trigger priors are unsourced, and at least one is ~3-5× too high
`crystallization_triggers` (`compute.py:958-1036`) lists `prior_p` values that don't survive a sniff test:
- "Khamenei dies or steps down in 90 days, prior_p = 0.10." Actuarial baseline for an 86-year-old male over 90 days ≈ 2-3%. The 0.10 figure is ~3-5× too high.
- "US gas crosses $4.50 in 21 days, prior_p = 0.45" — gas is currently $4.30 with a 47% war-driven move; 0.45 is plausible but unsourced.
- "China announces sanctions enforcement in 60 days, prior_p = 0.05" — defensible upper bound, but again unsourced.

The "watch-for highest-leverage trigger" surfaced in `ultraread` is sensitive to these priors. If the Khamenei-dies prior is 0.03 instead of 0.10, the headline trigger changes.

### F11. Bayesian updates run on fabricated likelihoods
`bayesian_evidence_chain` calls `bayesian_update(0.10, 0.85, 0.15)` etc. The math is correct Bayes; the inputs (`P(low appearances | succession imminent) = 0.85`) are made up. There is no historical base rate of supreme-leader public-appearance frequencies preceding successions, no reference event, no posterior validated against an outcome.

This is cargo-cult Bayesianism: correct apparatus on invented numbers.
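The update itself is trivial to verify; the problem is where the likelihoods come from. Below is a worked run of the quoted call next to the same update under a modestly different invented likelihood (the signature mirrors the quoted `bayesian_update(prior, p_e_given_h, p_e_given_not_h)` call; the 0.60 alternative is mine):

```python
def bayesian_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Standard single-evidence Bayes update."""
    numerator = prior * p_e_given_h
    return numerator / (numerator + (1 - prior) * p_e_given_not_h)

print(round(bayesian_update(0.10, 0.85, 0.15), 3))  # 0.386 - with the invented 0.85 likelihood
print(round(bayesian_update(0.10, 0.60, 0.15), 3))  # 0.308 - a different invented likelihood, different story
```

A 0.25 change in an unsourced likelihood moves the posterior by ~8pp, and nothing in the system constrains that choice.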

---

## TIER 3 — Model-as-system problems

### F12. Reflexivity adjustment is symbolic and arguably buggy
`reflexivity_adjustment` (`advanced.py:450-467`) docstring promises mean-reversion to 0.5 ("max-uncertainty"). The code uses `0.20 * publication_impact`, anchoring to **0.20**, not 0.50. So extreme predictions get pulled toward 20%, not toward true uncertainty. Likely a bug.
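A reconstruction of the suspected bug next to what the docstring implies, with the nudge size fixed at the 5% mentioned below (both function bodies are inferred from the audit's description, not copied from `advanced.py`):

```python
def reflexivity_adjustment_current(p: float, publication_impact: float) -> float:
    # As described: the nudge anchors toward 0.20, so high predictions are pulled toward 20%.
    return p + (0.20 - p) * 0.05 * publication_impact

def reflexivity_adjustment_intended(p: float, publication_impact: float) -> float:
    # What the docstring promises: mean-reversion toward 0.5 (maximum uncertainty).
    return p + (0.50 - p) * 0.05 * publication_impact

print(round(reflexivity_adjustment_current(0.80, 1.0), 3))   # 0.77  - pulled toward 0.20
print(round(reflexivity_adjustment_intended(0.80, 1.0), 3))  # 0.785 - pulled toward 0.50
```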

More fundamentally: the "reflexivity" treatment is a fixed 5% nudge with no mechanism. Real reflexivity (model output → Polymarket move → market layer feedback) is not modelled. If the dashboard is widely read, the market layer becomes correlated with the model layer, and the ensemble starts citing itself.

### F13. The market layer is one thin signal
`polymarket_deal_by_jun30_pct` is the only prediction-market signal feeding the ensemble. Polymarket geopolitical contracts on this scope typically have:
- Low depth ($10K-$1M order books; the `polymarket_volume_24h_usd: 280000000` claimed in `signals.yaml` is a volume figure, not depth, and looks ~100× too large anyway; see F28 and verify).
- Wide spreads (5-10pp).
- Reflexive correlation with Twitter/news sentiment.

Using a single illiquid contract as 20% of the deal ensemble — and 0% of every other bucket — mis-states the layer's information content.

### F14. No data freshness gating
The engine produces output every emit regardless of input age. If `signals.yaml` is two weeks stale, the dashboard still renders "D63 ENSEMBLE" with confidence labels, no staleness banner. The cron auto-commit makes this less likely in practice, but there is no defensive check in the engine.

### F15. Signals were not validated as predictive
The exotic-signals list (`friday_prayer_attendance_index`, `irgc_promotion_velocity`, `iaea_inspector_access_score`, `iranian_dark_fleet_active_tankers`, `khamenei_public_appearance_freq_30d`) reads like CIA tradecraft. None of these signals have a documented hit-rate against historical regime-fracture or escalation events. Several have known false-positive cases (Brezhnev had low appearance frequency for years pre-death; Iranian dark fleet expanded after JCPOA collapse without regime fracture).

A "leading indicator" that has not been backtested as leading is just a number.

### F16. Counterfactuals hold-other-things-equal implausibly
`counterfactual_no_trump` swaps Trump's psych profile to neutral baselines but leaves `christian_nationalist_pressure`, `gas_price_pain_index`, `midterms_proximity_days`, and the entire stakeholder cabinet untouched. A non-Trump president would have a different cabinet, a different MAGA-base dynamic, a different Hegseth (or no Hegseth). The ceteris-paribus is too strong to support the published delta.

### F17. `alphaSignals` recommends specific trades with no edge claim
`alpha_signals` (`compute.py:872-949`) emits:
- `LONG Brent calls $130 strike, 30-60 DTE`
- `LONG marine war-risk insurance carriers (Lloyd's syndicates)`
- `LONG gold (XAU); LONG bitcoin Iran-premium arbitrage`
- `LONG Iran-restoration plays (frozen ADRs proxies)`

With Kelly-fraction sizing (`advanced.py:410`).

There is no backtested edge supporting any of these. Telling a reader to lever up on options based on an unvalidated psych-modifier model — with sizing — crosses from analysis into investment advice. The risk is asymmetric: when wrong, the reader loses money; when right, the model takes credit it didn't earn.

### F18. Game theory is fabricated payoff matrices, not real Nash analysis
`game_theory_equilibrium` (`advanced.py:32-134`) constructs a 4×4 payoff matrix from linear functions of `trump_ego`, `iran_zeal`, `gas_pain` with hand-set coefficients (`8 + trump_ego * 1.5`, etc.). Then runs Nash search over this fabricated matrix.

The math (Nash equilibrium search, Pareto comparison) is correct. The inputs (the payoffs themselves) are not derived from any data. Calling the result a "Nash equilibrium" gives it a credential it has not earned. This is **scenario scoring with utility labels**, not game-theoretic analysis.
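To make the "scenario scoring with utility labels" point concrete, here is a minimal pure-strategy Nash search over a payoff matrix built the same way, from linear functions of typed traits. The matrix size, action labels, and coefficients are all illustrative, not the ones in `advanced.py`:

```python
from itertools import product

trump_ego, iran_zeal = 0.9, 0.95   # typed inputs, not data

# Payoffs fabricated from linear functions of the typed traits.
us_actions, iran_actions = ["deal", "strike"], ["accept", "resist"]
payoff = {
    ("deal",   "accept"): (8 + trump_ego * 1.5, 6 - iran_zeal * 2.0),
    ("deal",   "resist"): (3 - trump_ego * 1.0, 5 + iran_zeal * 1.0),
    ("strike", "accept"): (6 + trump_ego * 2.0, 2 - iran_zeal * 1.0),
    ("strike", "resist"): (4 + trump_ego * 1.0, 4 + iran_zeal * 0.5),
}

def is_nash(u, i):
    us_ok   = all(payoff[(u, i)][0] >= payoff[(alt, i)][0] for alt in us_actions)
    iran_ok = all(payoff[(u, i)][1] >= payoff[(u, alt)][1] for alt in iran_actions)
    return us_ok and iran_ok

print([cell for cell in product(us_actions, iran_actions) if is_nash(*cell)])
# [('strike', 'resist')] with these invented numbers - the search is correct, the payoffs are not data
```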

### F19. Information cascades are scripted, not modeled
`information_cascade("khamenei_dies")` returns a hand-typed nine-event script with hand-typed days. It is not derived from any model. It reads like a screenplay an analyst wrote, surfaced through a function call that looks computational.

### F20. Historical-analog matching is reflexive
The user types `korean_war_armistice_1953: 0.6` because they think this conflict resembles Korea. The engine then weights Korea highest and announces "top analog: Korean War armistice." This *confirms* the input — there is no independent matching algorithm comparing current signals to historical-period signals.

---

## TIER 4 — Selection / framing biases

### F21. Stakeholder list omits decisive actors
12 stakeholders chosen. Omitted with measurable agency:
- **Erdoğan** (Turkey, hosting talks per signals.yaml D63 notes — "Turkey 45-60 day extension offer").
- **MBS** (Saudi pivot, GCC realignment lead).
- **Xi** (China oil-buyer commitment is the single largest external lever on Iran's economy).
- **Mossad chief / IDF chief** (Israel's military decisions are not Netanyahu's alone).
- **Aoun / Berri** (Lebanon front).
- **Sistani** (Iraqi Shia clerical authority — single biggest external Shia legitimacy check on Khamenei).

Each omission is a vote not cast. Adding them would change every downstream output.

### F22. Outcome bucket choice forecloses important paths
`{deal, escalation, protracted, intervention, other}` — but:
- `regime_collapse` exists in Monte Carlo output but **not** in the synthesized 5-bucket distribution that the dashboard headlines.
- Israel-only war (US disengaged, Israel-Iran direct) is not a bucket.
- Pakistan-mediated framework deal vs Russia-mediated vs Oman-mediated are collapsed into one `deal` bucket, but they imply very different trajectories.
- Partial-deal scenarios (sanctions relief without nuclear framework) have no bucket.

The bucket choice anchors the answer.

### F23. Confidence label is mis-named
`confidence_score = 1 - max-pairwise-disagreement * 1.5` measures **inter-layer disagreement**, not statistical confidence in the headline number. A reader sees "DIVERGENT — 19% deal" and interprets it as "the 19% is uncertain." It actually means "structural says X, historical says Y, market says Z, and they don't agree."

The 19% itself has no CI. This is a different concept from confidence.

### F24. Headline ensemble pulls toward consensus
`resolution_probability` (`compute.py:155`):
```python
estimate = round(model_pct * 0.4 + polymarket * 0.4 + base_rate * 0.2)
```
40% weight on Polymarket means the model **cannot strongly disagree with prediction markets**. The worst sin in forecasting is the inability to call the market wrong when you have an actual edge. (See F1: there is no validated edge yet — but the architecture forecloses one even if it existed.)
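The arithmetic bound is easy to see; apart from the quoted 0.4/0.4/0.2 weights, the numbers below are illustrative:

```python
def headline(model_pct: float, polymarket: float, base_rate: float) -> int:
    return round(model_pct * 0.4 + polymarket * 0.4 + base_rate * 0.2)

# Each point of model-vs-market disagreement moves the headline by only 0.4 points:
print(headline(80, 19, 25))   # 45 - the model "says" 80, the reader sees 45
print(headline(25, 19, 25))   # 23 - a mildly divergent model barely registers
```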

---

## TIER 5 — Operational / governance

### F25. No engine-version → behavior diff
Calibration snapshots are tagged `engineVersion: 0.1.0`. When the engine bumps to 0.2.0, replays of past signals will produce different outputs. There is no:
- Diff harness (replay yesterday's signals on new engine, surface deltas).
- Deprecation policy.
- Mapping of which past snapshots can/can't be re-scored under a new engine version.

### F26. Snapshot reproducibility is not tested
There is no CI test that asserts: "given signals.yaml from 2026-04-29, engine v0.1.0 produces the same war-data.json today as it did on 2026-04-29." Pytest covers individual functions, not end-to-end determinism.

### F27. Auto-cron writes to production with no human gate
The 7am ET cron fetches news, edits `signals.yaml`, runs the engine, commits, and pushes to main. Vercel auto-deploys. If the news-fetcher hallucinates (LLM news summarization is not reliable) or mis-classifies an event, the bad signal is published before any human sees it.

### F28. `polymarket_volume_24h_usd: 280000000` looks 100× wrong
Total Polymarket 24h volume across **all markets** is typically $20M-$50M. A single Iran market clocking $280M/day is implausible — almost certainly off by two decimal places, or the field captured the wrong metric (open interest? cumulative? all-Iran-tagged-markets?). Worth verifying before this number is cited anywhere.

---

## RECOMMENDATIONS

### P1 — Stop overpromising until calibration exists

**R1.** Add a prominent, persistent "EXPERIMENTAL — UNCALIBRATED" badge to the dashboard header. Remove the "Predictive Intelligence Briefing" tagline until at least 3 outcomes have resolved and Brier scores can be reported. Replace with "Structured Scenario Analysis" or similar.

**R2.** When `score_override` is set on any condition, render a visible chip next to that score: "HUMAN OVERRIDE — engine bypassed." Today, four such overrides are live and the UI does not announce it.

**R3.** Move `alphaSignals` (specific trade recommendations with Kelly sizing) behind an opt-in dev flag, OR retitle them clearly: "Hypothetical asymmetric-payoff scenarios — no backtested edge. NOT INVESTMENT ADVICE." Current presentation is dangerous.

**R4.** Add a freshness gate: if `signals.yaml` mtime is > 36h, render a banner "Signals stale — last update [N] hours ago" and reduce the displayed precision of all probabilities.
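A minimal sketch of the gate, assuming the engine can see the signals file path at emit time (the `staleness_banner` helper and its payload keys are illustrative):

```python
import os, time
from typing import Optional

STALE_AFTER_HOURS = 36

def staleness_banner(signals_path: str = "signals.yaml") -> Optional[dict]:
    """Return a banner payload when signals.yaml is older than the freshness threshold."""
    age_hours = (time.time() - os.path.getmtime(signals_path)) / 3600
    if age_hours <= STALE_AFTER_HOURS:
        return None
    return {
        "banner": f"Signals stale - last update {int(age_hours)} hours ago",
        "reduce_precision": True,   # downstream: show coarser probabilities while stale
    }
```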

### P2 — Fix the math that biases

**R5.** Replace the independence-assumption joint probability for `negotiated`. Use a calibrated joint distribution or a copula. At minimum, document the bias direction in the methodology surface.

**R6.** Make ensemble weights bucket-symmetric. When the market layer has no signal for a bucket, do **not** drop+renormalize — explicitly carry "no market signal" as zero-weight on that bucket, and keep structural/historical weights matched across all buckets.

**R7.** Smooth threshold cliffs in `forward_projections`. Replace `if x > 0.2 else` step functions with smooth direction signals (signed magnitude with a deadband, or sigmoid).

**R8.** Single-source-of-truth for stakeholder fields. `khamenei.religious_zeal` should enter the output through **one** scoring pathway, not five. Consolidate `psych_modifiers`, `game_theory_equilibrium`, `deep_read`, `ultraread`, and `forward_projections` so each input contributes once with a documented weight.

**R9.** Sensitivity-test every magic coefficient. Sweep each ±50% and record which outputs move >5pp. Output a `sensitivity.json` per emit. Coefficients with high leverage need either citation or interval-typed values that propagate uncertainty.
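A sketch of the sweep harness, assuming an `assemble(signals, coefficients)` entry point that accepts a coefficient mapping and returns bucket probabilities (that signature is an assumption; the real `assemble()` would likely need a thin wrapper):

```python
def sensitivity_sweep(assemble, signals, coefficients, bucket="deal", factors=(0.5, 1.5)):
    """Sweep each coefficient by +/-50% and record how far the named bucket moves (in pp)."""
    baseline = assemble(signals, coefficients)[bucket]
    report = {}
    for name, value in coefficients.items():
        deltas = []
        for f in factors:
            perturbed = {**coefficients, name: value * f}
            deltas.append(abs(assemble(signals, perturbed)[bucket] - baseline) * 100)
        report[name] = round(max(deltas), 2)
    return report

# Per emit: json.dump(sensitivity_sweep(...), open("sensitivity.json", "w"), indent=2)
# Coefficients whose max shift exceeds 5pp need a citation or an interval-typed value.
```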

**R10.** Re-prior the crystallization triggers from sourced base rates. Khamenei actuarial → SSA/WHO life table for 86-year-old males. Gas-price thresholds → historical gas-price daily-move distribution. Document the source on each trigger.

### P3 — Build the calibration that makes claims defensible

**R11.** Create a backfill workflow: when an outcome resolves (`actuals.yaml`), replay the engine on every snapshot before that date, compute Brier, attach to the outcome record. This is the only way the calibration spine ever becomes meaningful.

**R12.** Lock snapshot reproducibility in CI. Add a test that fixes a past `signals.yaml` and asserts `assemble()` produces a stable hash, per engine version.
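A sketch of that test, assuming an `assemble(signals_dict)` entry point and a pinned fixture file (both the import path and the fixture location are assumptions about the repo layout):

```python
import hashlib, json
import yaml

from engine.compute import assemble   # assumed import path

def test_snapshot_is_reproducible():
    """Pinned signals from 2026-04-29 must hash to the same output under engine v0.1.0."""
    with open("tests/fixtures/signals_2026-04-29.yaml") as f:
        signals = yaml.safe_load(f)
    output = assemble(signals)
    digest = hashlib.sha256(
        json.dumps(output, sort_keys=True, default=str).encode()
    ).hexdigest()
    assert digest == "<pinned-digest-for-engine-0.1.0>"  # update only on an intentional version bump
```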

**R13.** Add a `signals_provenance.yaml` that requires a source link or note for every numeric input. Refuse to emit if a required field is unsourced. Forces hygiene.

**R14.** Verify `polymarket_volume_24h_usd: 280000000`. If it's off by 100×, every downstream "market depth" reasoning is wrong.

### P4 — Address selection / framing

**R15.** Add stakeholder slots for Erdoğan, MBS, Xi, and Sistani. Default to neutral profiles. Surface them in the deep-read once populated.

**R16.** Add `regime_collapse` and `partial_deal` to the synthesized outcome distribution. Currently these exist only in Monte Carlo or are folded into `other`.

**R17.** Rename `confidence_score` to `inter_layer_agreement` (because that's what it measures). Reserve "confidence" for a future statistical CI on the headline.

**R18.** Drop the headline ensemble formula's hard-coded 40% Polymarket weight. Either Polymarket is liquid+credible enough to override the model (then weight conditionally on volume/liquidity) or it isn't (then drop it from the headline). The current 40/40/20 makes strong disagreement structurally impossible.

### P5 — Operational hygiene

**R19.** Auto-cron writes to a `proposed-signals` branch, not `main`. Human approves the merge. If unattended commits to main are kept, add a post-commit invariant check (e.g., no signal moved >0.3 in 24h without an explicit `signals.yaml` `change_reason` field).

**R20.** Add an engine-version diff harness: when `engine.__version__` bumps, replay every snapshot in `engine_history.json` and surface (a) the average shift per bucket, (b) any snapshot whose top-bucket changed. Block deploys where shifts exceed a threshold without an explicit changelog entry.

---

## ONE-PARAGRAPH VERDICT

The architecture is ambitious and the surface area (engine, schema, calibration spine, advanced models, market fetchers) is impressive. None of that compensates for the fact that the system is currently (a) overridden by hand at the score-input layer in production, (b) unvalidated against any resolved outcome, (c) presenting opinions of one analyst as if they were ensemble probabilities, and (d) dressing fabricated payoff matrices and invented Bayesian likelihoods in formal language. The path to a defensible product is not more layers — it is fewer layers, sourced inputs, calibration data, honest UI labels about what the numbers are and aren't, and removal of trade-recommendation output until there is a backtested edge. None of that requires major rewriting. It requires deciding to ship a smaller, more honest claim.
