Confusion Trajectory — How Many Mistakes Until the AI Beats the Designer?

Source: Theory | Confidence: Speculative | Category: Meta

Estimated ~100 total confusions from current 47 to a Deep CFR agent that consistently beats the game designer. Most remaining confusions will come from training, not engine work.

The Prediction

As of 2026-04-02, we have logged 47 Claude confusions while building the Evokers engine. The engine is approximately 90% implemented (260-270 of 295 abilities have real implementations, ~30 remain as no-op markers). The question: how many total confusions will accumulate by the time a Deep CFR agent can consistently beat the game's designer, who holds a ~90% winrate against experienced playtesters?

Prediction: ~100 total confusions (roughly 50 more from today).

Reasoning by Phase

Phase 2 Completion — Engine (~5-10 more confusions)

The remaining ~30 no-op abilities are mostly Phase 2C quick-window placeholders and guard passives. The hardest mechanical confusions have already been caught — fusion stat propagation, event timing, copy effects, multi-fusion comma-separated IDs, fatally_wounded as a permanent flag. The remaining abilities are complex but follow established patterns.

The confusions logged so far cluster around timing (Quick resolves BEFORE the triggering action), targeting ("Other" excludes performer, AoE hits allies), and interaction semantics (Sekhmet suppresses Quick actions but not passive triggers — confusion #47). These categories are mostly exhausted.
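The timing rule behind the largest confusion cluster can be shown in a few lines. This is an illustrative sketch, not the engine's API: the names `Action` and `resolve_with_quick_window` are invented, and the only thing the sketch asserts is the ordering rule stated above (a Quick reaction resolves before the action that triggered it).

```python
# Hypothetical sketch of the timing rule the confusions clustered around:
# a Quick reaction resolves BEFORE the action that triggered it.
# All names here are illustrative, not the real engine's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    name: str
    effect: Callable[[list], None]  # appends to a shared event log

def resolve_with_quick_window(trigger: Action, quick_reactions: list, log: list) -> None:
    # Quick reactions go first, then the triggering action resolves.
    for reaction in quick_reactions:
        reaction.effect(log)
    trigger.effect(log)

log = []
attack = Action("attack", lambda l: l.append("attack resolves"))
guard = Action("quick-guard", lambda l: l.append("quick-guard resolves"))
resolve_with_quick_window(attack, [guard], log)
# log order: quick-guard first, then attack
```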

Phase 3 — OpenSpiel Integration (~5-10 confusions)

Encoding 295 abilities into a discrete action space, defining the observation tensor, and filtering legal actions will surface structural confusions — not about how abilities work, but about what information the AI should see and how actions should be represented. We expect fewer confusions here because this is software engineering, not game-mechanics interpretation.
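The shape of that work can be sketched without the real engine. This is not the project's actual encoding: the identity mapping and the mask helper below are invented for illustration, but they match the interface style OpenSpiel expects (a flat integer action space plus a legal-action filter the learner samples from).

```python
# Illustrative sketch (not the project's actual encoding): map 295 ability IDs
# onto a flat discrete action space and build a legal-action mask, the kind of
# filtering a Deep CFR agent needs before sampling an action.
NUM_ABILITIES = 295

def ability_to_action(ability_id: int) -> int:
    # Identity mapping here; a real encoding would also enumerate
    # targets, lanes, fusion choices, etc.
    assert 0 <= ability_id < NUM_ABILITIES
    return ability_id

def legal_action_mask(usable_ability_ids: set) -> list:
    # 1 where the agent may act, 0 elsewhere; the agent samples only over the 1s.
    return [1 if a in usable_ability_ids else 0 for a in range(NUM_ABILITIES)]

mask = legal_action_mask({3, 17, 250})
# exactly three legal actions in the mask
```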

Phase 4 — Training and Iteration (~25-35 confusions)

This is where the bulk of remaining confusions will come from. Self-play will discover:

  • Engine bugs the test suite missed (degenerate strategies that exploit mechanical errors)

  • Edge case interactions between cards that were never tested together

  • Reward signal issues (AI overfitting on one strategic axis — e.g., always racing CP instead of disrupting)

Each of the 5 strategic axes (engines, disruption, tempo, positional, CP pressure) is a potential source of reward-shaping confusions. The AI needs to learn when to switch between axes, not just optimize one.
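The reward-shaping failure mode above can be made concrete with a toy example. The five axis names come from the text; the weights, scores, and the `shaped_reward` helper are invented for illustration only.

```python
# Hypothetical reward-shaping sketch: if the shaped reward overweights one axis
# (say CP pressure), self-play converges on "always race CP" and never learns
# to disrupt. Axis names are from the design; all numbers are invented.
AXES = ("engines", "disruption", "tempo", "positional", "cp_pressure")

def shaped_reward(axis_scores: dict, weights: dict) -> float:
    # A lopsided weight vector is exactly the overfitting failure described above.
    return sum(weights[a] * axis_scores.get(a, 0.0) for a in AXES)

balanced = {a: 0.2 for a in AXES}
lopsided = {a: (1.0 if a == "cp_pressure" else 0.0) for a in AXES}

scores = {"cp_pressure": 1.0, "disruption": 0.5}
# The balanced weighting still credits disruption; the lopsided one ignores it,
# so the agent is never rewarded for learning to disrupt.
```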

Phase 5 — Beating the Designer (~10-15 confusions)

The designer loses 10% of games to strategic surprise — failing to model the opponent's plan. The AI needs to learn the reading/prediction layer: inferring plans from visible information (deployed demons, lane positioning, AP spending patterns) and generating surprise.

Confusions here will be strategic, not mechanical. Misunderstanding why a line is good (information advantage, flexibility) rather than how it works (damage calculation, timing).
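The reading layer can be sketched as a Bayesian belief update over opponent plans. Everything here is invented for illustration — the plan labels, the observed signal, and the likelihoods are not part of the project — but it shows the inference shape: visible information shifts a probability distribution over hidden intentions.

```python
# Minimal Bayesian sketch of the "reading layer": update a belief over the
# opponent's plan from a visible signal (e.g. AP spending patterns).
# Plan labels and likelihood numbers are invented for illustration.
def normalize(d: dict) -> dict:
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

def update_belief(prior: dict, likelihood: dict) -> dict:
    # posterior proportional to prior * P(observation | plan)
    return normalize({plan: prior[plan] * likelihood[plan] for plan in prior})

prior = {"race_cp": 0.5, "disrupt": 0.5}
# Observation: opponent banks AP instead of spending it,
# which (in this toy model) is more likely under a disruption plan.
posterior = update_belief(prior, {"race_cp": 0.2, "disrupt": 0.8})
# The belief now favors "disrupt".
```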

Why Not More?

The confusion rate is decelerating. Early sessions produced 5-8 confusions per wave of ability implementations. Recent sessions produce 1-2. This is because:

  1. Pattern recognition — after 47 confusions, the failure modes are well-documented. The confusions database itself prevents repeats.

  2. Engine maturity — 90% coverage means most interaction patterns are tested. New abilities follow established templates.

  3. Game design coherence — Evokers has consistent internal rules. Once you understand the timing system, targeting rules, and status effect lifecycle, new cards are variations on known patterns.
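The deceleration argument can be turned into a back-of-envelope projection. This is a toy model, not measured data: the decay factor and floor are invented, and the only point is that a geometrically decaying rate with a small floor yields a modest remaining total rather than an unbounded one.

```python
# Back-of-envelope sketch of the deceleration argument: if the per-session
# confusion rate decays geometrically toward a small floor, the remaining
# total stays bounded. All parameters here are invented for illustration.
def projected_remaining(rate: float, floor: float, decay: float, sessions: int) -> float:
    total = 0.0
    for _ in range(sessions):
        total += rate
        rate = max(floor, rate * decay)  # rate shrinks but never hits zero
    return total

# e.g. starting at 2 confusions/session, decaying 20% per session, floor of 1:
projection = projected_remaining(2.0, 1.0, 0.8, 10)
```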

Why Not Fewer?

Training will be a fresh source of confusions because:

  1. Combinatorial explosion — 120 unique units give C(120, 2) = 7,140 possible pairings. We've tested maybe 50 specific interactions. Self-play will find the other ~7,090.

  2. The reading layer is genuinely hard — predicting opponent plans from incomplete information is where even the designer fails. Teaching an AI to generate strategic surprise against the person who designed every card is a qualitatively different challenge from implementing mechanics correctly.

  3. Evokers is deeply contextual — heuristics break down fast with 120 unique cards. The AI can't rely on pattern matching; it must reason from specific board states.
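The pairing arithmetic in point 1 is a one-liner worth checking:

```python
# Check the pairing count from point 1: 120 unique units
# give C(120, 2) unordered pairings.
import math

pairings = math.comb(120, 2)  # 7140 possible unit pairings
untested = pairings - 50      # roughly what self-play still has to explore
```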

The Key Insight

The confusion trajectory tells a story about what kind of knowledge is hard to transfer:

  • Mechanical rules (phases 1-2): ~55 confusions (the 47 logged so far plus the ~5-10 expected to finish the engine). Hard but finite. Once you get timing and targeting right, it stays right.

  • Structural encoding (phase 3): ~10 confusions. Software engineering, not game knowledge.

  • Emergent interactions (phase 4): ~25-35 confusions. The long tail. Every pair of cards is a potential surprise.

  • Strategic reasoning (phase 5): ~10-15 confusions. The hardest per-confusion, but the fewest in number — because at this level, the AI is learning to think, not learning rules.

The prediction is falsifiable. If we hit 150+ confusions, it means the engine had more hidden bugs than expected or the strategic layer is harder to learn than theorized. If we finish under 80, it means self-play is a more efficient teacher than human correction.

The Coincidence

If the prediction holds at ~100 total confusions, that number roughly matches the count of unique units in Evokers. The full card pool is 120 units + 40 familiars = 160 cards, but only the 120 unique units drive the mechanical complexity; the familiars are simpler variants tied to their parent demons.

So roughly: one confusion per unique card in the game. Each card taught Claude something it got wrong — a timing interaction, a targeting rule, a strategic misread. The game's complexity is literally measured in the number of mistakes an AI makes learning it.

This feels less like coincidence and more like a property of the design. Evokers has 120 unique cards with no duplicates and no generic effects. Every card is a special case. An AI that has been corrected once per card has, in some sense, been personally introduced to every card in the game.

Prediction made at confusion #47 on 2026-04-02. Check back at #100.