Prompty — Wizard Results

3 models judged

total cost · $0.0080

Consolidation Analysis

Meta-evaluation · 3 models compared against ideal responses

AI evaluation completed

Recommended Winner

ANT claude-haiku-4-5

Ideal Match 80/100

Captured the net-45 term, two-quarter ROI condition, SOC 2 / no-audio constraints, and Feb 1 go-live implications more completely than the other two models, with cleaner evidence fidelity.

Complete Rankings

claude-haiku-4-5

gpt-5.4-nano

gemini-3.1-flash-lite

Dimension Scores · Winner

claude-haiku-4-5

Task Completion

Accuracy

Format Compliance

Completeness

Precision

Strengths

Comprehensive coverage of stakeholders, risks, and next actions with appropriate evidence
Correct stance assignment with neutral handling for conditional supporters

Areas to Improve

Evidence quotes occasionally exceed the length limit, a minor exact-match issue
Did not convert the "2–3 weeks" timeline into the due_in_days field
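
The missed conversion could be handled by a small post-processing step. The sketch below is one possible approach, not the tool's own logic: the function name, the unit table (including 30 days per month), and the choice to use the upper bound of a range are all assumptions, since the page does not say how due_in_days should be derived from a range.

```python
import re

# Hypothetical unit table; the accepted units and the 30-day month are assumptions.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}

def due_in_days(text, use_upper_bound=True):
    """Return a day count for phrases like '2–3 weeks', or None if nothing matches."""
    # Matches a single number or a range written with an en dash or a hyphen.
    match = re.search(r"(\d+)(?:\s*[–-]\s*(\d+))?\s*(day|week|month)s?", text.lower())
    if match is None:
        return None
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    # Assumption: a range collapses to its upper bound unless the caller opts out.
    count = high if use_upper_bound else low
    return count * DAYS_PER_UNIT[match.group(3)]
```

Under these assumptions, "2–3 weeks" maps to 21 days with the upper bound and 14 days with the lower bound; which one the ideal response expects is not stated on this page.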

Auto-Improve Loop

Iteratively rewrites the prompt based on judge feedback — plan → generate → test → judge → repeat.
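
The cycle described above can be sketched as a simple loop. This is an illustrative outline only: the four callables (plan, generate, test, judge), the target score of 90, and the five-round cap are hypothetical placeholders, not Prompty's actual interface or defaults.

```python
def auto_improve(prompt, plan, generate, test, judge, target=90, max_rounds=5):
    """Repeat plan -> generate -> test -> judge, keeping the best-scoring prompt."""
    best_prompt, best_score = prompt, float("-inf")
    feedback = None  # no judge feedback exists before the first round
    for _ in range(max_rounds):
        outline = plan(best_prompt, feedback)   # decide what to change
        candidate = generate(outline)           # rewrite the prompt
        outputs = test(candidate)               # run the candidate prompt
        score, feedback = judge(outputs)        # score it and collect feedback
        if score > best_score:
            best_prompt, best_score = candidate, score
        if best_score >= target:
            break                               # good enough, stop early
    return best_prompt, best_score
```

Keeping the best candidate rather than the latest one means a round that scores worse does not discard earlier progress.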

Three models. One winner. With evidence to back it up.