Meta-evaluation · 3 models compared against ideal responses
AI evaluation completed
Recommended Winner
ANT
claude-haiku-4-5
Ideal Match: 80/100
Most comprehensively captured the net-45 term, two-quarter ROI condition, SOC 2 / no-audio constraints, and Feb 1 go-live implications — with cleaner evidence fidelity than competitors.
Complete Rankings
1. claude-haiku-4-5: 80
2. gpt-5.4-nano: 68
3. gemini-3.1-flash-lite: 62
Dimension Scores · Winner
claude-haiku-4-5
Task Completion: 90
Accuracy: 78
Format Compliance: 85
Completeness: 92
Precision: 72
Strengths
Comprehensive coverage of stakeholders, risks, and next actions with appropriate evidence
Correct stance assignment with neutral handling for conditional supporters
Areas to Improve
Evidence quotes occasionally exceed length limit — minor exact-match issue
Missed converting "2–3 weeks" timeline into due_in_days field
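To illustrate the second gap, here is a minimal sketch of turning a phrase such as "2–3 weeks" into a day count. The function name due_in_days mirrors the field name above, but the rule itself is an assumption: this sketch takes the upper bound of a range, and the evaluated task may define the conversion differently (lower bound, midpoint, or business days).

```python
import re
from typing import Optional

# Days per unit; only days and weeks are handled in this sketch.
_UNIT_DAYS = {"day": 1, "week": 7}


def due_in_days(text: str) -> Optional[int]:
    """Return a day count for phrases like '2–3 weeks' or '10 days'.

    Assumption: for a range, the upper bound is used.
    Returns None when no duration is found.
    """
    # Matches "N unit" or "N-M unit", accepting a hyphen or an en dash.
    m = re.search(r"(\d+)\s*(?:[–-]\s*(\d+)\s*)?(day|week)s?", text)
    if m is None:
        return None
    low, high, unit = m.group(1), m.group(2), m.group(3)
    return int(high or low) * _UNIT_DAYS[unit]
```

Under the upper-bound assumption, "2–3 weeks" maps to 21 days.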
Auto-Improve Loop
Iteratively rewrites the prompt based on judge feedback — plan → generate → test → judge → repeat.
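The loop described above could be sketched roughly as follows. This is not the product's implementation: auto_improve, its callables (generate, judge, rewrite), and the max_rounds and target parameters are all hypothetical names chosen for illustration, and the generate and test steps are collapsed into a single callable.

```python
from typing import Callable, Tuple


def auto_improve(
    prompt: str,
    generate: Callable[[str], str],             # runs the prompt, returns model output
    judge: Callable[[str], Tuple[float, str]],  # returns (score, feedback)
    rewrite: Callable[[str, str], str],         # plans a revised prompt from feedback
    max_rounds: int = 3,
    target: float = 90.0,
) -> Tuple[str, float]:
    """Return the best-scoring prompt seen and its score."""
    best_prompt, best_score = prompt, float("-inf")
    for _ in range(max_rounds):
        output = generate(prompt)        # generate and test
        score, feedback = judge(output)  # judge
        if score > best_score:
            best_prompt, best_score = prompt, score
        if score >= target:
            break                        # good enough, stop early
        prompt = rewrite(prompt, feedback)  # plan the next revision
    return best_prompt, best_score
```

Tracking the best prompt separately means a revision that scores worse than its predecessor does not overwrite an earlier, better result.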
Three models. One winner. With evidence to back it up.