AI Math Problem-Solving Still Falls Short of Human Experts in Rigorous First Proof Test
The First Proof project tested four AI systems on ten original, unpublished research-level math problems that mathematicians created for this purpose. None of the problems had appeared in any model's training data, and the solutions were scored by anonymous expert reviewers from the relevant fields. The AI responses contained frequent hallucinations and cited no literature at all. The evaluation confirmed that current reasoning models cannot yet match top human mathematicians. It was the first assessment to satisfy three key standards at once: frontier math problems, no training data leakage, and expert human evaluation.