New Math Test Shows People Still Top AI for Proofs

A tougher research benchmark found humans ahead of AI in frontier math, while still revealing how useful models are becoming.

"Humans still beat AI on brutal new math benchmark" is the kind of headline that lands hard after months of triumphal AI chatter. Just weeks after AI helped solve an 80-year-old Erdős problem, a tougher result arrived and forced a reset: the machines can absolutely impress, but they still do not own frontier mathematics.

That whiplash is the real story. Not that AI is fake, and not that humans are untouchable forever. The point is that a lot of AI math hype has been built on isolated spectacular moments. Then First Proof arrived, closed the obvious loopholes, and asked models to survive a full season instead of one dazzling highlight reel.

The problem with AI math hype

The annoying thing about AI headlines is they can be true and still leave you with the wrong picture.

Nature recently reported that an OpenAI model helped solve the unit-distance problem, an 80-year-old question posed by Paul Erdős. That is objectively wild. Erdős published more than 1,500 papers and left behind open problems mathematicians have wrestled with for decades. If AI helps crack one of those, people are going to lose their minds.

They did. OpenAI's Sébastien Bubeck called it potentially the first autonomous AI result in any field of research. Tony Feng of Berkeley posted that it was incredible. He was right.

It was also not proof that AI can do research math reliably.

That is the part people keep skipping. One amazing solo does not make a system consistently elite. The boring question, which is usually where reality lives, is whether a model can do this repeatedly, under controlled conditions, with no leaked training data, no hidden steering, and no backstage help. That is what First Proof tried to measure.

The top AI system scored 6 out of 10.

That number matters less as a dunk than as calibration. Tech history is full of flashy demos that get mistaken for prophecy. Then someone runs a cleaner test and reality walks back into the room.

The earlier breakthroughs were not fake. But too many people turned a few spectacular wins into a whole theory of intelligence. That is not science. That is vibes with a seed round.

Why the First Proof benchmark matters

What makes the First Proof benchmark interesting is exactly what makes it bad for hype.

It used 10 previously unpublished research-level math problems. Not textbook exercises. Not olympiad staples. Not problems the model likely saw in slightly altered forms during training. According to Nature, the questions came from 10 researchers who had solved them in their own work but had not yet published them.

That one design choice kills a lot of benchmark theater.

Too much AI evaluation still boils down to a model solving internet problems after being trained on the internet. First Proof tried to remove that shortcut.

Then there was the grading. Thirty mathematicians vetted the answers, with anonymous specialists formally reviewing submissions. That is a huge difference. In software, something often counts as working if it survives a demo. In math, a proof survives expert scrutiny or it does not.

Nature described First Proof as the first benchmark to combine three things at once: research-level problems, problems absent from training data, and formal evaluation by mathematicians. That is why this result hit differently. It was not another benchmark built to generate a chart for social media. It felt like a test built by adults.

First Proof also published the boring but important materials: results, solutions, logs, and referee documents. More of that. Less cinematic launch energy. More receipts.

The whole setup felt less like giving a chatbot a cute challenge and more like dropping it into a serious lab meeting with no prep and a room full of professionally skeptical experts.

So yes, "Humans still beat AI on brutal new math benchmark" is a strong headline. But it reads less like a victory lap than a long-overdue score correction. We finally got a test that was hard to game.

A 6 out of 10 is still a big deal

This result is going to be misread from both directions.

If you are an AI maximalist, 6 out of 10 is awkward because it is not the clean sweep you wanted. If you are an AI skeptic, six correct solutions on unpublished research-level problems should still make you pay attention. That is not trivial.
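One way to see why both readings overreach is to ask how much a 6-out-of-10 score can pin down at all. The sketch below computes a Wilson score interval for that result. It rests on an assumption the benchmark itself does not make: that the ten problems behave like independent draws from some larger pool of comparable problems. The function name and the 95% level are illustrative choices, not anything from the First Proof report.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence. Treating benchmark
    problems as independent trials is an assumption made for illustration.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    low, high = wilson_interval(6, 10)
    print(f"6/10 solved: plausible success rate between {low:.2f} and {high:.2f}")
```

Under those assumptions the interval runs from roughly 0.31 to 0.83. Ten problems are enough to rule out "useless" and "flawless," and not much else.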

According to Scientific American, many of these were not giant headline-ready puzzles. They were often lemmas, smaller theorems that help unlock bigger results. That actually makes the benchmark feel more realistic. Real research is often ugly intermediate work, not one glorious final proof.

Mohammed Abouzaid of Stanford said the problems were chosen to require some originality. That matters because it pushes back on the lazy claim that this is all just memorization. The benchmark was designed to force systems beyond remixing familiar patterns.

So no, AI did not overtake top human mathematicians here. It did not dominate. It did not settle the debate. But if a machine can solve six of ten unpublished research-level tasks under serious evaluation, we are well past party trick territory.

That is the strange middle zone now coming into view. Not magic. Not AGI fan fiction. Just partial usefulness becoming scientifically and economically important.

Nature's broader reporting on math and physics points the same way: AI can help check proofs line by line, search for counterexamples, and propose intermediate steps. Those sound like modest tasks until you have actually done hard intellectual work. Then you realize those are exactly the tasks that consume days.

If a tool saves an expert ten hours a week by catching mistakes faster, exploring dead ends early, or producing one useful lemma, that already changes the economics. You do not need robot Gauss descending from the heavens.

That middle zone is where this gets real.

Expert review was the brutal part

The hardest part of frontier math is not generating something that looks smart. It is surviving contact with people whose job is to find where the argument cheats.

That is what made First Proof especially interesting. It did not just ask models to produce plausible reasoning. It pushed them into proof-verification territory, where the answer had to hold up under hostile inspection.

Submissions had to be in compilable LaTeX. That matters. Not vibes. Not a claim that the model appears to reason. LaTeX.
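For readers who have never written one, here is roughly what the smallest possible compilable LaTeX proof looks like. This is a toy sketch, not First Proof's template: the lemma is deliberately trivial, and the document class and packages (`article`, `amsmath`, `amsthm`) are assumptions about a typical setup, not requirements stated in the report.

```latex
\documentclass{article}
\usepackage{amsmath,amsthm}

% Hypothetical theorem environment; real submissions may use other names.
\newtheorem{lemma}{Lemma}

\begin{document}

\begin{lemma}
If $a$ and $b$ are even integers, then $a + b$ is even.
\end{lemma}

\begin{proof}
Write $a = 2m$ and $b = 2n$ for integers $m$ and $n$.
Then $a + b = 2(m + n)$, which is even.
\end{proof}

\end{document}
```

The point of the format is that every claim sits in a fixed place where a referee can attack it. There is nowhere for a vague gesture to hide.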

The proofs were judged by anonymous specialists in the relevant fields. Not generalists. Not fans. Specialists trained to inspect each step with suspicion.

There was also a 24-hour runtime cap on standard cloud machines, according to the First Proof report. That matters because it limits brute-force nonsense. You cannot just burn absurd compute for a week and then call it elegant insight.

OpenAI's own write-up on its First Proof submissions included the most important sentence in the whole episode: some proofs that looked promising did not hold up under expert review. That kind of honesty is useful. Less talk about models thinking. More talk about where the argument broke.

Scientific American made the same point more broadly: proof attempts that look convincing can collapse once specialists inspect the logic carefully. A proof is not a pitch deck. It does not get partial credit for sounding persuasive.

That is why this benchmark matters. It does not just test generation. It tests survivability.

AI may become valuable before it becomes superior

The whole "AI vs. mathematicians" framing is already getting stale.

The more interesting possibility is that AI becomes genuinely valuable long before it becomes clearly better overall. Nature's reporting suggests these systems can already suggest useful auxiliary results and bridge gaps in arguments. That is not replacement. That is a strange collaborator.

And there are signs this weirdness can be productive. Nature reported that Aristotle, a system from Harmonic, helped solve several Erdős problems. Axiom Math has also claimed its tool found solutions to research-level problems professionals had not solved. Startup claims always deserve caution, but the direction is hard to miss.

One detail from the First Proof reporting stands out: at least one stochastic PDE solution reportedly impressed referees with a novel approach. That is the spicy part. Humans still won overall, but if experts are seeing moves that feel genuinely new, this is not just autocomplete in formalwear.

Nature also highlighted a case involving Liam Price, not a professional mathematician, who used ChatGPT to help make progress on Erdős problem #1196. Jared Duker Lichtman compared it to AI discovering a new chess opening that humans had overlooked because of aesthetics and convention.

That comparison lands. Sometimes expertise is a superpower. Sometimes it is inertia. Knowing the canon can help you see deeper, but it can also trap you inside what everyone agrees is elegant. A machine does not care about elegance unless people teach it to care. That can be a bug. It can also be a source of novelty.

If the next few years go a certain way, AI will not routinely prove the biggest theorems end to end. It will handle the ugly middle work: checking proofs, generating candidate lemmas, stress-testing conjectures, finding counterexamples, and surfacing cross-field ideas a tired human might not try.

That is not sexy enough for the AGI crowd.

It is also probably where the value is.

The real human edge may be taste

If humans still have an advantage in frontier math, it may not be because they will always be better at symbol pushing.

The stronger case is taste.

The hardest part of serious work is often deciding what is worth proving, what is fertile, what is elegant, and what is likely to open a door instead of becoming months of beautifully useless effort. Nature essentially says this outright: the most creative parts of mathematics still involve humans deciding what is interesting.

That feels right beyond math too. Execution gets attention because it is visible. Taste is quieter. Choosing the right problem before everyone else. Knowing which ugly direction hides real leverage. Feeling that one conjecture is dead and another is alive. That is not just IQ. It is judgment, intuition, obsession, aesthetics, and timing.

Math is a good place to expose this because it is unusually friendly to automation. As Nature noted, mathematics and theoretical physics have cheap, fast, digital experiments. No wet lab delays. No telescope queues. If humans still keep an edge here, in one of the cleanest environments for machine assistance, that edge says something.

And First Proof seemed to understand that. The interesting question is not just whether AI can beat mathematicians. It is whether AI can become useful to mathematicians by checking proofs, acting like a research assistant, and maybe eventually solving some problems autonomously.

That framing is smarter. Less gladiator arena. More actual progress.

There is also a credibility angle worth noting. According to the First Proof site, board members will not accept paid engagements from AI companies while serving, and the foundation publishes financial reports. In a field drowning in incentives to oversell, credibility is part of the product.

So yes, humans still beat AI on a brutal new math benchmark.

But the deeper message is more uncomfortable than either side wants to admit. If machines keep improving at proving, while humans keep more of the taste, direction, and standards, then being the smartest person in the room was never the whole job.

Maybe the last moat is not intelligence.

Maybe it is judgment.
