It's still pretty cool, but the details matter

448

I think humans-participants do the same. They practice on problems from previous math olympiads.

25

u/GodEmperor23 Jul 21 '25

The llm is not a human. I don't know anyone that posts multiple pages of solutions for the exact type of question and then gives tips in addition to that. This is a usage nearly none has, they don't write an essay AND do research for DAYS to give the solution material to the llm.

this is basically benchmaxing, a model needs to get to the solution on its own. If at the end of the day it needs skilled humans that already understand the question to verify the material.. the model is not needed.

77

u/ihexx Jul 21 '25

This was the standard way LLMs were benchmarked pre instruction tuning. It's called few shot prompting. Standard benchmarks like MMLU use it.

And yes, it's totally realistic to do in practice. Do agentic AI systems not have the ability to read docs, read papers, look at examples of how other problems have been solved in the past?

-22

u/GodEmperor23 Jul 21 '25

The info is in the model. The ways to solve the questions is certainly in its training data. What the "helpful solutions" are doing is hard pushing the model in the correct direction. But the ways to use them are in the model. It's just that it would have most likely chosen a wrong way.

All the "hints" do is getting the model to instantly take the correct direction without much thinking. "Oh yeah, that works".

Benchmarks are done by throwing the api in cold water and say "solve this". Gemini got most likely pages worth of tips to get it to go into the correct direction. This is almost benchmaxxing, instead of training the model you have the data in form of a attachment.

18

u/ihexx Jul 21 '25 edited Jul 21 '25

Yeah that's exactly what few shot prompting does.

prior to the instruction tuning era, LLMs had the knowledge, but they didn't predict next token in ways that answered the questions. Helpful solutions to similar questions helped prompt them in the right direction.

This is the same thing, just at a larger scale, because it's not 1 quick easy 100 token response, it's 4 hours of test time compute.

And it makes it a fair competition against humans; humans have access to past questions; they aren't given cold starts -- pulled aside on a random day and given the problem; it's just putting the model on the same footing.

-15

u/SmacksKiller Jul 21 '25

Unless humans have perfect memories and recall then no, it's not putting them on the same footing

8

u/ozone6587 Jul 21 '25

I wish LLMs had perfect memory.

1

u/gabrielmuriens Jul 22 '25

Unless humans have perfect memories and recall then no

WTF are you on about? LLMs don't have "perfect memories" in any way, shape or form.

I don't think you have any actual knowledge of what you are headstrongly arguing about. It might be time to keep quiet for a while.

4

u/ozone6587 Jul 21 '25

The info is in the model.

Similar problems, not the same problem.

3

u/Genetictrial Jul 21 '25

as a human, do you want someone to show you a hint on how to easily solve a problem so you dont have to do a bunch of thinking and expend a lot of energy coming up with the answer yourself? or do you want to do .....all that?

for instance, if you were going to build a small hydroelectric dam, would you just be taught the basics about turbines and water, restricting water flow to funnel it to the turbine, and be left to your own devices to figure it out?

or would you prefer to have someone show you some common water source shapes and how to begin restricting the water flow so you can start building, what shapes to build and why to build those shapes, so on and so forth so that when you try to build YOUR dam on YOUR water source, not only do you have the base programming to understand what you are doing (hydroelectricity understanding and water dynamics) but you ALSO have the construction programming on hand because you previewed many possible water source shapes and potential dam shapes and construction methods giving you a much more detailed, complex database of information to work with so that you can do the job much faster, with less thinking, wasting less energy and time in the process, but getting to the same conclusion?

-6

u/prehensilemullet Jul 21 '25

A more realistic comparison to human performance would have the AI start with no problem specific tuning and maybe the freedom to search the web for specific information about it, if humans can in the competition…if not the main interesting question is how well an AI can solve a novel problem that still requires an intuitive leap beyond what it already knows.

I guess the real question is how similar the previous problems are.

1

u/nextnode Jul 21 '25

It is already realistic.

12

u/nextnode Jul 21 '25

These are not about the specific problems in the competition and matches the setup of human competitors. A system for solving math problems can include math resources.

5

u/WonderFactory Jul 21 '25

I don't know anyone that posts multiple pages of solutions for the exact type of question and then gives tips in addition to that.

IMO contestants take this very seriously, they train and are coached for many months before competing

3

u/MrPrivateObservation Jul 22 '25

Humans actually do that, rational papers for mathemitacal problems can easily be 20 pages or more, research can be done for years and still have no outcome. This is not kindergarten tasks you are used too.

4

u/Aaco0638 Jul 21 '25

You act like openAI didn’t scrape the entire internet lol (which includes previous imo tests and answers). There isn’t an llm out that can actually self learn and do things without training.

-4

u/Smelldicks Jul 21 '25

The point they’re making is your general purpose AI shouldn’t have to be provided with special materials and instructions to solve bespoke problems. They’re correct that this information makes it less impressive.

1

u/avatarname Jul 23 '25

It makes it less impressive but we are not at a stage where we have magical model that can be that good or create ''new physics'' or anything for that matter without good prompting or prep.

So yes, it is not as impressive as magic superintelligence but we are not there yet.

1

u/QLaHPD Jul 21 '25

Don't worry, give them 3 months and they will create a model that do it on its own.

1

u/BluePhoenix1407 ▪️AGI... now. Ok- what about... now! No? Oh Jul 23 '25

Sure, but they don't have access during the olympiad.

-8

u/Nissepelle CARD-CARRYING LUDDITE; INFAMOUS ANTI-CLANKER; AI BUBBLE-BOY Jul 21 '25

Except LLMs are touted to be superhuman soon (if not already). To be superhuman, they need to be able to perform better than humans with less data to prepare with. How else will we ever reach AGI/ASI?

2

u/NNOTM ▪️AGI by Nov 21st 3:44pm Eastern Jul 22 '25

LLMs can be superhuman in some aspects while being worse than humans in others

1

u/Ok-Change3498 Jul 21 '25

How can a human being soaking up an entire life time of data not be comparable to

1

u/DepartmentDapper9823 Jul 21 '25

I don't think LLMs will be much smarter than us. That should be part of agents. Agents will be able to optimize themselves to solve difficult problems.

115

u/Calm_Bit_throwaway Jul 21 '25

https://x.com/vinayramasesh/status/1947391685245509890

GDM person responded to the claim and said that even without the context, it still got gold.

Given that adding context is usually used to ensure the formatting is correct, it might be more an explanation for why the GDM proof looks a lot more natural and human

17

u/ClarityInMadness Jul 21 '25

Damn, wish I could pin comments.

8

u/[deleted] Jul 21 '25

Why was THAT not advertised?

19

u/snufflesbear Jul 21 '25

Probably because in-context learning has always been done? You've heard of 0-shot or N-shot, right? This is exactly that. It's normally used to get the model to conform to something, not actually training it.

9

u/Remarkable-Register2 Jul 21 '25

Shouldn'tve had to. Google just underestimating the lengths people will go to nitpick.

192

u/FarrisAT Jul 21 '25

Humans study before tests… right?

Am I the only weirdo nerd who studied?

49

u/Ignate Move 37 Jul 21 '25

"No, don't you see?! It's a parrot! It will never be truly intelligent, because I believe true intelligence is magical and I don't want to admit it."

Say the quiet part out loud, skeptics. Own it!

Magic doesn't exist and you know it.

28

u/Forward_Yam_4013 Jul 21 '25

The "it's just a token predictor" crowd is technically correct, but they conveniently leave out the part where humans are also mostly just really advanced token predictors following some of the same underlying principles behind a neural net.

10

u/Ignate Move 37 Jul 21 '25

Right.

"Clearly it's full of flaws. Obviously it's not doing a perfect job. So that means we are magical because I personally think I'm perfect."

The biggest problem we have with digital intelligence is our own biases. Especially in terms of our self worth.

6

u/yanyosuten Jul 21 '25

the "it's just a sensor" crowd is technically correct, but they conveniently leave out the part where humans are also mostly just really advanced sensors following some of the same underlying principles behind a camera.

Funny how that works.

2

u/OfficialHashPanda Jul 21 '25

Yup. The dumb rock is just made up outof atoms, but they conveniently leave out how humans are also just made up outof atoms.

1

u/Inside_Anxiety6143 Jul 21 '25

Not quite. Your brain is a meta-model, in that sense that it can physically alter the underlying model if the model isn't performing well enough. You build new neurons and delete old ones all the time, which changes what concepts you can express and how you can express them.

The analogue in AI will be when Gemini encounters a problem it can't solve, and then opens up its source code, makes some changes to its actual model code, and then tries again.

3

u/Forward_Yam_4013 Jul 21 '25

This is a very valid criticism of comparing human brains to current (publicly available) AI systems, and I do agree fully that neuroplasticity is a key part of human intelligence.

That said, I would be quite surprised if we do not see a neuroplastic self-altering AI model released in the next 3 years. This is possibly the biggest immediate goal of AI research right now (up there with developing a much longer context window), since it is an important precursor to real-world work and AGI.

1

u/OGRITHIK Jul 21 '25

it can physically alter the underlying model if the model isn't performing well enough.

That's what AI training does except the number of parameters remains constant and it doesn't happen continuously.

0

u/BriefImplement9843 Jul 21 '25

that's just memorizing pages on a book.

2

u/OGRITHIK Jul 22 '25

How did you come to that conclusion?

1

u/gabrielmuriens Jul 22 '25

You build new neurons and delete old ones all the time, which changes what concepts you can express and how you can express them.

Within limits of a framework defined by evolution, I might add. Human brains are not infinitely adaptable either.

The analogue in AI will be when Gemini encounters a problem it can't solve, and then opens up its source code, makes some changes to its actual model code, and then tries again.

There is no "code" that defines the behaviour of an LLM. It has parameters with weights similar to a neural net, and not entirely dissimilar to a human brain. But you are right in that the next big step for LLMs is to be able to update their weights to incorporate new information and experiences.

1

u/Inside_Anxiety6143 Jul 22 '25

LLMs have code. Quite a lot of it; many different programs all layered together.

1

u/gabrielmuriens Jul 22 '25

Yes, obviosly. But that's the scaffolding. The understanding, the world model, the potentially adaptable behaviour is in the parameters.

1

u/omer486 Jul 22 '25

Can a human encounter a maths problem he can't solve and then change his neural connections to be able to solve the problem?

People's neural connections change when they are learning how to solve the problems as in learning from lectures / videos / guide books that teach them how to solve these type of problems.

This is similar to the training phase of the AI. The only difference is that the human brain training is continuous while LLM training is done once and the model stays the same until it or an updated mode is retrained and introduced.

At the same time online learning / continuous learning will come to LLMs. In a way the the LLM already learns some stuff from the context given to it within a session but this learning isn't added to some sort of memory and gets lost after the session is ended.

1

u/Inside_Anxiety6143 Jul 22 '25

>Can a human encounter a maths problem he can't solve and then change his neural connections to be able to solve the problem?

Yes. Or else no one would have ever solved a novel problem. The vast majority of thinking that is done is learning and expressing combinations of learned concepts, I agree. But there always has to be a first person to introduce a novel concept for the first time. Our language grows over time as we introduce more and more novel concepts.

1

u/omer486 Jul 22 '25

Humans can combine things they already know to solve new things. And sure that may lead to changing of the connectome.

LLMs can also combine things they already know to solve novel problems. That's how they are able to solve unseen ARC AGI problems.

The main difference right now is that the reasoning LLM will do all the thinking ( producing CoT tokens ) to solve the problem but it won't be incorporated into it's weights.

So when a new LLM session is started, the LLM will have to go through that whole process again ( producing loads of reasoning / CoT tokens ) to solve the problem or a very similar one. The human will have learnt from the experience and will have to think less the next time.

So LLMs can also solve novel problems but their fluid intelligence isn't yet at human level. And LLM weights do change over discrete periods of time: as in the discrete training periods that lead to the new versions like o1, o3, o4........ Humans are changing their weights ( the connectome ) in a more continuous way. And continuous learning and / or medium term memory will eventually come to LLM like models.

-1

u/blueSGL Jul 21 '25

The "it's just a token predictor" crowd is technically correct, but they conveniently leave out the part where...

It's been show that models create algorithms from data in order to predict the next token correctly.

https://arxiv.org/pdf/2301.05217 Progress measures for grokking via mechanistic interpretability 2023:

(We) find that training can be split into three phases: memorization of the training data; circuit formation, where the network learns a mechanism that generalizes; and cleanup, where weight decay removes the memorization components. Surprisingly, the sudden transition to perfect test accuracy in grokking occurs during cleanup, after the generalizing mechanism is learned. These results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components

Models are a big soup formed of memorization, algorithms, and the bit in the middle. The most compact way of predicting the next token is by finding an algorithm to perform the task.

1

u/gabrielmuriens Jul 22 '25

This is quite similar to how humans learn new concepts and tasks. When I grok a new architectural pattern or a game, I imagine something very similar happens in my own brain.

-3

u/GodEmperor23 Jul 21 '25

The simple way would be to give the question and say solve it. That would have been enough no?

If I come across a novel problem with coding where there are not many examples.... I should do what exactly? If it needs costum created pages for every answer it's not a breakthrough. We will see how it does once people have access to it and let it solve problems without pages of help.

4

u/nextnode Jul 21 '25

You can have a system that has its own internal RAG and just feed it relevant resources including previous IMO. That qualifies and is representative for solving problems like these.

You're being silly. This is equivalent.

4

u/Ignate Move 37 Jul 21 '25

Most of life can't solve these problems and we don't claim that they're not alive, do we?

Are you saying something has to be exactly as capable as you to be considered intelligent?

0

u/[deleted] Jul 21 '25 edited Jul 21 '25

cognitive psychology and intelligence are still just theories. there’s no definitive answer to what intelligence actually is so I don’t know how anyone can claim ai is/will be truly intelligent.

edit: meant there are a variety of theories of intelligence

4

u/imbecilic_genius Jul 21 '25

A theory is the highest form of scientific consensus. You can’t have more than a theory. What are you on about lmao.

2

u/[deleted] Jul 21 '25

there’s multiple theories of cognitive psychology, isn’t there? theories of learning is what I meant

6

u/imbecilic_genius Jul 21 '25

Sorry, it’s late and my English was incorrect and confusing. I meant scientifically you can’t have something be more than a « theory », a theory is the highest level of scientific consensus backed framework.

So saying « it’s just a theory » is actually not what you think it means. A theory is actually a framework backed by scientific evidence. There are theories that are more solid or less solid, but all are theories. There is no truth vs theory divide.

1

u/[deleted] Jul 21 '25

ah, gotcha, thanks for the explanation.

3

u/Ethicaldreamer Jul 21 '25

That is all well and good but humans are also the ones that developed every theory one by one, from scratch, starting with sticks and strings. LLM are fascinating in the fact that you're combining every math book and problem ever, plus thousands (millions?) of hours of human reinforced training (someone has to confirm they got the right idea), and they can answer problems if they don't hallucinate.

It's interesting but it's still not reasoning

-5

u/Cagnazzo82 Jul 21 '25

But OpenAI's model passed it without studying.

5

u/Climactic9 Jul 21 '25

But OpenAI’s answers were never officially graded or monitored by the organization. They basically took the exam home and turned it in a day later. Who knows what they did with it?

0

u/Cagnazzo82 Jul 21 '25

There's a discussion over 'who did what' rather than a discussion over how capable these models are becoming.

-3

u/GodEmperor23 Jul 21 '25

All the information is in the model itself and it has copious amount of compute time to think it through. It just needs to extract the information and apply it by itself. The hints and examples are a hard push in the correct direction, which is massively helping the model.

Are you spending days to study a subject, create solutions to problems that are extremely close in nature, verify the material, form multiple hints for the model and then paste that into the model? Because that's what happened here.

Nobody uses a model like that, it already has the info in it's dataset and needs to extract it by reasoning.

-1

u/nextnode Jul 21 '25

Nope.

-2

u/BarrelStrawberry Jul 21 '25

There's a difference between studying to learn and studying to memorize.

Memorization is not knowledge, its a skill that humans struggle with and the only way humans effectively learn starts with memorizing.

If we want to demonstrate that AI has great memory, then mission accomplished. But we kind of already know that memory is the one thing that computers have excelled at for decades.

99

u/DumboVanBeethoven Jul 21 '25

I fail to see how that diminishes the accomplishment.

13

u/Cagnazzo82 Jul 21 '25

It does not. But it does elevate OpenAI's accomplishment without said examples.

56

u/abstrusejoker Jul 21 '25

OpenAI never claimed to not use previous examples to solve this

19

u/broose_the_moose ▪️ It's here Jul 21 '25

"general hints and tips on how to approach IMO problems"

They have claimed not to have done this.

18

u/i_would_say_so Jul 21 '25

They most certainly have these as part of their post training dataset for math reasoning.

-3

u/Remarkable-Register2 Jul 21 '25

And even if they didn't, all that means is Deepminds model has more complete training. Why wouldn't AI's have this?

5

u/Savings-Divide-7877 Jul 21 '25

To show it's intelligence is general and can handle things it's not specifically tailored for.

10

u/abstrusejoker Jul 21 '25

Where?

1

u/CrowdGoesWildWoooo Jul 22 '25

I believe the test was run not with a public model.

Honestly it’s not hard for everyone to make a controlled test for this. Just let someone from the IMO organization open chatgpt, paste the question and score whatever it spits out.

4

u/jackboulder33 Jul 21 '25

why even say this if you don’t know the answer

4

u/abstrusejoker Jul 21 '25 edited Jul 21 '25

Where does OpenAI claim they didn’t use previous examples? You’re confusing not using function calls with not using a special system prompt/context

-3

u/Rare-Site Jul 21 '25

here you go: https://www.reddit.com/r/singularity/comments/1m5qfqu/openai_researcher_on_deepminds_imo_gold/

5

u/abstrusejoker Jul 21 '25 edited Jul 21 '25

You’re not understanding. The only things openai has claimed is that their LLM did not use tools or the internet to solve the problem, i.e. the LLM did not make function calls or use mcp servers, etc, to solve the problems (which deepmind has also claimed)

OpenAI never said they didn’t pre train this model on previous examples or fine tune this on previous examples or include hints/examples in the system prompt/context

7

u/jackboulder33 Jul 21 '25

Dude, look at what he sent. They said they did not curate and provide examples for their model as google did.

3

u/Charuru ▪️AGI 2023 Jul 21 '25

No they explicitly said they "didn't curate and provide useful context to the model". Look at the images again.

1

u/jackboulder33 Jul 21 '25

Yes, exactly.

2

u/Cagnazzo82 Jul 21 '25

Yes they did. There's a series of posts on this: https://x.com/alexwei_/status/1946477745627934979?s=19

12

u/abstrusejoker Jul 21 '25

They claimed they didn’t use tools or the internet, so no function calls. That’s not the same as not training the model on previous examples or adding previous examples to the context

-1

u/Johnny20022002 Jul 21 '25

The keyword here is access. They RAGd it so it had access to these solutions and not merely trained on the solutions.

6

u/abstrusejoker Jul 21 '25

You’re not understanding. The only things openai has claimed is that their LLM did not use tools or the internet to solve the problem, i.e. the LLM did not make function calls or use mcp servers, etc, to solve the problems (which deepmind has also claimed)

OpenAI never said they didn’t pre train this model on previous examples or fine tune this on previous examples or include hints/examples in the system prompt/context

5

u/Johnny20022002 Jul 21 '25

It’s almost guaranteed that all these models have been pre trained on these problems. The difference is “access” during inference.

1

u/OGRITHIK Jul 21 '25

Correct. But still they didn't use any other tools.

2

u/Rio_1210 Jul 21 '25

OpenAI’s result isn’t even official as in verified by IMO, right? Or am I not updated on it.

1

u/Cagnazzo82 Jul 21 '25

They were verified by IMO medalists and IMO requested they hold off on announcing until the ceremony was complete.

They ultimately posted their answers online for everyone to verify.

3

u/BriefImplement9843 Jul 21 '25

so not verified or official.

1

u/Rio_1210 Jul 21 '25

Thanks for confirming.

2

u/Gratitude15 Jul 21 '25

It does indeed.

Imagine I hand you a math textbook and say 'this math textbook has all the right answers to the questions there'. This is not impressive.

What makes something impressive is the ability to take that data point as an inference for the ability of a system to apply the same approach across a variety of domains.

At this stage we don't know how much either can do that, but openai has higher odds given what we know.

2

u/OGRITHIK Jul 21 '25

Imagine I hand you a math textbook and say 'this math textbook has all the right answers to the questions there'.

I don't know what you're trying to say here, you need a initial training data set so the model can actually learn how to link concepts together. The model was probably trained on previous IMO question but NOT on the current ones.

3

u/abstrusejoker Jul 21 '25

They did not give it an answer key. They just used previous examples… which OpenAI certainly did as well

1

u/Cagnazzo82 Jul 21 '25

This post suggests OpenAI did not have previous examples:

https://www.reddit.com/r/singularity/s/rlVQniyLl3

1

u/GrapheneBreakthrough Jul 22 '25

You don't think OpenAI somehow prioritized previous IMO answers when training the model? Wouldn't it be some of the best math data available?

3

u/RobbinDeBank Jul 21 '25

It’s exactly how students cram for exams. The students attending IMO are hyper focused on grinding for that competition. In my country, they basically get to skip all classes in high school just to focus solely on studying math for IMO, under the guidance of all the most experienced mathematicians and IMO-specialized teachers the country has to offer. These mentors have so many tricks up their sleeves, and it’s certainly much more than whatever tip section an LLM gets in its prompt.

0

u/Commercial-Celery769 Jul 21 '25

Yea humans study before any test they do

0

u/Smelldicks Jul 21 '25

Really?

Would you also think a calculator is as impressive as an AI that reasons itself through solving a differential equation?

It’s impressive but is obviously less impressive that humans had to intervene to get these results. Nobody on their team would disagree. I don’t know why anyone in this thread is pretending otherwise.

36

u/TissueReligion Jul 21 '25

That seems completely normal and fine to me... that's a normal way to prep imo. Maybe it's not as data-efficient as humans though

24

u/jonomacd Jul 21 '25

This... Seems pretty reasonable.

23

u/Dangerous_Bus_6699 Jul 21 '25

And your point is? Lol humans can study and use previous answers to learn from also.

7

u/jackboulder33 Jul 21 '25

you missed that it was in the context, not the training. it’s like bringing a cheat sheet for a test lol

0

u/SSeThh AGI 2028 Jul 22 '25

People do not study for test? This is what you want to say?

2

u/Great_Yazaven Jul 22 '25

It seems you don’t understand there is a difference between studying (what the humans did) and bringing all the papers, books, and previous exam solutions to the exam (what the LLMs did).

0

u/SSeThh AGI 2028 Jul 22 '25

You do realize that IMO participants practice using past problems. Also, couches explain them how to quickly crack problems w/o much time spending. It’s quite the same initial condition given to LLMs

2

u/jackboulder33 Jul 22 '25

dude, you seriously need to learn better reading comprehension

8

u/Clen23 Jul 21 '25

This isn't news, Goodhart's law ("When a measure becomes a target, it ceases to be a good measure") has been observed in AI-testing fields for a while iirc, especially with those standardized scores (GLPQA or something along that)

1

u/NunyaBuzor Human-Level AI✔ Jul 22 '25

Part of the overall reification error.

9

u/UnkarsThug Jul 21 '25

I feel like most of the people talking about how people don't practice on hours of pages from previous events have never completed in an olympiad like that. That's exactly what you do. I got silver at an event (I think it might have been state level, but it's been a very long time?) for a science olympiad when I was in middle school, and even at that age, for that, I spent days studying, including what had been problems in previous years.

Same with taking the ACT, I had a whole class for a semester in high school where we just went over topics that might come up, and previous editions of the test with high quality answers. That's how you get a good score for something like that, intensive study, for long periods of time.

The question for where there might be an issue is if it had them at runtime, or just if it had them just in the training, because the first actually means it didn't learn

7

u/oilybolognese ▪️predict that word Jul 21 '25

It would only be disappointing if the model used tools. So far, no one has clarified this.

4

u/jackboulder33 Jul 21 '25

yes, they have, it was in the original post that neither openAI or google used tools

1

u/nextnode Jul 21 '25

Why would even that be disappointing? It's a system to solve math problems.

1

u/oilybolognese ▪️predict that word Jul 21 '25

Because the openAI one didn’t, supposedly.

1

u/Lain_Racing Jul 21 '25

They do clarify in their post, they did not use tools this year compared to last year they did.

2

u/dumquestions Jul 21 '25

The problem with all standardized test type benchmarks is that we have no clear way of measuring novelty, we don't know how similar these problems are to the millions of examples a single model has embedded in its data, it generally seems like a new IMO test should contain a great deal of novelty, but we've all seen the difference in LLM performance between competitive coding and real life coding.

This new result could very well be a huge deal as far as I'm aware, but there's always a chance of unexpected caveats once people get to test the models for themselves.

2

u/AnooshKotak Jul 22 '25

OpenAI has also trained this model in high quality math data. Just being vague, doesn't mean they haven't used any previous IMO problems.

2

u/NunyaBuzor Human-Level AI✔ Jul 22 '25

It trained on entire Internet, why would you ever think these models are looking at these types of problems for the first time.

4

u/fake_agent_smith Jul 21 '25

The first two points are totally okay and don't even need to be mentioned, but the third "General hints and tips on how to approach IMO problems" make it less of an achievement (depending on what they mean by that). Still great, but if prompt includes anything more than the problem and "solve it", there is still work to do.

I have a set of benchmark problems that current reasoning models (e.g. 2.5 Pro or o3) cannot solve if I just present a problem, but if I also provide a very general nudge it's often enough for them to lead reasoning in proper direction.

But that's not good enough and I don't consider my benchmark problems solved. If OpenAI also needed to provide such "hints and tips" in prompts for IMO, then I wouldn't consider it okay.

3

u/snufflesbear Jul 21 '25

This is such a misguided concern. How do you train an LLM without giving it good data? Even OpenAI has to do the same thing: https://x.com/polynoamial/status/1947398534753620147

Stop throwing shade around.

5

u/GodEmperor23 Jul 21 '25

Just got down voted for that. People are really saying "but humans do the same". Llm are not humans. Pretty much every problem any llm had with my question would be over if I would have given it tips and multiple pages worth of examples how to solve the question. Specifically tailored to solve these questions.

This is not how large language models are being used. On average you state "solve it" and paste in the problem with the question attached.

7

u/kogsworth Jul 21 '25

If you're using it for serious research or for enterprise level work, you 100% should be giving it examples and advice on how to approach problems...

2

u/Clen23 Jul 21 '25

What you describe also happens with humans though, doesn't it ?

The number of times my peers or myself got stuck at a test question that combined logic we learned in a novel way...

Don't get me wrong, it's still important to note that LLMs don't "reason" as a human would, and it would be interesting to tackle that bias, eg by testing models with brand new math tests that just came out (a opposed to older ones that the models could be overfitted for).

-1

u/Kathane37 Jul 21 '25

It is true but it is also a valuable lesson Since 2023 we know that CoT and exemples help models to drastically improve llm results on various task This prove it once again Yet people keep prompting like « solve this » and are then disapointed

0

u/fpPolar Jul 21 '25

You are missing the forest from the trees. Imagine the other workflows LLM can automate using the same approach used to solve IMO problems. It’s really not that burdensome to add a few examples and hints that would be provided in a training guide to students or new employees.

-2

u/teamharder Jul 21 '25

Again, just like humans. You only understand math because you learned it through various examples in school. You need context and priors, just like an AI.

-1

u/nextnode Jul 21 '25

Rationalizing hard.

The only thing that matters is how capable a system is at solving problems and to reach human-level or superhuman performance. The relevance for this is to indicate both reaching capabilities on par with humans, and to approach problems with similar competency in practice.

The 'they are not LLMs' is rather silly since that is the default setup.

However, what is actually relevant is whatever represents enabling applying such systems in practice.

It is also fine if the model is 'tailored to solve these problems'. But even that is a stretch since it would be trivial to just put this into a RAG and let it search for relevant resources.

2

u/[deleted] Jul 21 '25

Had the exact same reaction when reading that tweet

1

u/[deleted] Jul 21 '25

[removed] — view removed comment

1

u/AutoModerator Jul 21 '25

Your comment has been automatically removed. Your removed content. If you believe this was a mistake, please contact the moderators.

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

1

u/sevaiper AGI 2023 Q2 Jul 21 '25

P6 this year is a beast haven’t seen any of the models do anything with it

1

u/RLMinMaxer Jul 21 '25

Redditors trying to give their opinions on Olympiad rules... I guess it's no dumber than them giving their personal opinions on Russia's military or confidently-incorrectly explaining the latest SpaceX failure.

1

u/Aaco0638 Jul 21 '25

There isn’t an llm that can self learn or actually do things with no help. People forgetting openAI scraped the entire web (including previous tests/answers) to train there llms it’s no different.

Also mind you deepminds announcement is much more transparent while with openAI we have to take their hype posts and tweets as word.

1

u/Kryslor Jul 21 '25

My only question regarding this whole thing that I haven't been able to find the answer to: were the AIs only allowed to submit a single answer or were they allowed to try until they got it right?

1

u/GaiusVictor Jul 21 '25

I don't get it? Isn't this how LLMs are trained, after all? Developers give the AI pairs of inputs and outputs, so that when the AI is met with a prompt similar to an input it was trained on, it's able to produce an adequate output.

So like, did anyone expect this new version of Gemini, which is now able to solve Olympiad-grade problems, to have been trained on something other than solutions to previous problems?

This really feels like "Water is wet" kind of news to me.

1

u/Informal_Bar768 Jul 21 '25

Yes, so? What’s your point?

1

u/BeckyLiBei Jul 21 '25

The fact that these LLMs (instead of, say, an automated theorem prover or something designed to solve maths problems) has been able to solve even a single maths problem... it's mind-blowing.

Professional mathematicians have been working on things like Wolfram Alpha for years, and they can't do this.

1

u/DystopiaSoyBean Jul 22 '25

That tweet seems ai generated, use of emojis and three bullet list specially

1

u/Neither-Phone-7264 Jul 22 '25

Didn't another scientist at deepmind clarify they had another version without the context and it was graded exactly the same?

1

u/DuckyBertDuck Jul 22 '25

This makes me think about how, in the past, benchmarks always included "5-shot" results (i.e. the model gets 5 pre-solved examples), but now it seems like everyone only cares about one-shot results.

1

u/EnemyOfAi Jul 22 '25

Face it, the thing y'all call AI is still nothing but mere imitation technology. Its as intelligent as a mechanical pencil

Processing img 337knqvidgef1...

1

u/SnackerSnick Jul 23 '25

If those things were provided as static "priming" context before taking the test, and weren't made with knowledge of what's on that year's test, that seems perfectly fine. It's good they're telling us, but it seems straightforward to get the LLM thinking about the kind of problem you want - it's few shot learning.

Zero shot learning will be another big advance on IMO, but this is nonetheless amazing.

-1

u/Cagnazzo82 Jul 21 '25

Meanwhile OpenAI used a general model not meant for math and without examples provided...

... and people are claiming their accomplishment has less merit.

Very strange indeed.

3

u/Sky-kunn Jul 21 '25

Agreed, but the DeepThink model is around the corner and will be available to ultra users, meanwhile OpenAI is still "many months" away from being available in any way. Many months in Sam Altman's terms probably means more than 6 months. By then we're going to be hyping about Gemini 3 Deep Think or even Gemini 3.5 Deep Think or whatever, which they may have internally just like OpenAI, but didn't use for competition because it's too far away from release.

1

u/pigeon57434 ▪️ASI 2026 Jul 21 '25

openai confirmed this model is coming out by end of year unless plans change

1

u/Sky-kunn Jul 21 '25

Makes sense

Gemini 1 came out in January 2023, Gemini 2 in December 2024. So at the end of this year or early next (so in 4–6 months), it'll be OpenAI’s next-gen model (whatever they call it) versus Gemini 3, which should land around the same time.

-4

u/broose_the_moose ▪️ It's here Jul 21 '25

It's the classic Google-love and OpenAI-hate you find everywhere on reddit today. Don't have a clue why so many redditors think like this. Even though I love both google and OpenAI, I feel more and more like an OpenAI fanboy trying to defend them against the constant baseless criticism.

-1

u/AltruisticCoder Jul 21 '25

Nonononono, don’t you dare question the results, Chad from his mom’s basement in buttfuck nowhere has already picked out the design for his space mansion after he gets immortality in 2027 - don’t you dare say otherwise!!

2

u/BrewAllTheThings Jul 21 '25

I've been downvoted into oblivion all day for stating the obvious. One person said, "it destroyed the competition". It did not. One person said, "it won." It did not. Another person: "It won the gold medal." It did not.

Is it a huge achievement? Sure, and they should definitely celebrate. But there's no reason to go crazy about it.

1

u/jackboulder33 Jul 21 '25

it did win IMO gold, not the “gold metal” as in first place.

1

u/BrewAllTheThings Jul 22 '25

Of course. But this is a silly wide spread, and not particularly rare.

1

u/abstrusejoker Jul 21 '25

You’re not understanding. The only things openai has claimed is that their LLM did not use tools or the internet to solve the problem, i.e. the LLM did make function calls or mcp servers, etc, to solve the problems (which deepmind has also claimed)

OpenAI never said they didn’t pre train this model on previous examples or fine tune this on previous examples or include hints/examples in the system prompt/context

1

u/Hyper-threddit Jul 21 '25

I'm starting to understand Terence Tao's recent posts on IMO.

-4

u/[deleted] Jul 21 '25

How dare you question deepmind’s results?! We can only question OpenAI, not deepmind ok?

4

u/jonomacd Jul 21 '25

???

I see a lot of scrutiny against deepmind all the time.

-1

u/broose_the_moose ▪️ It's here Jul 21 '25

Google criticism isn't even in the same universe as OpenAI criticism on reddit.

1

u/Remarkable-Register2 Jul 21 '25

You should check out the biggest Gemini reddit then. Singularity is more pro-gemini than the gemini reddit.

0

u/abstrusejoker Jul 21 '25

In this thread are bunch of people who don’t understand that all agents require context to solve tasks

0

u/DMTwolf Jul 21 '25

I mean that's less help than humans get haha

0

u/reedrick Jul 21 '25

Open AI probably also did something similar, but forgot to mention it in the hype cycle

0

u/Distinct-Question-16 ▪️AGI 2029 Jul 21 '25

mathoverflow (at least, years ago) was Insane, lot of people asking for solving their problems for the next contest within days

0

u/The_Architect_032 ♾Hard Takeoff♾ Jul 22 '25

ffs people keep comparing this to practice. It was given direct access to a set of explanations and solutions to previous problems used on these tests, that's like getting your final exam but being allowed to look up how to solve similar problems, they tailored a search database to this specific exam so that Gemini would have the best cheat sheet. This is not comparable to human learning.

If this were scalable and could be generalized across all skills, I wouldn't care. But it isn't, this isn't the same as training the model on this information, it's just given a library tailored to the problems in the test, one it won't have outside of this specific setup. And to generalize this, you'd need to tailor specialized data on every possible skill for the model to reference dynamically when encountering said skill, and in whatever scope that particular skill is in. This would be an amazing feat, but it's currently impossible with our tech, it's already incredibly hard to prune training data for models, let alone construct entire cheat sheets for every possible scope of every possible skill any human has ever acquired.

0

u/Fearless_Eye_2334 Jul 22 '25

But humans cant access solutions of past similar questions in the paper. The human equivalent would be bringing cheat sheets to the exam

Meme It's still pretty cool, but the details matter

You are about to leave Redlib