Submitted by Balance- t3_124eyso in MachineLearning

GPT-4 and professional benchmarks: the wrong answer to the wrong question

OpenAI may have tested on the training data. Besides, human benchmarks are meaningless for bots.

Problem 1: training data contamination

To benchmark GPT-4’s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.

As further evidence for this hypothesis, we tested it on Codeforces problems from different times in 2021. We found that it could regularly solve problems in the easy category before September 5, but none of the problems after September 12.

In fact, we can definitively show that it has memorized problems in its training set: when prompted with the title of a Codeforces problem, GPT-4 includes a link to the exact contest where the problem appears (and the round number is almost correct: it is off by one). Note that GPT-4 cannot access the Internet, so memorization is the only explanation.
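For concreteness, here is a minimal sketch of the date-bucketed check described above. It is only an illustration: the helper functions, the exact cutoff date, and the bucket names are placeholders, not the authors' actual evaluation harness.

```python
from datetime import date

CUTOFF = date(2021, 9, 1)  # approximate GPT-4 training-data cutoff (placeholder)

def ask_model_for_solution(statement: str) -> str:
    """Placeholder: call your LLM API and return the generated source code."""
    raise NotImplementedError

def passes_judge(problem_id: str, source: str) -> bool:
    """Placeholder: submit the code to the judge (or run local tests) and report pass/fail."""
    raise NotImplementedError

def solve_rate_by_cutoff(problems):
    """problems: iterable of (problem_id, statement, publication_date) tuples."""
    tallies = {"pre-cutoff": [0, 0], "post-cutoff": [0, 0]}  # [solved, attempted]
    for pid, statement, published in problems:
        bucket = "pre-cutoff" if published < CUTOFF else "post-cutoff"
        tallies[bucket][1] += 1
        if passes_judge(pid, ask_model_for_solution(statement)):
            tallies[bucket][0] += 1
    return {k: solved / attempted for k, (solved, attempted) in tallies.items() if attempted}
```

A sharp drop in solve rate from the pre-cutoff bucket to the post-cutoff bucket is the signature of memorization rather than problem-solving ability.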

925

Comments


wazis t1_jdz4v8g wrote

If it is true (too lazy to check), it is not surprising. If it is not, then it is also not surprising.

109

Seankala t1_jdz53mn wrote

It'd be nice to see the qualifications of the authors.

0

ghostfaceschiller t1_jdz6vzn wrote

I think this was shown a while ago (like a week ago, which just feels like ten years).

While I do think this is important for several reasons, personally I don't see it as all that impactful for what I consider AI capable of going forward.

That's because pretty much all my assumptions for the next couple of years are based on the idea of systems that can loop and reflect on their own actions, re-edit code based on error messages, etc., which they are very good at.

76

Simcurious t1_jdzatox wrote

That's not correct; the benchmark they used only contained Codeforces problems from after 2021.

From Horace's tweets:

> Considering the Codeforces results in the paper (very poor!), they might have only evaluated it on recent problems.

45

hadaev t1_jdzcowi wrote

Well, we usually expect this kind of trivial mistake from people who aren't really DS folks, like biologists using DS methods.

It doesn't seem hard to search for matches in text, unlike other data types.

13

mrdevlar t1_jdzdi2t wrote

Proof that no matter where you go, it is always going to be possible to make simple mistakes.

7

master3243 t1_jdzec5r wrote

Seeing how they made sure the bar exam and the math olympiad tests were recent ones, explicitly stated not to be in the training dataset to avoid contamination, I trusted that all the other reported tests were picked just as carefully.

14

VertexMachine t1_jdzehvy wrote

Interesting. Potentially something that might also be used in the ongoing lawsuit against Copilot?

2

rfxap t1_jdzfxd1 wrote

There are other benchmarks to look at though. Microsoft Research tried an early version of GPT-4 on LeetCode problems that were published after the training data cutoff date, and they got results similar to human performance in all difficulty categories: https://arxiv.org/abs/2303.12712 (page 21)

What should we make of that?

277

mrpickleby t1_jdzg5e8 wrote

Implies that AI will speed the dissemination of information but not necessarily be helpful in creating new thinking.

23

bjj_starter t1_jdzo3zq wrote

But that's pure speculation. They showed that a problem existed with the training data, and OpenAI had already dealt with that problem and wasn't hiding it at all: GPT-4 wasn't tested on any of that data. Moreover, it's perfectly fine for problems like the ones it will be tested on (i.e. past problems) to be in the training data. What's important is that what it's actually tested on is not in the training data. There is no evidence that it was tested on training data, at this point.

Moreover, the Microsoft Research team was able to repeat some impressive results in a similar domain on tests that didn't exist before the training data cut-off. There isn't any evidence that this is a problem with a widespread effect on performance. It's also worth noting that it seems pretty personal for the guy behind this paper, judging by the way he wrote his tweet.

3

bjj_starter t1_jdzoafq wrote

This title is misleading. The only thing they found was that GPT-4 was trained on code questions it wasn't tested on.

42

sigmatrophic t1_jdzpk9m wrote

Honestly, I paid for GPT-4... It's a bit better, but it feels like GPT-3 before they dumbed it down.

5

plocco-tocco t1_jdzpyf8 wrote

I do not see any evidence of this happening in the article. Also, OpenAI claims to have checked for contamination in every benchmark, so I don't see what the authors are trying to show here.

−2

visarga t1_jdzr4tp wrote

This paper scared me more than any other ML paper. I'd hoped we had 2-3 more years before what they show in there.

2

abc220022 t1_jdzrsbu wrote

Part of the sales pitch behind LeetCode is that you are working on problems that are used in real coding interviews at tech companies. I believe that most LeetCode problems were invented well before they were published on the LeetCode website, so they still could appear in some form in their training data.

350

milktoasttraitor t1_jdzuw0z wrote

If you look at the prompt they show, they clearly gave it hints which tell it the exact approach to use in order to solve the problem. The problem is also a very slight derivative of another existing, very popular problem on the platform (“Unique Paths”).

This is impressive in another way, but not in the way they were trying to show. They didn't show the other questions it got right, so there's no way to tell how good or bad the methodology was overall or what hints they gave it. For that question at least, it's not good, and it makes me skeptical of the results.

20

thelastpizzaslice t1_jdzv7pu wrote

I once asked it for a parody of "American Pie" about Star Wars Episode I and it gave me Weird Al's song verbatim.

9

londons_explorer t1_jdzwcfo wrote

Problems like this are never 100% novel.

There are always elements and concepts of the problem and solution that have been copied from other problems.

The easiest way to see this is to ask a non-programmer to come up with a 'programming puzzle'. They'll probably come up with something like "Make an app to let me know when any of my instagram friends are passing nearby and are up for hanging out".

Compare that to a typical leetcode problem, and you'll soon see how leetcode problems are really only a tiny tiny corner of what is possible to do with computers.

93

jrkirby t1_jdzx1ef wrote

I'm guessing the hard part is that you can't "untrain" a model. They hadn't thought "I want to benchmark on these problems later" when they started. Then they spent $20K+ of compute on training. Then they wanted to test it. You can easily find the stuff you want to test on in your training dataset, sure. But you can't so easily remove it and train everything again from scratch.

7

muskoxnotverydirty t1_je027xh wrote

Yeah it's speculation. I agree.

> There is no evidence that it was tested on training data, at this point.

I think what the author is trying to say is that for some of these tests there's no evidence it was tested on training data, but there's no evidence that it wasn't. And the ability to generalize in the specific domain of those tests depends on that difference. If nothing else, it would be nice for those who publish test results to show how thoroughly they checked whether the test data was in the training data. It seems to me that they could automate a search within the training set to see if exact wordage is used.
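As an illustration, an automated exact-wordage check could be as simple as the sketch below. The whitespace normalization and the 50-character window size are arbitrary choices made here for the example, not any lab's published methodology.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial reformatting doesn't hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def shares_long_substring(test_item: str, training_doc: str, n: int = 50) -> bool:
    """Flag the pair if any n-character window of the test item appears verbatim in the document."""
    t, d = normalize(test_item), normalize(training_doc)
    if len(t) <= n:
        return t in d
    return any(t[i:i + n] in d for i in range(len(t) - n + 1))

# Toy usage: a benchmark question vs. a (tiny) stand-in for the training corpus.
corpus = ["...given an array of integers nums and an integer target, return indices of the two numbers..."]
question = "Given an array of integers nums and an integer target, return indices of the two numbers that add up to target."
print(any(shares_long_substring(question, doc) for doc in corpus))  # True -> possible contamination
```

For a real training corpus you would index the text (e.g. hash the character n-grams) rather than scan every document per test item, but the idea is the same.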

11

Wtiaw t1_je04aq2 wrote

> Note that GPT-4 cannot access the Internet, so memorization is the only explanation

This is not true; it was shown through jailbreaks that it could access the internet.

−5

truchisoft t1_je05j63 wrote

The funny thing about these posts is that this is clearly propaganda aimed at low-effort people.

Anyone caring about this is either blinded by their own prejudice or just too dumb to even try GPT once themselves.

Everyone else does not need someone telling them that even GPT-3.5 is incredible for coding (and a lot of other stuff). It is not perfect, but it goes a long way; heck, I was even able to make a simple game in less than 3 hours using 99% GPT-3.5 code and DALL-E sprites.

−10

ArnoF7 t1_je0dzqg wrote

Funnily, I actually found GPT-4 far worse than I expected at coding, especially after I looked at its impressive performance on other exams. I guess it's still progress for LLM coding, maybe just a little underwhelming compared to the other standardized tests it aces? GPT-4's performance on Codeforces is borderline abhorrent.

And now you are telling me there is data leakage, so the actual performance would be even worse than what’s on paper???

5

austacious t1_je0g6oi wrote

A healthy skepticism in AIML from those in the field is incredibly important and relatively hard to come by. Having the attitude that 'This is great and everything is wonderful' does not lead to meaningful progress addressing very real issues. It's very productive to point out shortcomings of otherwise highly effective models.

12

cegras t1_je0g90p wrote

How does the AI perform any better than a Google search? I'd say the AI is even more dangerous as it gives a single, authoritative sounding answer that you have to go to Google and secondary sources to verify anyways!

12

MrFlamingQueen t1_je0j29h wrote

It feels like the majority of people in this discussion have no idea what computer science is and what LeetCode tests.

As you mentioned, there are hundreds of websites devoted to teaching the leetcode design patterns and entire books devoted to learning and practicing these problems.

33

ianitic t1_je0mjqx wrote

Oh, I haven't tested this on textbooks, but I have asked ChatGPT to give me pages of a novel and it did, word for word. I suspect it had to have been trained on PDFs? I'm highly surprised I haven't seen any news of authors/publishers suing yet, tbh.

Based on the above test, though, it's obvious whether or not a book is part of its training set.

10

ReasonablyBadass t1_je0rwr3 wrote

Is it possible the older questions cover problems that are by now better known, so more training data existed for them, while the newer ones cover newer concepts not really represented on the net yet?

1

meister2983 t1_je0s90f wrote

GPT-4 is an extremely good pattern matcher - probably one of the best ever made. Most exams seem to be solvable with straightforward pattern matching (with no backtracking). The same thing applies to basic coding questions - it performs roughly at the level of a human gluing Stack Overflow solutions together (with the obvious variable renaming, moving lines around, removing dead code, etc.).

It struggles at logical reasoning (when it can't "pattern match" the logical reasoning to something it's trained on).

Coding example:

  • It had no problem writing a tax calculator for ordinary income with progressive tax brackets.
  • It struggles to write a program to calculate tax on long-term capital gains (US tax code), which is very similar to the above except for an offset (you start bracket indexing at the ordinary income; see the sketch below for the gist of both). I'd think this is actually pretty easy for a CS student, especially if they'd seen the solution above. GPT-4 struggled, though, as it doesn't really "reason" about code the way a human would, and it would generate solutions that are obviously wrong to a human.
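A rough sketch of the two calculations the comment above describes, for readers who haven't stared at tax code. The bracket thresholds and rates are placeholders rather than the real US tables, and this is an illustration written for this thread, not GPT-4's output.

```python
# Illustrative brackets only: (lower bound, marginal rate). Not the real US tax tables.
BRACKETS = [(0, 0.10), (10_000, 0.12), (40_000, 0.22), (90_000, 0.24)]
CG_BRACKETS = [(0, 0.00), (40_000, 0.15), (445_000, 0.20)]

def ordinary_tax(income: float) -> float:
    """Progressive tax: each slice of income is taxed at its own bracket's rate."""
    tax = 0.0
    for i, (lower, rate) in enumerate(BRACKETS):
        upper = BRACKETS[i + 1][0] if i + 1 < len(BRACKETS) else float("inf")
        if income <= lower:
            break
        tax += (min(income, upper) - lower) * rate
    return tax

def capital_gains_tax(ordinary_income: float, gains: float) -> float:
    """The 'offset' case: gains stack on top of ordinary income, so the bracket
    lookup starts at ordinary_income instead of at zero."""
    tax = 0.0
    for lower, rate in CG_BRACKETS:
        idx = CG_BRACKETS.index((lower, rate))
        upper = CG_BRACKETS[idx + 1][0] if idx + 1 < len(CG_BRACKETS) else float("inf")
        lo = max(lower, ordinary_income)
        hi = min(upper, ordinary_income + gains)
        if hi > lo:
            tax += (hi - lo) * rate
    return tax

print(ordinary_tax(50_000))                 # 6800.0 with the placeholder brackets
print(capital_gains_tax(50_000, 20_000))    # 3000.0 with the placeholder brackets
```

The second function is only a few lines different from the first, which is what makes the reported failure interesting: the change is conceptually small for a human but apparently hard to "reason" into existence by pattern matching alone.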

14

nomadiclizard t1_je0u5ex wrote

Haha amateurs. I learned not to make that mistake when I split a pose estimation visual dataset into training and validation, but lots of the frames were almost-duplicates so it got contaminated that way. >.<

1

truchisoft t1_je0un8g wrote

Oh no no, that's not my argument here, but the whole title wording looks like a sleazy attack. This is not criticism but seems like a hit piece, since, as other commenters mention, other independent tests were already run on GPT-4 and people are already using GPT-4 for coding.

0

polygon_primitive t1_je0x04y wrote

For finding answers it's about the same as Google, sometimes better if you then verify the result with external sources, but that's mainly because Google has so badly corrupted their core search product while chasing profit. It's been pretty useful for me for doing the grunt work of writing boilerplate code and refactoring stuff, though.

3

visarga t1_je0zqxm wrote

ML people spend all day thinking about model limitations and errors, so it's only normal that we are not easily swayed by a non-peer-reviewed paper declaring first contact with AGI. Especially from MS, who owns 50% of OpenAI.

6

thorax t1_je107vs wrote

I'm working on an extreme usage model for leveraging GPT4 to generate code, and it's rather good. Not perfect, but impressive is an understatement.

1

Puzzleheaded_Acadia1 t1_je11l0o wrote

So does that mean that GPT-4 can't think critically? And if not, can we make a new kind of ML model, like LLMs and LLaMA, that can think critically and integrate it with GPT-4 so it becomes a multimodal system that can "see" and think critically?

0

DaBobcat t1_je12b4q wrote

Here, OpenAI and Microsoft were evaluating GPT-4 on medical problems. In section 6.2 they specifically said that they found strong evidence that it was trained on "popular datasets like SQuAD 2.0 and the Newsgroup Sentiment Analysis datasets". In appendix section B they explain how they measured whether it saw something in the training data. Point is, I think benchmarks are quite pointless if the training dataset is private and no one can verify that the model wasn't trained on the test set, which in several cases they specifically said it was.

5

currentscurrents t1_je12d3k wrote

Nobody knows exactly what it was trained on, but there exist several datasets of published books.

>I'm highly surprised I haven't seen any news of authors/publishers suing yet tbh.

They still might. But they don't have a strong motivation; it doesn't really directly impact their revenue because nobody's going to sit in the chatgpt window and read a 300-page book one prompt at a time.

6

currentscurrents t1_je13kdr wrote

True! But also, problems in general are never 100% novel. That's why metalearning works.

You can make up for poor reasoning abilities with lots of experience. This isn't bad exactly, but it makes testing their reasoning abilities tricky.

15

TheEdes t1_je149kf wrote

Yeah, but if you were to come up with a problem in your head that didn't exist word for word, then GPT-4 would be doing what they're advertising. However, if the problem appears word for word anywhere in the training data, then the test data is contaminated. If the model can learn the design patterns for leetcode-style questions by looking at examples of them, then it's doing something really good; if it can only solve problems it has seen before, then it's nothing special, and they just overfit a trillion parameters on a comparatively very small dataset.

8

currentscurrents t1_je14pi5 wrote

Clearly, the accuracy is going to have to get better before it can replace Google. It's pretty accurate when it knows what it's talking about, but if you go "out of bounds" the accuracy drops off a cliff without warning.

But the upside is that it can integrate information from multiple sources and you can interactively ask it questions. Google can't do that.

3

regalalgorithm t1_je1eu1e wrote

FYI, the GPT-4 paper has a whole section on contamination in the appendix - I found it to be pretty convincing. Removing contaminating data did make it worse at some benchmarks, but also better at others, and overall the effect wasn't huge.

1

notforrob t1_je1lowh wrote

This inspired me to ask GPT-4:
"Can you generate a leetcode easy problem that has never been seen?"

And then ask it to solve the problem it creates. In the few cases I tried, it failed miserably.

1

HonkyTonkPolicyWonk t1_je1mqdp wrote

Well, yeah, ChatGPT is auto-suggest on steroids. It can't create anything de novo. It reframes and regurgitates what others have done.

No surprises here

0

mlresearchoor t1_je1mvf7 wrote

OpenAI blatantly ignored the norm not to train on the ~200 tasks collaboratively prepared by the community for BIG-bench. GPT-4 knows the BIG-bench canary string, afaik, which invalidates GPT-4 evals on BIG-bench.

OpenAI is cool, but they genuinely don't care about academic research standards or benchmarks carefully created over years by other folks.

92

WarmSignificance1 t1_je1pdz9 wrote

I think that ChatGPT has shown how bad so many people are at Googling. And granted, sometimes ChatGPT is just far superior.

But when people say things like "I can ask it how to use a library and it's made me 10x faster than using Google", it just blows my mind. I can usually find the official docs and figure out how to use a library in about the same time as ChatGPT can tell me, without the risk of errors.

12

pale2hall t1_je2begu wrote

ChatGPT-4 can't remember that it's writing a Firefox add-on, not a Chrome extension.

It's like the most amazing coder ever, but always half-drunk, completely confident, and always apologizing. Here's how almost every single response started after the first:

  • Apologies for the incomplete response.
  • Apologies for the confusion. The Express server I provided earlier ...
  • I apologize for the inconvenience. After reviewing the code, I've noticed some inconsistencies in the code
  • I apologize for the confusion. It appears that the context menu was removed due to a typo in the content.js file.
  • I apologize for the confusion. To make the changes you requested, follow the steps below:
  • Apologies for the confusion, and thank you for providing the additional information. Here's an updated implementation that should resolve the issues:
  • I apologize for the confusion. Here's an updated solution that should display the response in the popup window and clear the input field on submit. Additionally, I added an indicator that shows the addon is thinking.
  • Apologies for the confusion, and thank you for the clarification. Based on your requirement, you can make the following changes:
  • Apologies for the confusion. You are correct that you cannot trigger the reviseMyComment() function in the content script without sending a message from the background script.
  • My apologies for the confusion. The error you are encountering is because the sendToOpenAI() function is not available in the content script content.js
  • Apologies for the confusion. I made an error in my previous response.

3

AquaBadger t1_je2c68z wrote

To be fair, Google has gotten slower at surfacing useful information due to the mass of ads and bought results clogging up searches now. But yes, Google is still faster than ChatGPT, and if cleaned up it would be even better.

9

bjj_starter t1_je2ckb0 wrote

>If nothing else, it would be nice for those who publish test results to show how much they knew whether test data was in the training data.

Yes, we need this and much more information about how it was actually built, what the architecture is, what the training data was, etc. They're not telling us because trade secrets, which sucks. "Open" AI.

1

jrkirby t1_je2f63r wrote

$2 million or $20 million is greater than $20 thousand. And it makes the main thesis more salient: the more money you've spent on training, the less willing you'll be to retrain the entire model from scratch just to run some benchmarks the "proper" way.

1

trajo123 t1_je2gie9 wrote

How much of the code that devs write on a typical day is truly novel and not just a rehash / combination / adaptation of existing stuff?

He who has not copied code from stackoverflow, let him cast the first insult at ChatGPT.

0

cegras t1_je2k9dr wrote

ChatGPT is great at learning the nuances of English, i.e. synonyms and metaphors. But if you feed it a reworded leetcode question and it finds the answer within its neural net, has it learned to conceptualize? No, it just learned that synonym ...

8

pmirallesr t1_je2tf2v wrote

Idk, the procedure to check for contamination described in the release report sounded solid at first glance, and I don't see how this news changes that

1

SWESWESWEh t1_je33t7z wrote

I've had a lot more luck solving novel coding problems with the GPT-4 version of ChatGPT than with Google. If you stick to older tech and libraries like Java and Spring that have been around forever, it's really good at solving fairly difficult problems if you just keep providing context. With Google, it basically comes down to whether someone has done this exact thing on SO and gotten an answer; if not, oh well.

2

Coffee_Crisis t1_je392lv wrote

If you search GitHub for unusual variable names or keywords, you will often find code that looks very similar to the stuff GPT spits out; in some domains it's much more copy-paste than people think.

1

salgat t1_je3eqx5 wrote

GPT4 is the world's best googler. As long as a similar solution existed on the internet in the past, there's a good chance GPT4 can pick it up, even if it's not on leetcode yet.

1

purplebrown_updown t1_je3xwqa wrote

Question. I’m guessing they want to continuously feed more data to gpt so how do they avoid using up all their training. Is this what’s called data leakage?

1

joeiyoma t1_je42q2t wrote

ChatGPT always has the potential for error; version 4 has a reduced potential for error. My biggest worry is what it will do to our creativity. Autopilot all the time!

1

Calamero t1_je4doo0 wrote

It will enable creative people to bring their ideas to reality. It won’t make people less creative. AI technology democratizes the execution part, making it easier for people from all walks of life to transform their visions into reality. It will augment human creativity rather than stifling it.

1

WarmSignificance1 t1_je57s2a wrote

So I actually think that senior devs copy and paste a lot less than everyone imagines.

I can’t remember the last time I’ve copied code from StackOverflow. Actually, I rarely even use StackOverflow at this point. Going directly to the official docs is always best.

1

TheEdes t1_je6tweq wrote

Sure, but what's being advertised isn't sentience per se, at least with the leetcode part of their benchmarks. The issue here is that they claim it can do X% on leetcode, but it seems like it's much less on new data. Even if it had only learned to find previous solutions and adapt them with small changes, it should still be able to perform well, given the nature of the problems.

1

Nhabls t1_je93uvg wrote

The way they defined human performance there is just funny.

Dividing the number of accepted answers by the total number of users... might as well just make up a number.

1

Nhabls t1_je94xwx wrote

Not misleading. The fact that it performs so differently on easy problems it has seen vs. ones it hasn't, especially when it fails so spectacularly on the latter, does raise big doubts about how corrupted and unreliable their benchmarks might be.

6

bjj_starter t1_je98wdx wrote

Okay, but an external team tested it on coding problems that only came into existence after its training finished, and found human-level performance. I don't think your theory explains how that could be the case.

1

Nhabls t1_je9anrq wrote

Which team is that? The one at Microsoft that made up the human performance figures in a completely ridiculous way? Basically: "We didn't like that pass rates were too high for humans on the hard problems that the model fails completely, so we just divided the accepted count by the entire user base." Oh yeah, brilliant.

The "human" pass rates are also composed of people learning to code trying to see if their solution works. Its a completely idiotic metric, why not go test randos on the street and declare that represents the human coding performance metric while we're at it

1

joeiyoma t1_je9fegm wrote

There is a lot of buzz about prompt engineering. Can it make the cut as a skill set going forward, or is it just hype that will fade with time?

1