muskoxnotverydirty
muskoxnotverydirty t1_jdzi41h wrote
Reply to comment by Simcurious in [N] OpenAI may have benchmarked GPT-4’s coding ability on it’s own training data by Balance-
It's correct and it's not. The article mentions this, but then says it's likely they weren't able to cleanly separate pre-2021 questions on the non-coding benchmarks.
muskoxnotverydirty t1_jdwjc1w wrote
Reply to comment by tamilupk in [D] Will prompting the LLM to review it's own answer be any helpful to reduce chances of hallucinations? I tested couple of tricky questions and it seems it might work. by tamilupk
And this method avoids some of the drawbacks of OP's prompting. Giving an example of an incorrect response followed by a self-correction within the prompt may make it more likely that the initial response is wrong, since that's the pattern you're showing the model.
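A rough illustration of the contrast (hypothetical prompt wording, not OP's actual prompt):

```python
# Hypothetical prompt wording, not OP's actual prompt.

# OP-style: the in-prompt example itself contains a wrong first answer that is
# then corrected, so "wrong answer first" becomes part of the demonstrated pattern.
wrong_then_corrected = """\
Q: What is the capital of Australia?
A: Sydney.
Review: That's incorrect. The capital of Australia is Canberra.
Final answer: Canberra.

Q: {question}
A:"""

# Alternative: get an answer first, then ask for a review in a separate turn,
# so no incorrect first answer is ever demonstrated.
answer_prompt = "Q: {question}\nA:"
review_prompt = ("Here is a proposed answer:\n{answer}\n\n"
                 "Check it for errors and give a corrected final answer.")
```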
muskoxnotverydirty t1_jdw39vd wrote
Reply to comment by [deleted] in [D]GPT-4 might be able to tell you if it hallucinated by Cool_Abbreviations_9
How so?
muskoxnotverydirty t1_jdvak20 wrote
Reply to comment by Borrowedshorts in [D]GPT-4 might be able to tell you if it hallucinated by Cool_Abbreviations_9
We've already seen similar prompts such as telling it to say "I don't know" when it doesn't know, and then priming it with examples of it saying "I don't know" to nonsense. Maybe there's something to the added work of getting an output and then iteratively self-critiquing to get to a better final output.
I wonder if they could be using this idea to automatically and iteratively generate and improve their training dataset at scale, which would create a sort of virtuous cycle of improve dataset -> improve LLM -> repeat.
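Purely as speculation, the loop I'm imagining would look something like this (assuming a generic `llm(prompt) -> str` helper, not any real OpenAI pipeline):

```python
# Speculative sketch of a generate -> self-critique -> revise loop for improving
# a dataset; `llm` is a hypothetical callable that sends a prompt to the model.
def refine_answer(llm, question: str, rounds: int = 2) -> str:
    answer = llm(f"Q: {question}\nA:")
    for _ in range(rounds):
        critique = llm(f"Question: {question}\nAnswer: {answer}\n"
                       "List any errors in the answer, or reply 'none'.")
        if critique.strip().lower() == "none":
            break
        answer = llm(f"Question: {question}\nAnswer: {answer}\n"
                     f"Critique: {critique}\nRewrite the answer fixing these issues.")
    return answer

# The refined (question, answer) pairs could then be folded back into the
# training data, giving the improve-dataset -> improve-LLM -> repeat cycle.
```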
muskoxnotverydirty t1_jdv9m5v wrote
Reply to comment by was_der_Fall_ist in [D]GPT-4 might be able to tell you if it hallucinated by Cool_Abbreviations_9
"Temperature" governs this behavior, doesn't it? I was under the impression that when you set temperature to zero, you get a deterministic output because it always selects the most probable token.
muskoxnotverydirty t1_jds07qr wrote
Reply to comment by nixed9 in [D] GPT4 and coding problems by enryu42
Eh, I must've misunderstood the paper. It sounded like they were asking GPT-4 to create unit tests, execute the code, and then update its answer based on the results of those tests.
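For what it's worth, the loop I thought the paper described would look roughly like this (my paraphrase, with a hypothetical `llm` callable, not the authors' actual setup):

```python
import subprocess
import tempfile

def run_tests(code: str, tests: str):
    """Write the candidate solution plus assert-based tests to a file and run it."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    return subprocess.run(["python", path], capture_output=True, text=True)

def solve(llm, problem: str, max_iters: int = 3) -> str:
    code = llm(f"Write a Python solution for:\n{problem}")
    tests = llm(f"Write assert-based unit tests for:\n{problem}")
    for _ in range(max_iters):
        result = run_tests(code, tests)
        if result.returncode == 0:  # all tests passed
            break
        # Feed the failure output back so the model can update its answer.
        code = llm(f"Problem:\n{problem}\n\nCode:\n{code}\n\n"
                   f"Test failures:\n{result.stderr}\n\nFix the code.")
    return code
```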
muskoxnotverydirty t1_je027xh wrote
Reply to comment by bjj_starter in [N] OpenAI may have benchmarked GPT-4’s coding ability on it’s own training data by Balance-
Yeah, it's speculation. I agree.
> There is no evidence that it was tested on training data, at this point.
I think what the author is trying to say is that, for some of these tests, there's no evidence it was tested on training data, but there's also no evidence that it wasn't. And the ability to generalize in the specific domain of those tests hinges on that difference. If nothing else, it would be nice for those who publish test results to show how much they know about whether the test data was in the training data. It seems to me that they could automate a search within the training set to see whether the exact wording appears.
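The simplest version of that automated search might be an exact-substring check (naive; real contamination checks usually use n-gram overlap and stream the corpus rather than holding it in memory):

```python
def flag_exact_matches(benchmark_questions, training_documents):
    """Return benchmark questions whose exact wording appears in any training document."""
    flagged = []
    for q in benchmark_questions:
        needle = q.strip()
        if needle and any(needle in doc for doc in training_documents):
            flagged.append(q)
    return flagged
```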