andreichiffa t1_j6n9lg6 wrote
Reply to comment by visarga in Few questions about scalability of chatGPT [D] by besabestin
A lot of the conclusions from that paper have been called into question by the discovery, a little less than a year later, that GPT-2 was actually memorizing a lot of information from its training dataset: https://arxiv.org/abs/2012.07805
About a year after that, Anthropic came out with a paper suggesting scaling laws under which undertrained larger models did not do that much better and actually did need more data: https://arxiv.org/pdf/2202.07785.pdf
Finally, more recent results from DeepMind took another pass at the topic and seem to suggest that the relationship between data and model size is much tighter than anticipated, and that a 4x smaller model trained on ~4x the data would outperform the larger model: https://arxiv.org/pdf/2203.15556.pdf
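To make the compute tradeoff concrete, here's a rough back-of-the-envelope check (a minimal sketch using the common 6 * params * tokens FLOPs approximation and the published Gopher/Chinchilla configs; the exact fitted constants in the paper differ):

```python
# Back-of-the-envelope check of the Chinchilla result (Hoffmann et al. 2022),
# using the standard ~6 * params * tokens approximation for training FLOPs.
def train_flops(params, tokens):
    return 6 * params * tokens

# Gopher-style config: 280B parameters on 300B tokens
big = train_flops(280e9, 300e9)

# Chinchilla-style config: 4x fewer parameters, ~4-5x more tokens (1.4T)
small = train_flops(70e9, 1.4e12)

print(f"280B params, 300B tokens: {big:.2e} FLOPs")
print(f" 70B params, 1.4T tokens: {small:.2e} FLOPs")
# Both land around 5-6e23 FLOPs, i.e. roughly the same compute budget,
# yet the smaller, longer-trained model wins on the paper's evals.
```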
Basically, the original OpenAI paper did contradict a lot of prior research on overfitting and generalization, and that discrepancy seems to be due to an instance of Simpson's paradox in some of the batching they were doing.
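For anyone unfamiliar with Simpson's paradox, here's a toy numeric illustration (made-up numbers, not the actual batching from the scaling-laws paper) of how a trend that holds within each group can flip once the groups are pooled:

```python
# Toy Simpson's paradox demo with made-up numbers: each entry is (successes, trials).
groups = {
    "batch A": {"model_1": (8, 10),  "model_2": (18, 25)},
    "batch B": {"model_1": (14, 40), "model_2": (2, 8)},
}

totals = {"model_1": [0, 0], "model_2": [0, 0]}
for batch, results in groups.items():
    for model, (wins, n) in results.items():
        print(f"{batch} {model}: {wins / n:.0%}")
        totals[model][0] += wins
        totals[model][1] += n

for model, (wins, n) in totals.items():
    print(f"overall {model}: {wins / n:.0%}")
# model_1 scores higher within each batch (80% vs 72%, 35% vs 25%),
# but model_2 comes out ahead once the batches are pooled (44% vs 61%),
# because the group sizes are uneven.
```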