bradenjh OP t1_ixf191u wrote

Microsoft had a paper a few months back that was pretty good and quite relevant. They also reported seeing smaller models outperform larger ones post-distillation:

"In terms of accuracy, we observe in the experiments from section 3.3 that the in-house models trained with GPT-3 labels can often outperform raw GPT-3. We argue that by using data labeled by GPT-3, we are essentially performing self-training: the predictions on unlabeled samples act as regularization on induced models and help improve the performance."

Not the same approach of combining multiple sources, but a similar flavor.
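For anyone curious what that flavor looks like in practice, here's a toy sketch of the pseudo-label/distillation loop (this is not the paper's code; `query_large_model`, the model names, data, and hyperparameters are placeholders assuming a HuggingFace-style setup):

```python
# Toy sketch: a large model labels unlabeled text, and a much smaller model is
# fine-tuned on those pseudo-labels (the "self-training" effect quoted above).
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import Dataset


def query_large_model(texts):
    # Stand-in for the big labeler (e.g. few-shot GPT-3): replace with a real
    # API call. Here it just returns a dummy class id per text.
    return [0 for _ in texts]


unlabeled_texts = ["example sentence 1", "example sentence 2"]
pseudo_labels = query_large_model(unlabeled_texts)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)

# Build a training set out of the large model's predictions, not gold labels.
ds = Dataset.from_dict({"text": unlabeled_texts, "label": pseudo_labels})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilled-model", num_train_epochs=3),
    train_dataset=ds,
)
trainer.train()  # the small model now learns from the large model's predictions
```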

bradenjh OP t1_ixeyajh wrote

Ha! If I were trying to pretend no affiliation, u/learn-deeply, I probably wouldn't have a username literally matching the author string of the post?

You may also want to give it another read: the GPT-3 models are fine-tuned, that's the point! (The GPT-3 zero-shot baseline that I assume you're referencing is mentioned once as a curiosity but not compared against beyond that.) You can even look at the full cross-product of fine-tuning RoBERTa vs. GPT-3 on GT labels vs. weak labels. With the larger training sets (the distilled and combined set of ~60k) they score essentially identically (within 0.1 point). That is, you simply don't need all that GPT-3 capacity; all you need is the relevant information it has for your problem.
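If it helps, the cross-product I mean is just the 2x2 of model choice and label source; a hypothetical outline (not the paper's code, and `fine_tune`, `evaluate`, and the file names are placeholders):

```python
# Hypothetical outline of the 2x2 comparison: fine-tune each model (RoBERTa,
# GPT-3) on each label source (ground-truth vs. weak/distilled labels) and
# compare test accuracy.
from itertools import product


def fine_tune(model_name, train_path):
    # Placeholder: fine-tune `model_name` on the data at `train_path` and
    # return the trained model (training code omitted here).
    return model_name


def evaluate(model, test_path):
    # Placeholder: return accuracy of `model` on the data at `test_path`.
    return 0.0


label_sources = {
    "gt": "ground_truth_train.jsonl",      # hand-labeled training data
    "weak": "gpt3_distilled_train.jsonl",  # GPT-3-labeled (distilled) data
}

for model_name, (source, train_path) in product(
    ["roberta-base", "gpt-3"], label_sources.items()
):
    model = fine_tune(model_name, train_path)
    acc = evaluate(model, "test.jsonl")
    print(f"{model_name} fine-tuned on {source} labels: accuracy={acc:.3f}")
```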