Microsoft had a relevant paper a few months back that was pretty good. They also reported seeing smaller models outperform larger ones post-distillation:
"In terms of accuracy, we observe in the experiments from section 3.3 that the in-house models trained with GPT-3 labels can often outperform raw GPT-3. We argue that by using data labeled by GPT-3, we are essentially performing self-training: the predictions on unlabeled samples act as regularization on induced models and help improve the performance."
Not the same approach of combining multiple sources, but a similar flavor.
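To make the mechanism in that quote concrete, here's a minimal self-training/distillation sketch in the same spirit: a "teacher" model labels an unlabeled pool, and a smaller "student" trains only on those pseudo-labels. The toy data and scikit-learn models are purely illustrative stand-ins (in the paper the teacher role is played by GPT-3); this is not anyone's actual pipeline.

```python
# Minimal self-training / distillation sketch (hypothetical data and models):
# a "teacher" labels an unlabeled pool, and a smaller "student" is trained on
# those predicted labels rather than on ground truth.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical corpora: a small labeled seed set and an unlabeled pool.
labeled_texts = ["great product", "great service", "terrible product", "terrible service"]
labeled_y = [1, 1, 0, 0]
unlabeled_texts = ["great support", "terrible support", "great shipping", "terrible shipping"]

vec = TfidfVectorizer()
X_labeled = vec.fit_transform(labeled_texts)
X_unlabeled = vec.transform(unlabeled_texts)

# Stand-in "teacher" (in the paper this role is played by GPT-3).
teacher = LogisticRegression().fit(X_labeled, labeled_y)
pseudo_labels = teacher.predict(X_unlabeled)

# The student never sees ground truth for the unlabeled pool; its only signal
# is the teacher's predictions, which act like the regularizer described above.
student = LogisticRegression().fit(X_unlabeled, pseudo_labels)
```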
Ha! If I were trying to pretend I had no affiliation, u/learn-deeply, I probably wouldn't have a username literally matching the author string of the post?
You may also want to give it another read: the GPT-3 models are fine-tuned; that's the point! (The GPT-3 zero-shot baseline that I assume you're referencing is mentioned once as a curiosity but not compared against beyond that.) You can even look at the full cross-product of fine-tuning RoBERTa vs. GPT-3 on GT labels vs. weak labels. With the larger training sets (the distilled and combined set of ~60k) they score essentially identically, within 0.1 point. In other words, you simply don't need all that GPT-3 capacity; all you need is the relevant information it has for your problem.
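For anyone who wants to picture the "fine-tune the small model on the distilled labels" step, here's a hedged sketch using Hugging Face's Trainer. The dataset wrapper and the tiny `train_texts` / `train_labels` lists are hypothetical placeholders for a distilled/combined label set; this is not the post's actual training code.

```python
# Sketch: fine-tune a small model (RoBERTa) on programmatically generated labels.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class DistilledDataset(torch.utils.data.Dataset):
    """Wraps texts plus labels that came from GPT-3 / a label model (hypothetical)."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Placeholder lists standing in for the ~60k distilled + combined examples.
train_texts = ["example utterance one", "example utterance two"]
train_labels = [0, 1]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=DistilledDataset(train_texts, train_labels, tokenizer),
)
trainer.train()
```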
bradenjh OP wrote, in reply to farmingvillein:
"Ground Truth" or manual labels from an expert (as opposed to labels created programmatically with weak supervision).