drd13 t1_j1h3gvy wrote
Reply to comment by [deleted] in [R] Nonparametric Masked Language Modeling - MetaAi 2022 - NPM - 500x fewer parameters than GPT-3 while outperforming it on zero-shot tasks by Singularian2501
Similarly to T5 (and BERT), the model is pre-trained by predicting randomly masked spans of words. However, the way those masked spans are predicted is different.
In T5, masked words are generated autoregressively, one at a time, using a softmax over the vocabulary. Here, a set of candidate spans covering the whole training corpus is built beforehand, and the model scores all of those candidates and picks the one it thinks is the best match for the masked position (trained with a contrastive loss).
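Roughly, the idea looks something like the sketch below. This is just a minimal illustration of "score candidate spans against the masked query and pick the best one", not the paper's actual code: the span list, the random vectors standing in for encoder outputs, and the temperature value are all made-up placeholders.

```python
import torch
import torch.nn.functional as F

hidden_dim = 16

# Stand-ins for encoder representations of candidate spans drawn from the corpus.
# In the real model these would come from a trained bidirectional encoder.
candidate_spans = ["Thessaloniki", "Athens", "the capital", "a port city"]
span_embeddings = torch.randn(len(candidate_spans), hidden_dim)

# Stand-in for the encoder's representation of the [MASK]ed position in the query.
query_embedding = torch.randn(hidden_dim)

# Score every candidate span against the query and pick the best one,
# instead of generating tokens with a softmax over the vocabulary.
scores = F.cosine_similarity(span_embeddings, query_embedding.unsqueeze(0), dim=-1)
best = scores.argmax().item()
print(f"Predicted span: {candidate_spans[best]}")

# During pre-training, a contrastive loss pulls the query embedding toward the
# correct span and pushes it away from the other (negative) spans.
gold_index = torch.tensor([0])   # placeholder for the ground-truth span index
logits = (scores / 0.07).unsqueeze(0)  # 0.07 is an assumed temperature, not from the paper
loss = F.cross_entropy(logits, gold_index)
```

So instead of a parametric output head tied to a fixed vocabulary, the "output layer" is effectively the corpus itself, which is where the parameter savings relative to GPT-3 come from.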