drd13 t1_j1h3gvy wrote
Reply to comment by [deleted] in [R] Nonparametric Masked Language Modeling - MetaAi 2022 - NPM - 500x fewer parameters than GPT-3 while outperforming it on zero-shot tasks by Singularian2501
Similarly to T5 (and BERT), the model is pre-trained by predicting randomly masked spans of words. However, the way those masked spans are predicted is different.
In T5, masked words are generated autoregressively, one at a time, using a softmax over the vocabulary. Here, a set of candidate spans covering the whole training corpus is built beforehand, and the model scores all of those candidates and picks the one it thinks is the best match for the masked position (trained with a contrastive loss).
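Roughly, the idea looks something like the sketch below. This is just a minimal illustration of "score candidate spans against the masked query and pick the best one", not the paper's actual code: the span list, the random vectors standing in for encoder outputs, and the temperature value are all made-up placeholders.

```python
import torch
import torch.nn.functional as F

hidden_dim = 16

# Stand-ins for encoder representations of candidate spans drawn from the corpus.
# In the real model these would come from a trained bidirectional encoder.
candidate_spans = ["Thessaloniki", "Athens", "the capital", "a port city"]
span_embeddings = torch.randn(len(candidate_spans), hidden_dim)

# Stand-in for the encoder's representation of the [MASK]ed position in the query.
query_embedding = torch.randn(hidden_dim)

# Score every candidate span against the query and pick the best one,
# instead of generating tokens with a softmax over the vocabulary.
scores = F.cosine_similarity(span_embeddings, query_embedding.unsqueeze(0), dim=-1)
best = scores.argmax().item()
print(f"Predicted span: {candidate_spans[best]}")

# During pre-training, a contrastive loss pulls the query embedding toward the
# correct span and pushes it away from the other (negative) spans.
gold_index = torch.tensor([0])   # placeholder for the ground-truth span index
logits = (scores / 0.07).unsqueeze(0)  # 0.07 is an assumed temperature, not from the paper
loss = F.cross_entropy(logits, gold_index)
```

So instead of a parametric output head tied to a fixed vocabulary, the "output layer" is effectively the corpus itself, which is where the parameter savings relative to GPT-3 come from.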