
rjromero t1_j12aza8 wrote

> We use the model architecture and initial weights of RoBERTa large (Liu et al., 2019), consisting of 354M parameters. Training is done for 100,000 steps, using thirty-two 32GB GPUs.

354M parameters? At FP32 that's about 1.42 GB of weights. It's tiny.

55
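For reference, here's the back-of-the-envelope math behind that figure: a sketch in Python that multiplies the quoted 354M parameter count by the byte width of each precision (the FP16/INT8 rows are extra context, not from the quote).

```python
# Rough sketch: parameter count -> raw weight-storage size at various precisions.
# The 354M figure is from the quoted paper; byte widths are the standard IEEE sizes.

def model_size_gb(n_params: int, bytes_per_param: int) -> float:
    """Return raw weight storage in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

n_params = 354_000_000  # RoBERTa-large parameter count from the quote

for name, width in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    print(f"{name}: {model_size_gb(n_params, width):.2f} GB")

# FP32: 1.42 GB
# FP16/BF16: 0.71 GB
# INT8: 0.35 GB
```

Note this counts only the weights; optimizer state and activations during training take several times more memory, which is why the paper still uses 32GB GPUs.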

vwings t1_j13pguc wrote

That's to be expected, right? A retrieval system should be far more parameter-efficient than memorizing text in network weights the way GPT does...

6