Submitted by Dr_Singularity t3_ywdsks in singularity
Dr_Singularity OP t1_iwj0ct4 wrote
It Delivers Near-Perfect Linear Scaling for Large Language Models
94746382926 t1_iwk1qrv wrote
Linear speed-up in training time, not necessarily in model performance. Just wanted to mention that, as it's an important distinction.
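A toy sketch of the distinction (made-up constants, not Cerebras' benchmarks): wall-clock time drops roughly 1/N with N chips, while model quality follows a separate, much flatter power law in scale.

```python
# Toy illustration of "linear scaling" in training time vs. performance.
def training_time(base_hours: float, n_chips: int, efficiency: float = 1.0) -> float:
    """Wall-clock hours assuming near-linear scaling across n_chips."""
    return base_hours / (n_chips * efficiency)

def toy_loss(params_b: float) -> float:
    """Illustrative power-law loss curve (billions of params); constants are invented."""
    return 2.0 * params_b ** -0.07

print(training_time(1000, 16, efficiency=0.95))  # ~65.8 hours on 16 chips
print(toy_loss(20), toy_loss(175))               # loss improves only sublinearly with size
```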
visarga t1_iwkbncq wrote
One Cerebras chip is roughly as fast as 100 top GPUs, but its memory only handles about 20B weights; they mention GPT-NeoX 20B. They'd need to stack around 10 of these to train GPT-3 (175B parameters).
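The arithmetic behind that, as a back-of-envelope sketch (assuming fp16 weights at 2 bytes/param, counting weights only, no optimizer state or activations):

```python
import math

BYTES_PER_PARAM = 2  # fp16 assumption

def weight_memory_gb(params_billion: float) -> float:
    """GB needed just to hold the weights."""
    return params_billion * 1e9 * BYTES_PER_PARAM / 1e9

neox_20b = weight_memory_gb(20)    # ~40 GB for GPT-NeoX 20B
gpt3_175b = weight_memory_gb(175)  # ~350 GB for GPT-3

print(math.ceil(gpt3_175b / neox_20b))  # 9 units, consistent with "stack ~10"
```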
Lorraine527 t1_iwxm2oj wrote
GPUs have much more memory per core, and that's needed for language models.
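An illustrative comparison using public spec-sheet numbers (figures approximate; a CUDA core and a Cerebras AI core aren't directly comparable units):

```python
# Memory-per-core back-of-envelope: NVIDIA A100 80GB vs. Cerebras WSE-2.
a100_mem_gb, a100_cores = 80, 6912       # A100 HBM, CUDA cores
wse2_mem_gb, wse2_cores = 40, 850_000    # WSE-2 on-chip SRAM, AI cores

print(a100_mem_gb * 1024 / a100_cores)   # ~11.9 MB per CUDA core
print(wse2_mem_gb * 1024 / wse2_cores)   # ~0.048 MB (~48 KB) per core
```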