
satireplusplus t1_jcp6bu4 wrote

This model uses a "trick" to efficiently train RNNs at scale, and I still have to take a look to understand how it works. Hopefully the paper is out soon!

Otherwise, size is what matters! Getting there is a combination of factors: the transformer architecture scales well and was the first architecture that made it possible to train these LLMs cranked up to enormous sizes, plus enterprise GPU hardware with lots of memory (40GB, 80GB) and frameworks like PyTorch that make parallelizing training across multiple GPUs easy.
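
To give a flavor of how little boilerplate the multi-GPU part takes these days, here's a minimal sketch (not the OP's training code) using PyTorch's DistributedDataParallel. The model, data, and hyperparameters are placeholders:

```python
# Minimal multi-GPU training sketch with PyTorch DDP (placeholder model/data).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in for a real LLM
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):  # stand-in training loop
        x = torch.randn(8, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces gradients across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
```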

And OP's 14B model might be "small" by today's standards, but it's still gigantic compared to a few years ago. It's ~27GB of FP16 weights.
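
The ~27GB figure is just back-of-the-envelope arithmetic, 14B parameters at 2 bytes each in FP16:

```python
# 14B parameters * 2 bytes per FP16 weight
params = 14e9
bytes_fp16 = params * 2
print(bytes_fp16 / 1e9)    # ~28 GB (decimal gigabytes)
print(bytes_fp16 / 2**30)  # ~26 GiB, i.e. roughly the ~27GB checkpoint size on disk
```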

Having access to ~1TB of preprocessed text data (The Pile) that you can download right away, without doing your own crawling, is also neat.
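
A hedged sketch of pulling Pile text with the Hugging Face `datasets` library; the dataset ID "EleutherAI/pile" is an assumption, since mirrors come and go, so substitute whatever copy of the Pile you actually have access to:

```python
from datasets import load_dataset

# streaming=True avoids downloading the full ~1TB up front.
# "EleutherAI/pile" is an assumed dataset ID -- replace with an available mirror.
pile = load_dataset("EleutherAI/pile", split="train", streaming=True)

for i, example in enumerate(pile):
    print(example["text"][:200])  # peek at the first few documents
    if i >= 2:
        break
```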
