royalemate357 t1_jb1h7wl wrote
Reply to comment by Art10001 in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
hmm, I very much doubt it could have run 100x faster for the same parameter count, as you are memory-bandwidth bound (both GPT and RWKV have to load the parameters n times to generate n tokens). I'm also somewhat skeptical that you only need 3GB for 14B parameters *without offloading the model*, as even 4-bit quantization needs 14B/2 = 7GB. And offloading the model is slow to the point of being unusable, since you need to do CPU<->GPU transfers.
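For reference, a rough back-of-the-envelope in Python (only counting memory for the weights themselves, ignoring activations, KV cache, and framework overhead):

```python
# Rough parameter-memory estimate at different precisions.
# Purely illustrative; real usage needs additional memory on top of this.

def param_memory_gb(n_params: float, bits_per_param: int) -> float:
    """GB needed just to hold the weights."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 14e9  # 14B-parameter model
for bits in (16, 8, 4):
    print(f"{bits}-bit: {param_memory_gb(n_params, bits):.1f} GB")
# 16-bit: 28.0 GB
# 8-bit:  14.0 GB
# 4-bit:   7.0 GB  -> already well above 3 GB, hence the skepticism
```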
xx14Zackxx t1_jb1zk8v wrote
It depends on the context length. Since attention scales as n^2 and an RNN scales as n, the speedup grows with document length, roughly by a factor of n. Now, there are also some slowdowns. I am certain his RNN solution here has to do some tricks that are more complex than just a simple RNN. But the longer the context, the greater the speedup relative to a transformer. So 100x on a long document is not necessarily impossible (at least at inference time).
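A toy illustration of that scaling argument, looking only at the attention term and ignoring constant factors and the parameter-loading cost the parent comment mentions:

```python
# Illustrative scaling comparison (not a benchmark): total work to generate a
# document of length n, counting only the sequence-mixing term. The per-token
# feed-forward cost is the same for both architectures and is omitted.

def attention_cost(n: int) -> int:
    # Each new token attends to all previous tokens: 1 + 2 + ... + n ~ n^2 / 2
    return n * (n + 1) // 2

def rnn_cost(n: int) -> int:
    # Each new token updates a fixed-size recurrent state: cost ~ n total
    return n

for n in (512, 4096, 32768):
    ratio = attention_cost(n) / rnn_cost(n)
    print(f"n={n:6d}  attention/rnn ratio ~ {ratio:.0f}x")
# The ratio grows linearly with context length, which is the point above.
```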
I have a hard time believing the memory claims as well, though. Again, I really wish the author would write a paper about it, because as far as I can see, if he's using standard backpropagation through time to train, the memory requirements should be quite dramatic. But again, I think he's doing something special with his RNN; I just don't know what it is.
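To make the BPTT worry concrete, here is a minimal sketch of why naive backpropagation through time looks memory-hungry; the hidden size and layer count are assumed, ballpark-only numbers, not RWKV's actual configuration:

```python
# Hedged sketch: full BPTT stores the activations of every timestep, so
# activation memory grows linearly with sequence length.

def bptt_activation_gb(seq_len: int, hidden: int, layers: int,
                       bytes_per_val: int = 2, batch: int = 1) -> float:
    """Rough memory for keeping one hidden state per layer per timestep."""
    return batch * seq_len * hidden * layers * bytes_per_val / 1e9

# Assumed, 14B-scale-ish shape for illustration only
print(bptt_activation_gb(seq_len=4096, hidden=5120, layers=40))  # ~1.7 GB per batch element
# Real training keeps several intermediate tensors per layer, so the true
# figure is a multiple of this unless the recurrence is trained differently.
```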