_Arsenie_Boca_ t1_jb0sm2c wrote
I have been following your Reddit posts for a while now, but I still don't think I fully understand it. Have you considered writing a paper? It might help people understand the method and could fuel the open-source help you get.
luxsteele t1_jb1b68d wrote
Totally agree.
I have been following this for some time, but I can't fully understand it or explain it to my collaborators.
I work in ML and have quite a bit of experience with transformers, and I still can't fully get it, let alone convince some of my collaborators that it is worth pursuing.
It is paramount that we have a paper that explains this in more detail if we want the community to consider this seriously.
Please do it!
bo_peng OP t1_jb1q5fu wrote
Yes, a paper is coming. Meanwhile you can read https://arxiv.org/abs/2302.13939 (SpikeGPT), which is inspired by RWKV and has plenty of explanations :)
bo_peng OP t1_jb1po7i wrote
Will the 150 lines help? Please read the code first :)
https://github.com/BlinkDL/ChatRWKV/blob/main/RWKV_in_150_lines.py
This is ALL you need for RWKV inference.
And you can read https://arxiv.org/abs/2302.13939 (SpikeGPT), which is inspired by RWKV and has plenty of explanations :)
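For readers who mainly want the shape of it: below is a minimal sketch of the recurrent inference pattern that the 150-line script implements, feeding one token at a time with a small carried state instead of a growing KV cache. The `model.forward(token, state)` API and all names here are hypothetical stand-ins, not the actual classes in the file.

```python
import numpy as np

def sample_greedy(logits):
    # Pick the highest-scoring token (greedy decoding, for simplicity).
    return int(np.argmax(logits))

def generate(model, prompt_tokens, n_new_tokens):
    # `model` is a hypothetical RWKV-style recurrent model exposing
    # forward(token, state) -> (logits, new_state); assumes a non-empty prompt.
    state = None  # the model is assumed to initialize its per-layer state on first call
    for tok in prompt_tokens:          # feed the prompt one token at a time
        logits, state = model.forward(tok, state)
    out = list(prompt_tokens)
    for _ in range(n_new_tokens):      # autoregressive generation, constant memory per step
        tok = sample_greedy(logits)
        out.append(tok)
        logits, state = model.forward(tok, state)
    return out
```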
_Arsenie_Boca_ t1_jb1wjfi wrote
It does help, but it certainly doesn't make everything clear. I am confident I could run inference with it, but my interest is more academic than practical.
What is the magic number 5 all about? It seems to appear all over the code without explanation.
Are the time mixing and channel mixing operations novel or were they introduced by a citable work?
How does the parallelization during training work?
bo_peng OP t1_jb1z3an wrote
5 is the number of hidden states per block (4 for ATT: xx, aa, bb, pp; 1 for FFN: xx).
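Concretely, that suggests a state layout like the sketch below: one (n_layer * 5, n_embd) tensor with one ChannelMixing slot and four TimeMixing slots per block. The exact index order and the role of each slot are my reading of the description above, not a quote from the script.

```python
import torch

n_layer, n_embd = 24, 1024  # example sizes, not tied to any particular checkpoint

# 5 hidden-state vectors per block: 1 for ChannelMixing (xx) and 4 for TimeMixing (xx, aa, bb, pp).
state = torch.zeros(n_layer * 5, n_embd)

def block_state(i):
    """Slice out the 5 state vectors of block i (index order is illustrative)."""
    ffn_xx = state[5 * i + 0]  # previous input seen by ChannelMixing
    att_xx = state[5 * i + 1]  # previous input seen by TimeMixing
    att_aa = state[5 * i + 2]  # running numerator of the weighted average
    att_bb = state[5 * i + 3]  # running denominator
    att_pp = state[5 * i + 4]  # running max exponent, for numerical stability
    return ffn_xx, att_xx, att_aa, att_bb, att_pp
```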
TimeMixing is RWKV.
ChannelMixing is your usual FFN (squared ReLU, as in the Primer paper) with an extra R-gate (novel; I find it helps).
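As a sketch (parameter names are mine, and the token-shift mixing of the current input with the previous timestep is omitted): ChannelMixing is a two-matrix FFN with a squared ReLU, gated elementwise by a sigmoid "receptance" projection.

```python
import torch

def channel_mixing(x, Wk, Wv, Wr):
    # x: (n_embd,); Wk: (n_embd, hidden); Wv: (hidden, n_embd); Wr: (n_embd, n_embd)
    k = torch.relu(x @ Wk) ** 2   # squared ReLU, as in the Primer paper
    r = torch.sigmoid(x @ Wr)     # R-gate (receptance)
    return r * (k @ Wv)           # gate the FFN output channel-wise
```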
Parallelization during training comes from this formula: https://github.com/BlinkDL/RWKV-LM/raw/main/RWKV-formula.png
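In case that image is hard to parse, the core time-mixing quantity, as I read it from the RWKV write-ups (my paraphrase, with w the per-channel decay, u a bonus for the current token, and k_i, v_i the key/value projections), is

$$\mathrm{wkv}_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}.$$

Written as a sum like this, it can be evaluated for every position t at once during training (parallel over the time dimension, GPT-style), while at inference the numerator and denominator are carried as the running aa/bb state, with pp tracking the max exponent for numerical stability.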