
KerfuffleV2 t1_jc1jtg5 wrote

/u/bo_peng

I didn't want to clutter up the issue here: https://github.com/BlinkDL/ChatRWKV/issues/30#issuecomment-1465226569

In case this information is useful for you:

strategy                                          time (s)   tokens/sec   tokens
cuda fp16 *0+ -> cuda fp16 *10                      45.44       1.12        51
cuda fp16 *0+ -> cuda fp16 *5                       43.73       0.94        41
cuda fp16 *0+ -> cuda fp16 *1                       52.70       0.83        44
cuda fp16 *0+ -> cpu fp32 *1                        59.06       0.81        48
cuda fp16i8 *12 -> cuda fp16 *0+ -> cpu fp32 *1     65.41       0.69        45

I ran the tests using this frontend: https://github.com/oobabooga/text-generation-webui

It was definitely using rwkv version 0.3.1.

env RWKV_JIT_ON=1 python server.py \
  --rwkv-cuda-on \
  --rwkv-strategy STRATEGY_HERE \
  --model RWKV-4-Pile-7B-20230109-ctx4096.pth

For each test, I let it generate a few tokens first to warm up, then stopped it and let it generate a decent number of tokens. Hardware: Ryzen 5 1600, 32GB RAM, GeForce GTX 1060 with 6GB VRAM.

Surprisingly, streaming everything as fp16 was still faster than putting 12 fp16i8 layers in VRAM. A 1060 is a pretty old card, so maybe it has unusual behavior dealing with that format. I'm not sure.
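
For reference (I ran the tests through the webui above, not this way), the rwkv pip package takes the same strategy strings directly; a rough, untested sketch:

import os
os.environ["RWKV_JIT_ON"] = '1'   # same JIT toggle as the env var in the command above

from rwkv.model import RWKV

# Strategy strings use the same format as in the table:
# device + dtype per group of layers, chained with '->'.
model = RWKV(
    model='RWKV-4-Pile-7B-20230109-ctx4096',  # path without the .pth extension; the loader appends it
    strategy='cuda fp16 *0+ -> cuda fp16 *10',
)

# One forward step: token ids in, logits and recurrent state out.
out, state = model.forward([187, 510, 1563], None)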


bo_peng OP t1_jc2alfm wrote

Try rwkv 0.4.0 & latest ChatRWKV for 2x speed :)


KerfuffleV2 t1_jc3jith wrote

> Try rwkv 0.4.0 & latest ChatRWKV for 2x speed :)

Nice, that makes a big difference! (And such a small change too.)

The highest speed I've seen so far is with something like cuda fp16i8 *15+ -> cuda fp16 *1, at about 1.17 tps (I originally measured 1.21, but that was a mistake). Even cuda fp16i8 *0+ gets quite acceptable speed (0.85-0.88 tps) and uses only around 1.3GB VRAM.

I saw your response on GitHub. Unfortunately, I don't use Discord, so hopefully it's okay to reply here.


bo_peng OP t1_jc9gf72 wrote

Update ChatRWKV v2 & the pip rwkv package (0.5.0) and set os.environ["RWKV_CUDA_ON"] = '1' for 1.5x speed with fp16i8 (and 10% less VRAM: now 14686MB for 14B instead of 16462MB, so you can put more layers on the GPU).
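
Something like this (sketch; set the flags before the import):

import os
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1'   # compiles the custom CUDA kernel used for the fp16i8 path

from rwkv.model import RWKV        # import after setting the flags so they take effect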


KerfuffleV2 t1_jcadn3g wrote

Unfortunately, it doesn't compile for me: https://github.com/BlinkDL/ChatRWKV/issues/38

I'm guessing that even if you add special support for lower compute capabilities, it will probably cancel out the speed (and maybe size) benefits.


bo_peng OP t1_jcb05e8 wrote

stay tuned :) will fix it


KerfuffleV2 t1_jccb5v1 wrote

Sounds good! The 4bit stuff seems pretty exciting too.

By the way, not sure if you saw it, but it looks like PyTorch 2.0 is close to being released: https://www.reddit.com/r/MachineLearning/comments/11s58n4/n_pytorch_20_our_next_generation_release_that_is/

They seem to be claiming you can just drop in torch.compile() and see benefits with no code changes.
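
If I'm reading it right, the drop-in usage would be roughly this (sketch, with a toy module standing in for the real model):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU())  # stand-in for the real network
compiled = torch.compile(model)   # PyTorch 2.0 drop-in: same call signature as the original module
out = compiled(torch.randn(1, 768))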


bo_peng OP t1_jccc46c wrote

I'm already using torch JIT, so it's close ;)
