
KerfuffleV2 t1_jc1jtg5 wrote

/u/bo_peng

I didn't want to clutter up the issue here: https://github.com/BlinkDL/ChatRWKV/issues/30#issuecomment-1465226569

In case this information is useful for you:

strategy                                          time (s)   tokens/sec   tokens
cuda fp16 *0+ -> cuda fp16 *10                      45.44       1.12        51
cuda fp16 *0+ -> cuda fp16 *5                       43.73       0.94        41
cuda fp16 *0+ -> cuda fp16 *1                       52.70       0.83        44
cuda fp16 *0+ -> cpu fp32 *1                        59.06       0.81        48
cuda fp16i8 *12 -> cuda fp16 *0+ -> cpu fp32 *1     65.41       0.69        45

I ran the tests using this frontend: https://github.com/oobabooga/text-generation-webui

It was definitely using rwkv version 0.3.1.

env RWKV_JIT_ON=1 python server.py \
  --rwkv-cuda-on \
  --rwkv-strategy STRATEGY_HERE \
  --model RWKV-4-Pile-7B-20230109-ctx4096.pth

For each test, I let it generate a few tokens first to warm up, then stopped it and let it generate a decent number of tokens. Hardware: Ryzen 5 1600, 32GB RAM, GeForce GTX 1060 with 6GB VRAM.

Surprisingly, streaming everything as fp16 was still faster than putting 12 fp16i8 layers in VRAM. A 1060 is a pretty old card, so maybe it has unusual behavior dealing with that format. I'm not sure.
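
For reference (I ran the tests through the webui above, not this way), the rwkv pip package takes the same strategy strings directly; a rough, untested sketch:

import os
os.environ["RWKV_JIT_ON"] = '1'   # same JIT toggle as the env var in the command above

from rwkv.model import RWKV

# Strategy strings use the same format as in the table:
# device + dtype per group of layers, chained with '->'.
model = RWKV(
    model='RWKV-4-Pile-7B-20230109-ctx4096',  # path without the .pth extension; the loader appends it
    strategy='cuda fp16 *0+ -> cuda fp16 *10',
)

# One forward step: token ids in, logits and recurrent state out.
out, state = model.forward([187, 510, 1563], None)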


bo_peng OP t1_jc2alfm wrote

Try rwkv 0.4.0 & latest ChatRWKV for 2x speed :)


KerfuffleV2 t1_jc3jith wrote

> Try rwkv 0.4.0 & latest ChatRWKV for 2x speed :)

Nice, that makes a big difference! (And such a small change too.)

The highest speed I've seen so far is with something like cuda fp16i8 *15+ -> cuda fp16 *1, at about 1.17 tps (I originally measured 1.21, but that was a mistake). Even cuda fp16i8 *0+ gets quite acceptable speed (0.85-0.88 tps) and uses only around 1.3GB VRAM.

I saw your response on GitHub. Unfortunately, I don't use Discord, so hopefully it's okay to reply here.


bo_peng OP t1_jc9gf72 wrote

Update ChatRWKV v2 & the pip rwkv package (0.5.0) and set os.environ["RWKV_CUDA_ON"] = '1' for 1.5x speed with fp16i8 (and 10% less VRAM: now 14686MB for 14B instead of 16462MB, so you can put more layers on the GPU).
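
Something like this (sketch; set the flags before the import):

import os
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1'   # compiles the custom CUDA kernel used for the fp16i8 path

from rwkv.model import RWKV        # import after setting the flags so they take effect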


KerfuffleV2 t1_jcadn3g wrote

Unfortunately, it doesn't compile for me: https://github.com/BlinkDL/ChatRWKV/issues/38

I'm guessing that even if you add special support for lower compute capabilities, it will probably cancel out the speed (and maybe size) benefits.


bo_peng OP t1_jcb05e8 wrote

stay tuned :) will fix it


KerfuffleV2 t1_jccb5v1 wrote

Sounds good! The 4bit stuff seems pretty exciting too.

By the way, not sure if you saw it, but it looks like PyTorch 2.0 is close to being released: https://www.reddit.com/r/MachineLearning/comments/11s58n4/n_pytorch_20_our_next_generation_release_that_is/

They seem to be claiming you can just drop in torch.compile() and see benefits with no code changes.
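
If I'm reading it right, the drop-in usage would be roughly this (sketch, with a toy module standing in for the real model):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU())  # stand-in for the real network
compiled = torch.compile(model)   # PyTorch 2.0 drop-in: same call signature as the original module
out = compiled(torch.randn(1, 768))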


bo_peng OP t1_jccc46c wrote

I'm already using torch JIT, so it's close ;)
