Submitted by ortegaalfredo t3_11kr20f in MachineLearning
ortegaalfredo OP t1_jbaaqv5 wrote
Reply to comment by SrPeixinho in [R] Created a Discord server with LLaMA 13B by ortegaalfredo
The most important thing is to implement multi-GPU int8 quantization; that would let it run on 4x RTX 3090 cards. Right now it requires 8x RTX 3090s, which is way over my budget.
Or just wait a few days. I'm told some people have 2x A100 cards and will open a 65B model to the public this week.
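(For reference, a minimal sketch of what int8 loading sharded across several GPUs might look like using Hugging Face transformers with bitsandbytes. The checkpoint name and per-GPU memory caps below are placeholder assumptions, not the OP's actual setup.)

```python
# Sketch: load a LLaMA checkpoint with int8 weights sharded across 4 GPUs.
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/llama-30b-hf"  # placeholder checkpoint path

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # bitsandbytes int8 quantization
    device_map="auto",   # shard layers across all visible GPUs
    max_memory={i: "22GiB" for i in range(4)},  # assumed per-3090 cap
)
```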
SpaceCockatoo t1_jblj2so wrote
4-bit quant is already out
ortegaalfredo OP t1_jbov7dl wrote
Tried the 8-bit; the 4-bit for some reason doesn't work for me yet.
The problem is those are very, very slow, about 1 token/sec, compared with the ~100 tokens/s I'm getting on 13B.
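(The throughput numbers above come from rough timing. A sketch of how one might measure tokens/sec, assuming `model` and `tokenizer` are already loaded as in the snippet earlier in the thread:)

```python
import time
import torch

# Sketch: rough tokens/sec measurement for a loaded causal LM.
prompt = "The meaning of life is"
# With device_map="auto", the embedding layer sits on the first GPU,
# so inputs go to cuda:0.
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

start = time.time()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{generated / elapsed:.1f} tokens/sec")
```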