Submitted by MBle t3_11v1eu7 in MachineLearning
Hi,
Is there any way to run the LLaMA model (or any other model) such that you only pay per API request?
I wanted to test how LLaMA would do in my specific use case, but when I went to HF Inference Endpoints it said I would have to pay over 3k USD per month (ofc I do not have that much money to spend on a side project).
I would like to test this model by paying on a per-request basis.
currentscurrents t1_jcqzjil wrote
I haven't heard of anybody running LLaMA as a paid API service. I think doing so might violate the license terms against commercial use.
>(or any other) model
OpenAI has a ChatGPT API that costs pennies per request. Anthropic also recently announced an API for their Claude language model, but I haven't tried it.
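For reference, a minimal sketch of a pay-per-request call with the `openai` Python package (assuming `openai>=1.0` is installed and an `OPENAI_API_KEY` environment variable is set; the prompt is just a placeholder):

```python
# Minimal sketch: pay-per-request call to OpenAI's chat completions API.
# Assumes: pip install openai  and  OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # billed per token, no monthly endpoint fee
    messages=[{"role": "user", "content": "Summarize LLaMA in one sentence."}],
)

print(response.choices[0].message.content)
```

You're only billed for the tokens in each request/response, so testing a use case like yours typically costs cents rather than a flat monthly fee for a dedicated endpoint.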