gmork_13 t1_je9e6wu wrote

I'm wondering the same thing.
In the LoRA paper they compared pros and cons against other adapter methods (and LoRA won out). Though you could technically use both, you'd probably pick one.

Indeed, this adapter wins out over LoRA on adapter weight size, but since we're talking about megabytes the difference is almost negligible (in this scenario). It's a shame they didn't include LoRA training time in their comparison.
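
For reference, here's a minimal sketch of the low-rank update LoRA applies to a frozen linear layer, just to show why the saved adapter weights land in the MB range (the dimensions and rank below are illustrative, not the paper's settings):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (W + B @ A)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init, so the update starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# A 4096x4096 projection at r=8 adds only ~65k trainable params,
# so the saved adapter checkpoint scales with the rank, not the full matrix.
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 65536
```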

They say 1hr on 8*A100, whereas the alpaca-LoRA github says 4-5 hrs on 1*4090.
Eight A100s (assuming the 80GB variant) is 640GB of VRAM, as opposed to the 4090's 24GB - there are also differences in speed, and the alpaca-LoRA repo may have run on an 8-bit quantized base model.

Since the adapter paper says nothing about quantization, I'm assuming it's 640GB of VRAM used on the full fp32 (or fp16?) 7B model for one hour, compared to the alpaca-LoRA repo, which runs in 24GB of VRAM on an int8 7B model for 4-5 hrs.
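
As a rough sanity check on those numbers, a back-of-envelope calculation for the weight memory alone (gradients, optimizer state and activations come on top of this):

```python
# Weight memory for a 7B-parameter model at different precisions (weights only).
params = 7e9
for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB")
# fp32: 28 GB, fp16: 14 GB, int8: 7 GB
```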

They both train on the Stanford Alpaca dataset, but alpaca-LoRA trains for 3 epochs on the cleaned dataset whereas LLaMA-Adapter trains on the full dataset for 5 epochs.
That's a lot of small differences to account for if you're trying to figure out what's faster.
It can be done, but the question remains whether the end result is comparable and whether it was trained to an optimal point.

Since the authors trained alpaca-LoRA themselves, why didn't they report how long alpaca-LoRA took in their comparison table? They trained on the same hardware and dataset, I assume.

If the only differences between this adapter and others are, as they mention in the paper, the gating, zero-init and multi-modality, then the downsides of adapters mentioned in the LoRA paper (inference bottlenecks) might still hold. I'm no expert though.
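
My rough reading of the zero-init gating idea, as a minimal PyTorch sketch - this is not the paper's exact architecture; the prompt length, head count and where the residual is added are my assumptions:

```python
import torch
import torch.nn as nn

class GatedPromptAdapter(nn.Module):
    """Learnable prompt tokens whose contribution is scaled by a gate that
    starts at zero, so the pretrained model's behaviour is untouched at init."""
    def __init__(self, dim: int, prompt_len: int = 10, n_heads: int = 8):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init gating factor
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, seq, dim). Cross-attend from the hidden states to the
        # learnable prompt and add the gated result as a residual.
        prompt = self.prompt.unsqueeze(0).expand(x.shape[0], -1, -1)
        out, _ = self.attn(x, prompt, prompt)
        return x + torch.tanh(self.gate) * out
```

Because the gate is zero at the start, early training can't wreck the pretrained representations, which is the selling point over naively prepending trainable prompts.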

8

gmork_13 t1_je7h3rc wrote

Having started with TF and moved to torch myself, torch was just easier to work with when doing something a bit out of the ordinary. Since then it has gained in popularity, and with popularity come lots of walkthroughs, documentation, video guides and research papers with GitHub repos.

1

gmork_13 t1_je7fmm8 wrote

I'm assuming you don't mean missing values in your dataset.

  1. You can create 'missing' data, but if you create it out of the data you already give to the model, you're sort of doing the work for it. For compute-efficiency reasons you might want to avoid giving it 'unnecessary' data. What counts as unnecessary can be hard to define. Think about what you want the model to grasp in the first place.

  2. I'm not sure what you mean by performing a test. If you were to train a language model, the context of a word would define its meaning. You can always take a model's output probabilities and act on them if you'd like (for instance, if the probability mass is spread over lots of low-probability alternatives, flag or handle it - see the sketch below).
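
A minimal, hypothetical sketch of that last point (the threshold and vocab size are arbitrary):

```python
import torch
import torch.nn.functional as F

def flag_low_confidence(logits: torch.Tensor, threshold: float = 0.3):
    """Given logits of shape (vocab_size,), return the top token, its probability,
    and whether the distribution is too flat to trust (many low-prob options)."""
    probs = F.softmax(logits, dim=-1)
    top_prob, top_idx = probs.max(dim=-1)
    return top_idx.item(), top_prob.item(), top_prob.item() < threshold

# Toy usage with random logits standing in for a real model's output.
logits = torch.randn(32000)
token, p, uncertain = flag_low_confidence(logits)
print(token, round(p, 3), "uncertain" if uncertain else "confident")
```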

1

gmork_13 t1_jc6e3ox wrote

The way I was going to implement it with the ChatGPT API was to store the conversation and have the model itself extract keywords from the conversation so far as it neared the token limit.

Then you can inject the keywords and search the previous conversation.

But this is still nothing like truly extending the actual memory of the model.
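
A rough sketch of what I had in mind, assuming the official openai-python client and tiktoken for token counting - the budget, the number of turns kept verbatim and the prompt wording are all placeholders:

```python
from openai import OpenAI  # official openai-python client
import tiktoken

client = OpenAI()
MODEL = "gpt-3.5-turbo"
TOKEN_BUDGET = 3500  # leave headroom under the model's context limit

enc = tiktoken.encoding_for_model(MODEL)

def n_tokens(messages):
    # Rough count: exact accounting adds a few overhead tokens per message.
    return sum(len(enc.encode(m["content"])) for m in messages)

def summarize_to_keywords(messages):
    """Ask the model itself to compress the conversation so far into keywords."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=messages + [{
            "role": "user",
            "content": "Extract the key topics, names and decisions from this "
                       "conversation as a short comma-separated keyword list."
        }],
    )
    return resp.choices[0].message.content

def maybe_compress(messages):
    # When the running conversation nears the budget, replace the older turns
    # with a keyword summary injected as a system message.
    if n_tokens(messages) > TOKEN_BUDGET:
        keywords = summarize_to_keywords(messages)
        recent = messages[-4:]  # keep the last few turns verbatim
        return [{"role": "system",
                 "content": f"Keywords from earlier conversation: {keywords}"}] + recent
    return messages
```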

1