Submitted by cloneofsimo t3_zfkqjh in MachineLearning


TL;DR: People use DreamBooth or textual inversion to fine-tune their own Stable Diffusion models. There is a better way: use LoRA to fine-tune twice as fast, with the end result being less than 4MB. A dedicated CLI, package, and pre-trained models are available at https://github.com/cloneofsimo/lora


Fine-tuned LoRA on Pixar footage, inspired by modern-disney-diffusion


Fine-tuned LoRA on a pop-art style

Thanks to the generous work of Stability AI and Hugging Face, many people have enjoyed fine-tuning Stable Diffusion models to fit their needs and generate higher-fidelity images. However, the fine-tuning process is slow, and it is not easy to find a good balance between the number of steps and the quality of the results.

Also, the final result (a fully fine-tuned model) is very large. Consequently, merging checkpoints to find a user's best fit is a painstakingly SSD-consuming process. Some people work with textual inversion as an alternative, but that is clearly suboptimal: textual inversion only learns a small word embedding, and the resulting images are not as good as those from a fully fine-tuned model.
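For context, the whole trick of textual inversion is learning a single new embedding row. Roughly something like this (a minimal sketch using the Hugging Face transformers API; the `<my-style>` placeholder token and the v1 CLIP checkpoint are chosen purely for illustration):

```python
from transformers import CLIPTextModel, CLIPTokenizer

# Load the text encoder and tokenizer (the v1 CLIP checkpoint, just for illustration).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register a new placeholder token and make room for its embedding row.
tokenizer.add_tokens(["<my-style>"])
text_encoder.resize_token_embeddings(len(tokenizer))

# Freeze the whole encoder; only the embedding table is trainable, and in practice
# the gradients of every row except the new token's are zeroed out each step.
text_encoder.requires_grad_(False)
text_encoder.get_input_embeddings().weight.requires_grad_(True)
```

Everything else about the model stays frozen, which is why the saved artifact is tiny but also why the expressiveness is limited.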

I've managed to make an alternative work out pretty well with Stable Diffusion: adapters. Parameter-efficient adaptation has been around for quite a while now, and LoRA in particular seems to work robustly across many scenarios according to several studies (https://arxiv.org/abs/2112.06825, https://arxiv.org/abs/2203.16329).

LoRA was originally proposed as a method for LLMs, but it is really model-agnostic: it applies anywhere there is room for a low-rank decomposition of the weights (which literally every linear layer has). No one seems to have tried it on Stable Diffusion, other than perhaps NovelAI with their hypernetworks (I'm not sure that counts, since they used a different form of adapter).
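The core idea fits in a few lines. A minimal sketch (not the exact code in my repo; `LoRALinear`, `rank`, and `scale` are illustrative names):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = W x + scale * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                # original weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)  # A: d_in -> r
        self.up = nn.Linear(rank, base.out_features, bias=False)   # B: r -> d_out
        nn.init.normal_(self.down.weight, std=1.0 / rank)
        nn.init.zeros_(self.up.weight)                             # update is exactly zero at step 0
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

Swapping layers like this into the attention projections of the UNet is essentially all the "injection" amounts to; only the small `down` and `up` matrices (a few MB in total) are trained and saved.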

# But is it really good though?

I've tried my best to validate my answer: yes, it's sometimes even better than full fine-tuning. Note that even though we are only fine-tuning about 3MB of parameters, beating full fine-tuning is not surprising: the original paper's benchmarks showed similar results.

What do I mean by better? Well, I could have used a zero-shot FID score on some shifted dataset, but that would take ages, since generating 50,000 images on a single 3090 takes forever.

Instead, I've used Kernel Inception Distance (https://arxiv.org/abs/1801.01401), which has a small standard deviation and can be used reliably as a metric. For the shifted dataset, I gathered 2,358 icon images and trained for 12,000 steps with both full fine-tuning and LoRA fine-tuning. The results are as follows:


LoRA 0.5 means merging only half of the LoRA update into the original model. All runs were initialized from Stable Diffusion 2.0.
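For reference, the KID number being compared here is just an unbiased squared MMD between Inception features under a cubic polynomial kernel. A rough sketch, assuming the two (n, d) feature matrices have already been extracted:

```python
import numpy as np

def kid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Unbiased MMD^2 with the kernel k(x, y) = (x.y / d + 1)^3 on (n, d) Inception features."""
    d = real_feats.shape[1]
    k_rr = (real_feats @ real_feats.T / d + 1.0) ** 3
    k_ff = (fake_feats @ fake_feats.T / d + 1.0) ** 3
    k_rf = (real_feats @ fake_feats.T / d + 1.0) ** 3
    n, m = len(real_feats), len(fake_feats)
    # Drop the diagonal terms for the unbiased estimator.
    mmd2 = (k_rr.sum() - np.trace(k_rr)) / (n * (n - 1)) \
         + (k_ff.sum() - np.trace(k_ff)) / (m * (m - 1)) \
         - 2.0 * k_rf.mean()
    return float(mmd2)
```

In practice the estimate is averaged over several random subsets of the samples, which is where the small reported standard deviation comes from.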

LoRA clearly beats full fine-tuning in terms of KID. But in the end, perceptual results are all that matter, and I think end users will prove their effectiveness. I haven't had enough time to play with these to say anything conclusive about their superiority, but I did train LoRA on three different datasets (vector illustrations, Disney style, pop-art style), which are available in my repo. The end results seem pleasing enough to validate the perceptual quality.
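To make the "LoRA 0.5" idea concrete: merging just bakes the scaled low-rank update back into the frozen weight matrix. A minimal sketch (the function name and argument layout are illustrative, not the repo's actual API):

```python
import torch

@torch.no_grad()
def merge_lora(weight: torch.Tensor, up: torch.Tensor, down: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    """Bake a LoRA update into a base weight matrix.

    weight: (d_out, d_in) frozen weight, up: (d_out, r), down: (r, d_in).
    alpha = 1.0 applies the full update; alpha = 0.5 applies half of it.
    """
    return weight + alpha * (up @ down)
```

Because the update is additive, the same pair of matrices can be merged at any strength without retraining.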

# How fast is it?

Tested on a 3090 with a 5950X CPU, LoRA takes 36 minutes for 12,000 steps, while full fine-tuning takes 1 hour 20 minutes, so it's more than twice as fast. You also save most of the Adam optimizer state, and since most of the parameters don't require gradients, that's extra VRAM saved as well.
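The speed and memory win comes from handing the optimizer only the adapter weights. Roughly (a sketch; the "lora" substring filter is just an assumption about how the injected parameters are named):

```python
import torch
import torch.nn as nn

def lora_optimizer(model: nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze the base model and optimize only the injected LoRA weights."""
    model.requires_grad_(False)
    lora_params = [p for name, p in model.named_parameters() if "lora" in name]
    for p in lora_params:
        p.requires_grad_(True)
    # Adam keeps two moment tensors per trainable parameter, so restricting training
    # to a few MB of adapter weights also shrinks the optimizer state by the same ratio.
    return torch.optim.AdamW(lora_params, lr=lr)
```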

Contributions are welcome! This repo has only been tested on Linux, so if something doesn't work, please open an issue or PR. If you've managed to train your own LoRA model, please share it!


Comments


LetterRip t1_izdam40 wrote

Just tried this and it ran great on a 6GB VRAM card on a laptop with only 16GB of RAM (barely fit into VRAM, using bitsandbytes and xformers I think). I've only tried the corgi example but it seemed to work fine. Trying it with a person now.


LetterRip t1_izdm55i wrote

> Glad it worked for you with such small memory constraints!

Currently training at image size 768, with accumulation steps = 2.

If steps is set to 2000, will it actually run to 4000? It didn't stop at 2000 as expected and is currently over 3500; I figured I'd wait until it passed 4000 before killing it, in case the accumulation steps act as a multiplier. (It went to 3718 and quit, right after I wrote the above.)


hentieDesu t1_izebuz3 wrote

Can you train the model with pics of people's faces like the original Dreambooth?

I will give it a try regardless. Thx! I'll update you guys with the results.


ThatInternetGuy t1_izenxjo wrote

This could be a great middle ground between textual inversion and full-blown Dreambooth. I think it could benefit from saving the text encoder too (about 250MB at half precision).


johnslegers t1_izexocv wrote

End result being less than 4MB?

So this means the finetuned content is saved separately?

What if I don't want that? What if I want it to be merged with the model, as is the case for Dreambooth training?

Is there a way to merge the trained concept with the model itself?


yupignome t1_izfd87n wrote

this looks great, but needs more documentation, as running it as it is doesn't work


sam__izdat t1_iziau6e wrote

What are you having trouble following? I'm not trying to be rude, but it's already a less technical method, because HF's diffusers and accelerate stuff will download everything for you and set it all up. I'd rather it was a little more technical, because it's a bit of a black box.

I was having problems with unhelpful error messages until I updated transformers. I'm still having CUDA illegal memory access errors at the start of training, but I think that's because support for old Tesla GPUs is just fading -- had the same issue with new pytorch trying to run any SD in full precision.


LetterRip t1_izksf4k wrote

It is working, but I need to use prior-preservation loss; otherwise the concept bleeds into all of the words in the phrase. So I'm generating photos for the preservation loss now.
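For reference, prior preservation just adds a second denoising loss on generated "class" images. A rough sketch in the style of the diffusers DreamBooth script, assuming the batch stacks instance and class examples in equal halves:

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(noise_pred: torch.Tensor, noise: torch.Tensor,
                    prior_weight: float = 1.0) -> torch.Tensor:
    """Instance loss plus prior-preservation loss on the 'class' half of the batch."""
    pred_inst, pred_prior = noise_pred.chunk(2, dim=0)
    target_inst, target_prior = noise.chunk(2, dim=0)
    instance_loss = F.mse_loss(pred_inst, target_inst)
    prior_loss = F.mse_loss(pred_prior, target_prior)  # keeps the generic class from drifting
    return instance_loss + prior_weight * prior_loss
```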


Desuka15 t1_izoh1qo wrote

Can you help me with this? I’m a bit lost on it. Please pm me.


[deleted] t1_j0e29x1 wrote

Yeah, so did I, but there's a fuckton of knowledge out there and it gets overwhelming and confusing for new people trying to figure it out. What a fucking dick answer to just be like "go look for it yourself".
