juniperking t1_j4mma6c wrote

>General comment: it's surprising to me that there aren't any instabilities introduced by stapling models together like this. If someone had come up to me with this description of an architecture several years ago, I would have told them that it was too complicated to work. Not sure what about my intuitions I should change in response to observing that this works despite them.

Probably the most important thing that makes model configurations like this work is that the component models are very large and generalizable. A lot of prior research focuses on fine-tuning for a specific task or dataset, but the fact that CLIP (for example) is able to learn generalized text + image embeddings across multiple domains is what lets the downstream training work; a sketch of that pattern follows.
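To make the point concrete, here is a minimal sketch (not from the comment itself) of the "stapling" pattern it describes: a frozen, pretrained CLIP text encoder supplies generalized embeddings to a small, newly trained downstream head. The model checkpoint and the 10-way classifier head are illustrative assumptions, not anything specified above.

```python
# Minimal sketch: train only a small head on top of frozen CLIP embeddings.
# Checkpoint name and head size are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

clip.eval()
for p in clip.parameters():
    p.requires_grad = False  # freeze CLIP; only the head below is trained

# A hypothetical downstream head, e.g. a 10-way classifier.
head = nn.Linear(clip.config.projection_dim, 10)

texts = ["a photo of a dog", "a photo of a cat"]
inputs = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    emb = clip.get_text_features(**inputs)  # generalized text embeddings

logits = head(emb)  # only this new component receives gradient updates
```

Because the frozen encoder already generalizes across domains, the downstream component sees stable, meaningful features regardless of task, which is one plausible reason the composed system avoids the instabilities the parent comment expected.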