
tetrisdaemon OP t1_izjp9nc wrote

I'm looking into it, but I'm guessing it's the CLIP embeddings, so disentanglement might need to happen at that level. Some supporting evidence: even if we set the cross-attention to zero for some words, those words still show up in the final image, which suggests the word representations are already mixed together inside CLIP.
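Here's a minimal sketch of that probe, assuming the Hugging Face diffusers attention-processor API. The class name and the muted token indices are hypothetical, and it skips attention masks and cross-attention normalization for brevity:

```python
import torch

class MutedCrossAttnProcessor:
    """Zero the cross-attention paid to selected prompt-token positions."""

    def __init__(self, muted_token_positions):
        self.muted = muted_token_positions  # e.g. positions of "giraffe"

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        context = encoder_hidden_states if is_cross else hidden_states

        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))

        probs = attn.get_attention_scores(query, key, attention_mask)
        if is_cross:
            probs[:, :, self.muted] = 0.0  # silence the chosen words

        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        out = attn.to_out[0](out)   # output projection
        return attn.to_out[1](out)  # dropout
```

Installed with `pipe.unet.set_attn_processor(MutedCrossAttnProcessor([4]))`, the muted word often still shows up in the output, which is the evidence pointing back at the CLIP embeddings.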


tetrisdaemon OP t1_izi47x8 wrote

For sure, and how linguistics can guide Stable Diffusion to produce better images. For example, if we already understand how objects should relate on the language side (e.g., "a giraffe and a zebra" should probably produce two distinct animals, unlike what we observed in the paper), we can twiddle the attention maps so that the giraffe and the zebra stay separate.
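A rough sketch of that twiddling, under the assumption that we have the cross-attention probabilities in hand and know the prompt positions of the two animal words (the function name and token indices are hypothetical):

```python
import torch

def separate_objects(attn_probs, token_a, token_b, width):
    """Push two tokens' cross-attention to opposite halves of the latent grid.

    attn_probs: cross-attention probabilities of shape
        (batch * heads, pixels, tokens), with pixels laid out
        row-major over a (height, width) latent grid.
    """
    pixels = attn_probs.shape[1]
    cols = torch.arange(pixels, device=attn_probs.device) % width
    left = cols < width // 2

    attn_probs = attn_probs.clone()
    attn_probs[:, ~left, token_a] = 0.0  # e.g. giraffe attends only left
    attn_probs[:, left, token_b] = 0.0   # e.g. zebra attends only right

    # renormalize so each pixel's attention still sums to one
    return attn_probs / attn_probs.sum(-1, keepdim=True).clamp_min(1e-8)
```

Hard left/right masks are just the crudest version; softer spatial priors over the same maps would presumably work too.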
