Submitted by hapliniste t3_yck1sx in MachineLearning
With the success of diffusion models in image generation, I was wondering if doing the same but with text embeddings would make sense.
The idea would be to diffuse the embeddings so they end up slightly off in vector space and position, then learn to correct them, with the same iterative refinement process over multiple passes.
Would that make any sense? I don't think I've heard of research in this area.
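A minimal sketch of what the question describes, assuming standard Gaussian diffusion applied to token embeddings: add scheduled noise to the embedding vectors and train a small network to predict that noise, which can then be applied iteratively at sampling time. All names, dimensions, and the tiny denoiser architecture here are illustrative, not from any particular paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim, seq_len, steps = 100, 32, 8, 50

# Linear noise schedule; alpha_bar[t] is the cumulative signal fraction kept at step t.
betas = torch.linspace(1e-4, 0.05, steps)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

embed = nn.Embedding(vocab, dim)
# Toy denoiser: takes a noisy embedding plus a scalar timestep feature,
# predicts the noise that was added (epsilon-prediction objective).
denoiser = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))

def q_sample(x0, t):
    """Forward process: x_t = sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps."""
    ab = alpha_bar[t].view(-1, 1, 1)
    eps = torch.randn_like(x0)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, eps

opt = torch.optim.Adam(
    list(embed.parameters()) + list(denoiser.parameters()), lr=1e-3
)
tokens = torch.randint(0, vocab, (4, seq_len))  # a dummy batch of token ids

for _ in range(10):  # a few training steps, just to show the loop
    x0 = embed(tokens)
    t = torch.randint(0, steps, (4,))          # random timestep per example
    xt, eps = q_sample(x0, t)                  # corrupt the embeddings
    t_feat = (t.float() / steps).view(-1, 1, 1).expand(-1, seq_len, 1)
    eps_pred = denoiser(torch.cat([xt, t_feat], dim=-1))
    loss = ((eps_pred - eps) ** 2).mean()      # learn to predict the noise
    opt.zero_grad(); loss.backward(); opt.step()
```

At sampling time one would start from pure noise and repeatedly apply the denoiser over the timestep schedule, then round the final vectors back to the nearest token embeddings.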
fastglow t1_itmotdk wrote
It's been applied to text-to-speech: https://arxiv.org/abs/2104.01409