KingsmanVince t1_irvmgnx wrote
Reply to comment by MohamedRashad in [D] Reversing Image-to-text models to get the prompt by MohamedRashad
In image captioning, to train the model you have to provide text that describes each image. By this definition, "the prompt that made the image" does fall in. One text can produce many images, and one image can be described by many texts, so images and texts have a many-to-many relationship.

For example, to caption a picture of a running dog, people can describe the whole scene. That's still a caption.

Likewise, if I prompt "running dog" and DALL-E 2 draws me a running dog, then yes, that prompt is a caption.
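The many-to-many point above can be sketched in a few lines. This is just an illustration with made-up image IDs and captions, not real training data:

```python
from collections import defaultdict

# Toy (image, caption) pairs illustrating the many-to-many relationship:
# the same caption can describe several images, and the same image can
# carry several captions. All IDs and strings here are invented examples.
pairs = [
    ("img_001", "a running dog"),
    ("img_002", "a running dog"),                         # one caption -> many images
    ("img_001", "a brown dog sprinting across a field"),  # one image -> many captions
    ("img_003", "a cat on a sofa"),
]

captions_per_image = defaultdict(set)
images_per_caption = defaultdict(set)
for image_id, caption in pairs:
    captions_per_image[image_id].add(caption)
    images_per_caption[caption].add(image_id)

# img_001 has two valid captions; "a running dog" matches two images.
print(len(captions_per_image["img_001"]))        # 2
print(len(images_per_caption["a running dog"]))  # 2
```

This is why inverting an image-to-text model can't recover "the" prompt: many prompts map to the same image.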