HunteronX t1_j761xqh wrote
The economics is getting there for these models to be big news...
The key features of this work seem to be:
- A multimodal embedding representation obtained from individual modality encoders (patch-level for images, token-level for text), combined via attention.
- Rationales are generated first, and answers are then inferred from them, since generating both jointly reduced answer accuracy.
(Not an expert, but: is the greater % of hallucinated rationales in the baseline case (no vision features) due to the large 'context' needed to produce both rationale + answer without those features?)
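To make the two-stage idea concrete, here's a minimal sketch (toy placeholder functions, not the paper's actual models or API): stage 1 produces a rationale from the question plus vision input, and stage 2 conditions the answer on that generated rationale.

```python
# Hypothetical two-stage inference sketch. The "models" below are toy
# placeholders; in the paper these would be fused vision+text seq2seq models.

def rationale_model(question: str, vision_summary: str) -> str:
    # Placeholder: a real model would generate a chain-of-thought rationale
    # conditioned on text tokens and image patch features.
    return f"Because the image shows {vision_summary}, we can reason about: {question}"

def answer_model(question: str, rationale: str) -> str:
    # Placeholder: the answer is inferred from the rationale, not generated
    # jointly with it.
    return f"Answer inferred from rationale: {rationale}"

def two_stage_infer(question: str, vision_summary: str) -> str:
    rationale = rationale_model(question, vision_summary)
    return answer_model(question, rationale)

print(two_stage_infer("What is the object?", "a red ball"))
```

The design point is simply that the answer stage sees the rationale as extra context, rather than one model emitting rationale and answer in a single pass.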
It seems that multimodal representations (language + n=? other modalities) may be important both for introducing a loose physical grounding that avoids hallucinating merely plausible ideas/suggestions, and for representing the remaining ideas efficiently.
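As a rough illustration of attention-based fusion of token-level text features with patch-level image features (a generic single-head cross-attention sketch in NumPy, not the paper's actual gated fusion code):

```python
# Hypothetical sketch: each text token attends over image patches, producing
# vision-informed text features of the same shape as the text input.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(text_feats, image_feats):
    """text_feats: (T, d) token-level; image_feats: (P, d) patch-level.
    Returns (T, d): per-token weighted sums of patch features."""
    d = text_feats.shape[-1]
    scores = text_feats @ image_feats.T / np.sqrt(d)  # (T, P) similarities
    weights = softmax(scores, axis=-1)                # attention over patches
    return weights @ image_feats                      # (T, d) fused output

rng = np.random.default_rng(0)
text = rng.standard_normal((8, 16))      # 8 text tokens, dim 16
patches = rng.standard_normal((49, 16))  # e.g. a 7x7 patch grid, dim 16
fused = cross_attend(text, patches)
print(fused.shape)  # (8, 16)
```

In a real model the queries/keys/values would go through learned projections, but the shape logic is the same: the fused output stays token-aligned while carrying patch-level visual evidence.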