HunteronX t1_j761xqh wrote
The economics is getting there for these models to be big news...
The key features of this work seem to be:
- A multimodal embedding representation obtained from individual modality encoders (patch-level for images, token-level for text), combined via attention.
- Rationales are generated first, and answers are then inferred from them, since generating both jointly reduced answer accuracy.
(Not an expert, but: is the greater % of hallucinated rationales in the baseline case (no vision features) due to the large 'context' needed to produce both rationale + answer without those features?)
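To make the two-stage idea concrete, here's a minimal sketch (toy placeholder functions, not the paper's actual models or API): stage 1 produces a rationale from the question plus vision input, and stage 2 conditions the answer on that generated rationale.

```python
# Hypothetical two-stage inference sketch. The "models" below are toy
# placeholders; in the paper these would be fused vision+text seq2seq models.

def rationale_model(question: str, vision_summary: str) -> str:
    # Placeholder: a real model would generate a chain-of-thought rationale
    # conditioned on text tokens and image patch features.
    return f"Because the image shows {vision_summary}, we can reason about: {question}"

def answer_model(question: str, rationale: str) -> str:
    # Placeholder: the answer is inferred from the rationale, not generated
    # jointly with it.
    return f"Answer inferred from rationale: {rationale}"

def two_stage_infer(question: str, vision_summary: str) -> str:
    rationale = rationale_model(question, vision_summary)
    return answer_model(question, rationale)

print(two_stage_infer("What is the object?", "a red ball"))
```

The design point is simply that the answer stage sees the rationale as extra context, rather than one model emitting rationale and answer in a single pass.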
It seems that multimodal representations (language + n=? other modalities) may be important both for introducing a loose physical grounding that avoids hallucinating merely plausible ideas/suggestions, and for representing the remaining ideas efficiently.
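As a rough illustration of attention-based fusion of token-level text features with patch-level image features (a generic single-head cross-attention sketch in NumPy, not the paper's actual gated fusion code):

```python
# Hypothetical sketch: each text token attends over image patches, producing
# vision-informed text features of the same shape as the text input.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(text_feats, image_feats):
    """text_feats: (T, d) token-level; image_feats: (P, d) patch-level.
    Returns (T, d): per-token weighted sums of patch features."""
    d = text_feats.shape[-1]
    scores = text_feats @ image_feats.T / np.sqrt(d)  # (T, P) similarities
    weights = softmax(scores, axis=-1)                # attention over patches
    return weights @ image_feats                      # (T, d) fused output

rng = np.random.default_rng(0)
text = rng.standard_normal((8, 16))      # 8 text tokens, dim 16
patches = rng.standard_normal((49, 16))  # e.g. a 7x7 patch grid, dim 16
fused = cross_attend(text, patches)
print(fused.shape)  # (8, 16)
```

In a real model the queries/keys/values would go through learned projections, but the shape logic is the same: the fused output stays token-aligned while carrying patch-level visual evidence.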