
MysteryInc152 t1_j8ppoiq wrote

I'd rather at least the basic senses (vision and audio) be pretrained as well. We know from Multimodal Chain-of-Thought, as well as Scaling Laws for Generative Mixed-Modal Language Models, that multimodal models far outperform single-modality models on the same data and scale. You won't get that kind of performance gain by offloading those basic senses to outside tools.

https://arxiv.org/abs/2302.00923

https://arxiv.org/abs/2301.03728
