We’re all waiting for the day that a GPT-3-scale model is released which integrates text, video, images, and audio. We’ve seen some progress on this front, namely Gato, but nothing that has really wowed us yet like ChatGPT or LaMDA. PaLM is really the only exception to this rule, but it was images and text only.

I think we all know this is coming soon. I’m wondering if anyone here is aware of any indications that this is actively being worked on, or has any predictions for release dates, especially for a video model.

A model which can take any combination of video, audio, image, and text tokens as input and output would most likely be very, very remarkable, making ChatGPT look like a toy in comparison.

Comments

Sashinii t1_j82ro4w wrote on February 11, 2023 at 4:57 AM

#1,777,092

They're still being developed. When they're ready, they'll be released to the general public (granted, probably not by the big companies, but there'll be open source versions by Stability AI).

adt t1_j831ml0 wrote on February 11, 2023 at 6:49 AM

#1,777,528

There is an entire world outside of California...

Germany: Luminous 200B multimodal.

China: All of the ERNIE 260B cross-modal stuff.

Yeh, you need The Memo!

MysteryInc152 t1_j83uty8 wrote on February 11, 2023 at 1:13 PM

#1,778,697

Replying to adt (#1,777,528)

Only the 17b and 30b models are multimodal. Still pretty good though for sure.

We also have some recent advances that ground frozen language models to images, namely BLIP-2 and FROMAGe.

Akimbo333 t1_j8433cl wrote on February 11, 2023 at 2:25 PM

#1,779,145

Replying to adt (#1,777,528)

Wow!

ReadSeparate OP t1_j8442mf wrote on February 11, 2023 at 2:33 PM

#1,779,192

Replying to adt (#1,777,528)

This is exactly the comment I was looking for when I made this thread, thanks so much

maskedpaki t1_j853khi wrote on February 11, 2023 at 6:16 PM

#1,781,222

ChatGPT will grow into a multimodal model, I'm guessing. They are updating it every couple of weeks and are charging real money now for Plus. It's going to take off really quickly.

MysteryInc152 t1_j85rgjx wrote on February 11, 2023 at 9:02 PM

#1,782,623

Recently 2 papers were released that dealt with making frozen LLMs multimodal (with code and models released).

BLIP-2 - https://arxiv.org/abs/2301.12597 https://huggingface.co/spaces/Salesforce/BLIP2

And FROMAGe - https://arxiv.org/abs/2301.13823 https://github.com/kohjingyu/fromage