ThePlanckDiver t1_j1zz5ah wrote on December 28, 2022 at 5:41 PM

Ah, yes, because thus far Google/DeepMind have released all of their advanced models such as LaMDA, PaLM, Imagen, Parti, Chinchilla, Gopher, Flamingo, Sparrow, etc. etc.

>[...] a shift towards secrecy and aggressive competition could significantly hinder the pace of innovation.

Or, you know, competition might lead to transforming (no pun intended) these research artifacts into useful products? Google's Code Red sounds like good news to me as an end-user.

What a nonsense article that seems written with the sole intent to shoehorn an ex-Googler's new startup into a post.

blueSGL t1_j20lqfi wrote on December 28, 2022 at 8:09 PM

Google is likely the best positioned for dataset creation.

Text
>Google Search/cache/AMP links/YouTube subtitles

Image
>Image search thumbnails/reCAPTCHA/YouTube

Video
>YouTube

They can afford to give away research because I doubt many can match them on sheer dataset scale alone.

treedmt t1_j218w1d wrote on December 28, 2022 at 10:45 PM

This is an interesting point. Do you think the dataset that Google has is high quality enough to actually train AI? In particular, search queries etc. aren’t mapped to specific answers, which they would need to be to be useful for supervised learning. Maybe I’m missing something?

blueSGL t1_j21ay8x wrote on December 28, 2022 at 10:59 PM

LLMs, which are trained to predict the statistically most likely next token, benefit from more data.

That, along with the truism

"You always find things in the last place you look"

can be a very powerful tool.

There will be some correlation between a search term and its results; otherwise search would be pointless. At a large enough scale, that correlation can sift signal from noise, not only in the search results themselves but also in the deltas between individual search terms.

treedmt t1_j28o94z wrote on December 30, 2022 at 1:19 PM

Surely there’s some trade-off between data quality and data quantity?

E.g., 50 billion high-quality QA pairs may beat 500 billion random Google queries as training data.