Comments


Destiny_Knight OP t1_j9iysi7 wrote

The paper: https://arxiv.org/pdf/2302.00923.pdf

The questions: "Our method is evaluated on the ScienceQA benchmark (Lu et al., 2022a). ScienceQA is the first large-scale multimodal science question dataset that annotates the answers with detailed lectures and explanations. It contains 21k multimodal multiple choice questions with rich domain diversity across 3 subjects, 26 topics, 127 categories, and 379 skills. The benchmark dataset is split into training, validation, and test splits with 12726, 4241, and 4241 examples, respectively."
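If anyone wants to sanity-check those split sizes themselves, here's a rough sketch of how you might tally them from the released JSON. The filenames and field names are my assumptions about the release format, so treat it as illustrative:

```python
import json

# Assumed file layout: problem metadata plus a split index shipped as JSON.
# The filenames and keys below are guesses at the release format, not verified.
with open("problems.json") as f:
    problems = json.load(f)      # {question_id: {"subject": ..., "topic": ..., ...}}
with open("pid_splits.json") as f:
    splits = json.load(f)        # {"train": [ids], "val": [ids], "test": [ids]}

for name, ids in splits.items():
    print(name, len(ids))        # expect roughly 12726 / 4241 / 4241

subjects = {problems[pid]["subject"] for pid in problems}
print(len(subjects), "subjects") # should come out to 3
```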

40

sumane12 t1_j9j0pi7 wrote

My guy, correct me if I'm wrong, but doesn't it outperform humans in everything but social sciences?

177

Apollo24_ t1_j9j6zgo wrote

Makes me wonder what it'd look like if it were as big as GPT-3.

56

Spire_Citron t1_j9j7gvn wrote

This is interesting because the questions seem to be reasoning-based, which is much more impressive than some of the other tests AI has done well at. Those tests are more about knowing a lot of information, something you'd expect an LLM to excel at beyond the abilities of a typical human.

25

rising_pho3nix t1_j9j7vs6 wrote

Can't imagine where we'll be after 10 more years of innovation.

109

Bakagami- t1_j9j8djw wrote

No. I haven't seen anyone talking about it because it beat humans; it was always about it beating GPT-3 with less than 1B parameters. Beating humans was just the cherry on top. The paper is "flashy" enough; including experts wouldn't change that. Many papers do include expert performance as well, so it's not a stretch to expect it.

17

IluvBsissa t1_j9j8ubb wrote

I don't get it. Why are they comparing their model's performance to regular humans and not experts, like every other paper? Does it mean these tests are "average difficulty"? I read somewhere that GPT-3.5 scored 55.5% on MMLU, while PaLM was at 75 and human experts at 88.8. How would this CoT model perform on standard benchmarks, then? I feel scammed rn.

7

jugalator t1_j9jadh0 wrote

I think there is still a ton to learn about the usefulness of the training data itself, and how we can figure out what an optimal "fit" for an LLM is. Right now, the big LLMs simply have the kitchen sink thrown at them. Who's to say that will automatically outperform a leaner, high-quality data set? And again, "high quality" for us may be different for an AI.

3

ground__contro1 t1_j9jbr1p wrote

I mean, humans have lived basically our entire lives, as a species and often individually, governed by forces we couldn’t really explain at the time. This still seems different though. Seems like no turning back. But maybe that’s been true for a long time.

11

hylianovershield t1_j9jd6x3 wrote

Could someone explain the significance of this to a normal person? 🤔

13

Lawjarp2 t1_j9jdxur wrote

It's multiple choice; choosing among 4 options is easier because you only have to consider those 4 possibilities and the answer is guaranteed to be one of them. But most conversations are open-ended, with possibilities branching out to insane levels.
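To make that concrete, here's a toy sketch of how multiple-choice scoring collapses the problem: the model only has to rank the 4 given strings rather than generate anything open-ended. The word-overlap scorer is just a stand-in for a real model's log-likelihood:

```python
def score_option(question: str, option: str) -> float:
    """Toy stand-in for the model's log-likelihood of `option` given `question`.
    A real evaluation would sum token log-probs from the LLM instead."""
    q_words = set(question.lower().split())
    return sum(word in q_words for word in option.lower().split())

def answer_multiple_choice(question: str, options: list[str]) -> str:
    # Only 4 candidates to rank; random guessing already gets 25%,
    # a far smaller space than open-ended generation.
    scores = [score_option(question, opt) for opt in options]
    return options[scores.index(max(scores))]

print(answer_multiple_choice(
    "Which force pulls objects toward Earth?",
    ["gravity pulls objects toward Earth", "magnetism", "friction", "inertia"],
))
```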

16

redpnd t1_j9je11m wrote

Not hard for a multimodal model to outperform a text-only model on multimodal tasks.

Although still impressive, imagine what a scaled up version will be able to accomplish!

10

Akimbo333 t1_j9jg1gr wrote

Yeah those multimodal capabilities are intense!

5

Artanthos t1_j9jhm3l wrote

You are setting the bar so that anything less than perfect is failure.

By that standard, most humans would fail. And most experts are only experts in one field, not every field, so they would also fail by your standard.

14

duboispourlhiver t1_j9ji6qe wrote

Yes, the risk is being overfitted to this test. I've read that about the paper too, but haven't taken the time to form my own opinion. I think it's impossible to judge whether this benchmark says anything about the model's quality without studying it for hours.

18

oceanfr0g t1_j9jjagz wrote

Yeah but can it tell fact from fiction

0

Ylsid t1_j9jtgcg wrote

More than one. It takes a lot of skill, time and money, which are hard to come by if you aren't a megacorp. That isn't to say it can't happen, but that it's much more difficult than you may expect.

3

ihrvatska t1_j9jvwty wrote

Or maybe working with GPT-3. GPT-3 could call upon it when it needed a more narrowly focused expert. Perhaps there could be a group of AI systems that work together, each having a specialty.

51

WithoutReason1729 t1_j9jx2u1 wrote

Language models seem to have a way steeper difficulty curve, though. The difference between Stable Diffusion and the image generators from a few years before it is big, but the older models are still good enough to often produce viable output. The difference between a huge language model and a large open-source one is a much bigger gap, because even getting small things wrong can lead to completely unintelligible sentences that were clearly written by a machine.

3

Cryptizard t1_j9jxvg2 wrote

It's not ordinary humans; it's people on Mechanical Turk who are paid to do them as fast as possible and for as little money as possible. They are not motivated to actually think that hard.

5

Queue_Bit t1_j9jyddi wrote

Yeah, for sure, but as technology improves it's just going to get easier and easier. And this technology is likely to get so good that, to a normal person, the difference between the best in the world and "good enough for everyday life" is likely huge.

1

Yngstr t1_j9jzzbv wrote

Am I reading this wrong? Is the dataset used to train this model the same dataset used to test it? Not saying that's not a valid method, but that certainly makes it less impressive vs generalist models that can still get decent scores...

7

Nalmyth t1_j9k0eu2 wrote

General vs specific intelligence.

Think of when you ask your brain what is the answer to 5x5.

Did you add 5 each time or did you do a lookup, or perhaps an approximate answer?

7

mindbleach t1_j9k3p72 wrote

More training is better and smaller networks train faster.

2

shwerkyoyoayo t1_j9k77rq wrote

Is it overfit to the questions on the test? I.e., has it seen the same questions in training?

6

alfor t1_j9kbtxf wrote

I think it's going to go in that direction, as it's more compute-efficient.

Many specialty AI minds working together, like humans do.

Humans also have separation in our minds: left vs. right hemisphere, cortical columns, etc.

We specialize a lot because there is not enough mental capacity in one human to cover every aspect of human knowledge. That's what makes ChatGPT even more impressive: it's not perfect, but it covers such a wide area compared to a human.

1

unholymanserpent t1_j9kd99g wrote

You're right, but that doesn't even begin to scratch the surface of potential issues with this change of status. Screw pets, what's the point of having us around? The gap in intelligence between us and them could be like the difference between us and worms. AI having us as pets is definitely on the optimistic side of things

3

FaceDeer t1_j9kgcvd wrote

Perhaps you could have a specialist AI whose specialty was figuring out which other specialist AI it needs to pass the query to. If each specialist can run on home hardware that could be the way to get our Stable Diffusion moment. Constantly swapping models in memory might slow things down, but I'd be fine with "slow" in exchange for "unfettered."
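Something like this is what I'm picturing; the registry, keyword router, and model class here are all hypothetical stand-ins, but it shows why only one specialist ever needs to sit in memory at a time:

```python
class Specialist:
    """Toy stand-in for a model that can be loaded, queried, and unloaded."""
    def __init__(self, path: str):
        self.path = path                    # in reality: load weights from disk here
    def generate(self, query: str) -> str:
        return f"[{self.path}] answer to: {query}"
    def unload(self):
        pass                                # in reality: free RAM/VRAM

SPECIALISTS = {                             # hypothetical registry of specialist models
    "science": "science-qa-model",
    "code": "code-model",
    "chitchat": "chat-model",
}

def route(query: str) -> str:
    """Stand-in for the router model that decides which specialist gets the query."""
    q = query.lower()
    if any(w in q for w in ("photosynthesis", "velocity", "acid")):
        return "science"
    if "def " in query or "traceback" in q:
        return "code"
    return "chitchat"

def answer(query: str) -> str:
    name = route(query)
    model = Specialist(SPECIALISTS[name])   # only one specialist resides in memory at a time
    try:
        return model.generate(query)
    finally:
        model.unload()                      # free memory before the next query swaps a model in

print(answer("Why does velocity change during free fall?"))
```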

18

BasedBiochemist t1_j9khtzg wrote

I'd be interested to know what social science questions the AI was getting wrong compared to the humans.

2

VeganPizzaPie t1_j9kjuth wrote

But if you have to reason out the correct answer, and do that over and over again, it doesn't matter whether there are 4 options or 1000. Think about it: the bar exam and other postgraduate tests have multiple choice. You think anyone could pass those? Why do they take years of study?

1

EndTimer t1_j9kl706 wrote

We would have to read the study methodology to evaluate how they were testing GPT-3.5's image context.

But in this case, multimodal refers to being trained on not just text (like GPT-3.5), but also images associated with that text.

That seems to have improved their model, which requires substantially fewer parameters while scoring higher, even in text-only domains.
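If it helps, here's a generic sketch of one common way to fuse the two modalities: project the image features into the same space as the text embeddings and let one encoder attend over both. Purely illustrative, not necessarily the exact fusion mechanism the paper uses:

```python
import torch
import torch.nn as nn

class NaiveMultimodalEncoder(nn.Module):
    """Toy fusion: project image features into the text embedding space and
    let a transformer encoder attend over the concatenated sequence."""
    def __init__(self, vocab_size=32000, d_model=256, img_feat_dim=1024):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(img_feat_dim, d_model)   # map image features into token space
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids, img_feats):
        # token_ids: (batch, text_len); img_feats: (batch, n_patches, img_feat_dim)
        text = self.tok_embed(token_ids)
        image = self.img_proj(img_feats)
        fused = torch.cat([text, image], dim=1)   # one sequence containing both modalities
        return self.encoder(fused)

model = NaiveMultimodalEncoder()
out = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 10, 1024))
print(out.shape)   # (2, 26, 256)
```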

4

TinyBurbz t1_j9klnc9 wrote

Wow, a specialized model outperforms a generalized one?

*shocked pikachu*

2

drizel t1_j9kmtd9 wrote

In a year or two (or less even?) we'll definitely (probably) have these running locally.

1

SgathTriallair t1_j9knp1a wrote

Agreed. Stage one was "cogent", stage two was "as good as a human", stage three is "better than all humans". We have already passed stage 2 which could be called AGI. We will soon hit stage 3 which is ASI.

−1

skob17 t1_j9ks9jw wrote

That's an interesting approach. We do the same in audits, where the main host is QA with a broad general knowledge of all processes, but for details they call in the SMEs to show all the details of a specific topic.

5

Denpol88 t1_j9l10kd wrote

Can anyone ask this to Bing and share the answer with us here, please?

1

EndTimer t1_j9l1xxj wrote

Presumably TXT (text context). LAN (language science) questions are unlikely to have many images in their multiple choice questions, and the other science domains and G1-12 probably have mostly text questions.

1

FirstOrderCat t1_j9l8xn9 wrote

Yes, and then reproduce the results from both papers, check the code to see that nothing creative happens in the datasets or during training... and there are far more claims in academia than one has time to verify.

1

Ambiwlans t1_j9lab3g wrote

At this point we don't really know what is bottlenecking. More params is an easyish way to capture more knowledge if you have the architecture and the $$... but there are a lot of other techniques available that increase the efficiency of the parameters.

9

dwarfarchist9001 t1_j9lb1wl wrote

Yes, but how many parameters do you actually need to store all the knowledge you realistically need? Maybe a few billion parameters is enough to store the basics of every concept known to man, and more specific details can be stored in an external file that the neural net can access with API calls.
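That's basically retrieval augmentation. A toy sketch of the idea, with a plain dict standing in for whatever external database or API the model would actually call:

```python
# Toy sketch of "small model + external lookup": the network only needs to
# know *that* a fact can be fetched, not memorize every specific detail.
FACT_STORE = {  # stand-in for an external database / API
    "boiling point of water": "100 °C at 1 atm",
    "speed of light": "299,792,458 m/s",
}

def lookup(query: str):
    for key, value in FACT_STORE.items():
        if key in query.lower():
            return value
    return None

def answer(query: str) -> str:
    fact = lookup(query)
    if fact is not None:
        # In a real system the retrieved fact would be inserted into the
        # model's prompt/context before generation.
        return f"(retrieved) {fact}"
    return "(model falls back to what is stored in its own parameters)"

print(answer("What is the speed of light?"))
```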

5

Ohigetjokes t1_j9lh78h wrote

This reminds me of that Westworld moment where he’s talking about the frustrations trying to emulate humans, until he realized he just needs to use a lot less code (something like 18 lines).

“Turns out we’re just not that complicated.”

3

nillouise t1_j9lhlwo wrote

On GitHub I can only download the base model; is the large model private? But I think the model would be more useful to me if it were a game-playing model instead of a science QA model.

1

Lawjarp2 t1_j9liaa5 wrote

No. Once an LLM gets a keyword, a lot of related material will come up in the probabilities. Also, you can work backwards from the answer options. This makes it easier for an LLM to answer if it's trained for this exact scenario.

2

sprucenoose t1_j9m63h2 wrote

All of humanity compressed into the thin strip of lifeless sand comprising Earth's border between the land and salty depths, with no sustenance except a single alcoholic beverage?

3

AustinJacob t1_j9m8vgk wrote

Considering that GPT-J can run on local hardware with its 6B parameters, this gives me great hope for the future of open-source and non-centralized AI.
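For reference, loading GPT-J locally with Hugging Face transformers looks roughly like this; the model ID is as I remember it and the fp16 weights alone are around 12 GB, so treat it as a sketch rather than a recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-J 6B from EleutherAI; fp16 roughly halves the memory footprint.
model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

inputs = tokenizer("Open-source language models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```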

3

norbertus t1_j9me8dv wrote

A lot of these models are under-trained

https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training

and seem to be forming a type of "lossy" text compression, where their ability to memorize data is both poorly understood, and accomplished using only a fraction of the information-theoretic capacity of the model design

https://arxiv.org/pdf/1802.08232.pdf

Also, as indicated in the first citation above, it turns out that the quality of large language models is more determined by the size and quality of the training set rather than the size of the model itself.
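For a rough sense of scale, the result in that first (Chinchilla) citation works out to roughly ~20 training tokens per parameter for compute-optimal training, which makes "under-trained" easy to eyeball:

```python
# Rough compute-optimal check using the ~20 tokens-per-parameter heuristic
# from the Chinchilla paper (the exact ratio depends on the compute budget).
def chinchilla_optimal_tokens(n_params: float) -> float:
    return 20 * n_params

gpt3_params = 175e9
gpt3_tokens_used = 300e9          # GPT-3's reported training set size, roughly
print(chinchilla_optimal_tokens(gpt3_params) / 1e12)               # ~3.5 trillion tokens "wanted"
print(gpt3_tokens_used / chinchilla_optimal_tokens(gpt3_params))   # ~0.09, i.e. heavily under-trained
```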

1

DukkyDrake t1_j9mgudi wrote

This will usually be the case. A tool optimized and fit for a particular purpose will usually outperform.

2

Prufrock5150_ t1_j9mn6lw wrote

Why does this feel like the next "Getting DOOM to run on a pocket-calculator" challenge lol? Because I am *here* for that.

1

Anenome5 t1_j9n56lk wrote

We learned that you can get the same result from fewer parameters and more training. It's a tradeoff, so I'm not entirely surprised. We can't assume that GPT's approach is the most efficient one out there; if anything it's just brute-force effectiveness, and we should desperately hope that the same or better results can be achieved with much less hardware. And so far it appears that this is the case.

3

13ass13ass t1_j9n98ui wrote

It's fine-tuned on the dataset. No big whoop.

1

monsieurpooh t1_j9nh885 wrote

Anyone who's a staunch opponent of the idea of philosophical zombies (to which I am more or less impartial) could very well be open to the idea that ChatGPT is empathetic. If prompted well enough, it can mimic an empathetic person with great realism. And as long as you don't let it forget the previous conversations it's had or exceed its memory window, it will stay in character and remember past events.
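The bookkeeping for that is simple enough to sketch; the word-count limit and `call_model` placeholder below are stand-ins for the real token limit and whatever chat API you're actually using:

```python
MAX_CONTEXT_WORDS = 3000   # crude stand-in for the model's real token limit

def call_model(messages):
    """Placeholder for the actual LLM API call (e.g. a chat completion endpoint)."""
    return "…model reply…"

history = [{"role": "system", "content": "You are a warm, empathetic listener."}]

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # Drop the oldest non-system turns once the window is exceeded;
    # whatever falls out of the window is genuinely "forgotten".
    while len(history) > 2 and sum(len(m["content"].split()) for m in history) > MAX_CONTEXT_WORDS:
        history.pop(1)
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("I had a rough day at work."))
```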

2

monsieurpooh t1_j9ni0aa wrote

I'm curious how the authors made sure to prevent overfitting. I guess there's always the risk they did, which is why they have those AI competitions where they completely withhold questions from the public until the test is run. Curious to see its performance in those

2

koltregaskes t1_j9nmoh3 wrote

I believe OpenAI or others have said they've concluded that parameters are less important and data is more important. So the models need more data... a lot more. And text data alone won't be enough.

1

Fedude99 t1_j9ps9w4 wrote

Religion is just anything you have faith ("belief") in without understanding the belief justification chains (or even that there is such a thing as different kinds of links in belief justification chains).

Thus, modern atheists are religious as well as they don't actually understand the Science (tm) and "logic" that shapes their beliefs, and they end up in culture war battles no different from early religious wars.

Modern science can no longer even predict what a man or woman is, which is just as simple as predicting what color the sky is. As an atheist myself, it's important to acknowledge the win religion has on this one.

3