IluvBsissa t1_j9j8ubb wrote

I don't get it. Why are they comparing their model's performance to regular humans and not experts, like every other paper? Does it mean these tests are "average difficulty"? I read somewhere that GPT-3.5 scored 55.5% on MMLU, while PaLM was at 75 and human experts at 88.8. How would this CoT model perform on standard benchmarks, then? I feel scammed rn.

7

ertgbnm t1_j9jgoi9 wrote

Read the questions on ScienceQA. They are "hot dog / not hot dog" type questions.

5