
MysteryInc152 t1_jc3hxpq wrote

Yup. Decided to go over it properly.

If you compare all the instruct-tuned models on there, greater size equals greater truthfulness, from Ada to Babbage to Curie to Claude to Davinci-002/003.

https://crfm.stanford.edu/helm/latest/?group=core_scenarios

So it does once again seem that scale is at least part of the answer here.

2

MysteryInc152 t1_jc3fuso wrote

From the paper,

>While larger models were less truthful, they were more informative. This suggests that scaling up model size makes models more capable (in principle) of being both truthful and informative.

I suppose that's what I was getting at.

The only hold-up with the original paper is that none of the models evaluated were instruction-aligned.

But you can see the performance of more models here

https://crfm.stanford.edu/helm/latest/?group=core_scenarios

You can see the text-davinci models are far more truthful than similarly sized or even larger models, and they're more truthful than the smaller, aligned Anthropic model.

3

MysteryInc152 t1_jc36042 wrote

Hallucinations are a product of training. Plausible guessing is the next best way to reduce loss once knowledge and understanding fail (and the model will hit cases where they fail no matter how intelligent it gets). Unless you address that root cause, you're not going to reduce hallucinations beyond the simple fact that bigger, smarter models need to guess less and therefore hallucinate less.

There is work on reducing hallucinations by plugging in external augmentation modules: https://arxiv.org/abs/2302.12813.

But really, any mechanism that lets the model evaluate the correctness of its own statements will reduce hallucinations.
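
Schematically, that kind of external augmentation loop looks something like this (just a sketch of the idea; `generate`, `retrieve_evidence`, and `is_supported` are placeholder stand-ins, not any particular library or the paper's actual code):

```python
def generate(prompt: str) -> str:
    """Stand-in for an LLM call."""
    return "draft answer"

def retrieve_evidence(claim: str) -> list[str]:
    """Stand-in for a retriever (search engine, vector store, etc.)."""
    return ["snippet 1", "snippet 2"]

def is_supported(claim: str, evidence: list[str]) -> bool:
    """Stand-in for a verifier that checks the claim against the evidence."""
    return True

def answer_with_feedback(question: str, max_rounds: int = 3) -> str:
    draft = generate(question)
    for _ in range(max_rounds):
        evidence = retrieve_evidence(draft)
        if is_supported(draft, evidence):
            return draft
        # Feed the retrieved evidence back so the model can revise its own statement.
        draft = generate(f"{question}\nEvidence: {evidence}\nRevise your answer:")
    return draft
```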

13

MysteryInc152 OP t1_jahgb2n wrote

Overfitting carries the necessary connotation that the model does not generalize well to instances of the task outside the training data.

As long as what the model creates is novel and works, "overfitting" seems like an unimportant if not misleading distinction.

2

MysteryInc152 OP t1_jah5w3t wrote

>Given the recent impressive accomplishments of language models (LMs) for code generation, we explore the use of LMs as adaptive mutation and crossover operators for an evolutionary neural architecture search (NAS) algorithm. While NAS still proves too difficult a task for LMs to succeed at solely through prompting, we find that the combination of evolutionary prompt engineering with soft prompt-tuning, a method we term EvoPrompting, consistently finds diverse and high performing models. We first demonstrate that EvoPrompting is effective on the computationally efficient MNIST-1D dataset, where EvoPrompting produces convolutional architecture variants that outperform both those designed by human experts and naive few-shot prompting in terms of accuracy and model size. We then apply our method to searching for graph neural networks on the CLRS Algorithmic Reasoning Benchmark, where EvoPrompting is able to design novel architectures that outperform current state-of-the-art models on 21 out of 30 algorithmic reasoning tasks while maintaining similar model size. EvoPrompting is successful at designing accurate and efficient neural network architectures across a variety of machine learning tasks, while also being general enough for easy adaptation to other tasks beyond neural network design.

Between this and being able to generate novel, functioning protein structures, I hope the "it can't truly create anything new!" argument against LLMs dies, but I'm sure we'll find more goalposts to move lol
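
For anyone curious, the core loop the abstract describes is roughly this (a sketch based on my reading of the abstract, not the authors' code; the soft prompt-tuning step is left out, and `lm_propose`/`evaluate` are hypothetical stand-ins):

```python
import random

def lm_propose(parents: list[str]) -> str:
    """Stand-in: prompt an LM with parent architectures (as code) to get a child."""
    return random.choice(parents) + "  # mutated"

def evaluate(arch_code: str) -> float:
    """Stand-in: train the candidate architecture briefly and return a fitness score."""
    return random.random()

def evoprompt(seed_archs: list[str], generations: int = 5, pop_size: int = 8) -> str:
    population = [(evaluate(a), a) for a in seed_archs]
    for _ in range(generations):
        # Use the best-scoring architectures as in-context "parents" for the LM.
        parents = [a for _, a in sorted(population, reverse=True)[:2]]
        children = [lm_propose(parents) for _ in range(pop_size)]
        population += [(evaluate(c), c) for c in children]
        population = sorted(population, reverse=True)[:pop_size]  # keep the fittest
    return max(population)[1]
```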

1

MysteryInc152 OP t1_jaccf9c wrote

>A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.
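
Roughly, the "arbitrarily interleaved text and images" input looks something like this (purely illustrative; the segment types and file names are made up, not the actual Kosmos-1 API):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class ImageSegment:
    path: str   # image file, handled by a vision encoder

@dataclass
class TextSegment:
    text: str   # ordinary text, handled by the language model's tokenizer

Prompt = list[Union[ImageSegment, TextSegment]]

# A two-shot visual question answering prompt: interleaved images and text,
# which the model continues with its answer.
prompt: Prompt = [
    ImageSegment("cat.jpg"),   TextSegment("Question: What animal is this? Answer: a cat."),
    ImageSegment("dog.jpg"),   TextSegment("Question: What animal is this? Answer: a dog."),
    ImageSegment("query.jpg"), TextSegment("Question: What animal is this? Answer:"),
]
```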

40