harharveryfunny

harharveryfunny t1_j7kmbzr wrote

OpenAI just got a second-round $10B investment from Microsoft, so that goes a ways ... They are selling API access to GPT for other companies to use however they like, Microsoft has integrated Copilot (also GPT-based, fine-tuned for code generation) into their dev tools, and Microsoft is also integrating OpenAI's LLM tech into Bing. While OpenAI are also selling access to ChatGPT to end users, I doubt that's really going to be a focus for them or a major source of revenue.

1

harharveryfunny t1_j7kjohr wrote

I tried perplexity.ai for the first time yesterday and was impressed by it. While it uses GPT-3.5, it's not exactly comparable to ChatGPT, since it's really an integration of Bing search with GPT-3.5, as you can tell by asking it about current events (and also by asking it about itself!). I'm not sure exactly how they've done the integration, but the gist of it seems to be that GPT/chat is being used as an interface to search, rather than, as with ChatGPT, the content itself being generated by GPT.

Microsoft seem to be following a similar approach, per the Bing/Chat version that popped up and disappeared a couple of days ago. It was able to cite sources, which isn't possible for GPT-generated content, which has no source as such.

2

harharveryfunny t1_j1d5m40 wrote

Yes - not sure if everyone understands this. ChatGPT took GPT-3.5 as a starting point, then added a reinforcement learning stage on top of that which aligned its output with what humans want from a question-answering chat-bot. It's basically the next-generation InstructGPT.

https://arxiv.org/abs/2203.02155

From a quick scan of the Bloomz link, that seems to be just an LLM (i.e. more like GPT-3), not an instruction-tuned/human-aligned chat-bot. There's a huge qualitative difference.

2

harharveryfunny t1_ixcpady wrote

It seems to me the primary learning mode in the brain - what it fundamentally/automatically does via its cortical architecture - is sequence prediction (as in predict the next word). Correspondingly, the primary way we learn language as a child is by listening and copying, and the most efficient language-learning methods for adults have also been found to be immersive.

Reinforcement learning can also be framed in terms of prediction (predicting reward/response), and I suspect this is the way that "learning via advice" (vs experience) works, while noting that learning from experience seems the more fundamental and powerful of the two - note how we learn more easily from our own experience than from the advice of others.

I think reinforcement learning is over-hyped, and that in animals reward-maximization is more a description of behavior (built on predictive mechanisms) than the actual mechanism itself.

As far as ML goes, RL as a mechanism seems a very tricky beast, notwithstanding the successes of DeepMind, whereas predictive transformer-based LLMs are simple to train and ridiculously powerful, exhibiting all sorts of emergent behavior.

I can't see the motivation for wanting to develop RL-based language models - makes more sense to me to do the opposite and pursue prediction-based reward maximization.

1

harharveryfunny t1_itn57pm wrote

They increase the sampling "temperature" (amount of randomness) during the varied-answer generation phase, so they will at least get some variety, but ultimately it's GIGO - garbage in => garbage out.
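For reference, temperature is just a scaling of the logits before the softmax at the sampling step - a minimal sketch with made-up logits:

```python
# Temperature just scales the logits before the softmax at the sampling step:
# T < 1 sharpens the distribution (less random), T > 1 flattens it (more varied output).
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng=np.random.default_rng()):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())     # numerically-stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)    # index of the sampled token

logits = np.array([2.0, 1.0, 0.5, -1.0])      # made-up next-token scores
print([sample_token(logits, t) for t in (0.2, 1.0, 2.0)])
```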

How useful this technique is would seem to depend on the quality of the data it was initially trained on and the quality of the deductions it was able to glean from that. Best case, this might work as a way to clean up its training data by rejecting bogus, conflicting rules it has learnt. Worst case, it'll reinforce bogus chains of deduction and ignore the hidden gems of wisdom!

What's really needed to enable any system to self-learn is to provide feedback from the only source that really matters - reality. Feedback from yourself, based on what you think you already know, might make you more rational, but not more correct!

2

harharveryfunny t1_itn0gfi wrote

They're not scaling up the model; it's more like making the model more consistent when answering questions:

  1. Generate a bunch of different answers to the same question

  2. Assume most common answer to be the right one

  3. Retrain with this question and "correct" answer as an input

  4. Profit

It's kind of like prompt engineering - they're not putting more data or capability into the model, but rather finding out how to (empirically) make the best of what it has already been trained on. I guess outlier-answer rejection would be another way of looking at it.

Instead of "think step by step", this is basically "this step by step, try it a few times, tell me the most common answer", except it can't be done at runtime - requires retraining the model.

1

harharveryfunny t1_irxuxr9 wrote

Yes, I agree about the relative complexity (not that an LSTM doesn't also have a fair bit of structure), but the bitter lesson requires an approach that above all else will scale, which transformers do.

I think many people, myself included, were surprised by the emergent capabilities of GPT-3 and derivatives such as OpenAI Codex ... of course it makes sense how much domain knowledge (about fairy tales, programming, etc, etc) is needed to be REALLY REALLY good at "predict next word", but it was not at all obvious that something as relatively simple as a transformer was sufficient to learn that.

At the end of the day any future architecture capable of learning intelligent behavior will have to have some amount of structure - it needs to be a learning machine, and that machine needs some cogs. Is the transformer more complex than necessary for what it is capable of learning? I'm not sure - it's certainly conceptually pretty minimal.

1

harharveryfunny t1_irvssm1 wrote

It seems transformers really have two fundamental advantages over LSTMs:

  1. By design (specifically to improve on the shortcomings of recurrent models), they are much more efficient to train, since all the positions in a sequence can be processed in parallel rather than one step at a time. Positional encoding also lets transformers deal more precisely with word order, which is critical for language (a sketch of the standard sinusoidal encoding follows this list).
  2. Transformers scale up very successfully. Per Rich Sutton's "Bitter Lesson", generally dumb methods that scale up in terms of ability to usefully absorb compute and data do better than more highly engineered "smart" methods. I wouldn't argue that transformers are any simpler in architecture than LSTMs, but as GPT-3 proved they do scale very successfully - increasing performance while still being relatively easy to train.
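Since positional encoding comes up a lot, here's a minimal numpy sketch of the sinusoidal version from the original transformer paper - just the encoding that gets added to the token embeddings, nothing else of the architecture:

```python
# Minimal numpy sketch of sinusoidal positional encoding ("Attention Is All You Need"):
# a fixed pattern added to the token embeddings so the (order-agnostic) attention
# layers can tell positions apart.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                            # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                         # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                         # odd dims: cosine
    return pe

# Usage: x = token_embeddings + positional_encoding(seq_len, d_model)
```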

The context of your criticism is still valid though. Not sure whether it's fair or not, but I tend to look at DeepMind's recent matrix-multiplication paper like that - they are touting it as a success of "AI" and RL, when really it's not at all apparent what RL is adding here. Surely the tensor-factorization space could equally well have been explored by other techniques such as evolution or even just MCTS.

44

harharveryfunny t1_irtbnwz wrote

GPT-3 isn't an attempt at AI. It's literally just a (very large) language model. The only thing that it is designed to do is "predict next word", and it's doing that in a very dumb way via the mechanism of a transformer - just using attention (tuned via the massive training set) to weight the recently seen words to make that prediction. GPT-3 was really just an exercise in scaling up to see how much better (if at all) a "predict next word" language model could get if the capacity of the model and size of the training set were scaled up.
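To make "using attention to weight the recently seen words" a bit more concrete, here's a minimal single-head sketch of scaled dot-product attention (leaving out the causal mask, multiple heads, and all the other machinery of a real transformer block):

```python
# Minimal single-head sketch of scaled dot-product attention: each position's output is
# a weighted average of the value vectors, with weights given by query/key similarity.
# (No causal mask, no multiple heads, no learned projections shown.)
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays; returns (seq_len, d_k) attended values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over positions
    return weights @ V                                         # weighted sum of values
```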

We would expect GPT-3 to do a good job of predicting the next word in a plausible way (e.g. "the cat sat on the" => mat), since that is literally all it was trained to do, but the amazing, and rather unexpected, thing is that it can do so much more ... Feed it "There once was a unicorn", and it'll start writing a whole fairy tale about unicorns. Feed Codex "Reverse the list order" and it'll generate code to perform that task, etc. These are all emergent capabilities - not things that it was designed to do, but things that it needed to learn to do (and evidently was capable of learning, via its transformer architecture) in order to get REALLY good at its "predict next word" goal.

Perhaps the most mind-blowing Codex capability was the original release demo video from OpenAI, where it had been fed the Microsoft Word API documentation and was then able to USE that information to write code to perform a requested task ("capitalize the first letter of each word", if I remember correctly)... So think about it - it was only designed/trained to "predict next word", yet it is capable of "reading API documentation" to write code to perform a requested task!!!

Now, this is just a language model, not claiming to be an AI or anything else, but it does show you the power of modern neural networks, and perhaps give some insight into the relationship between intelligence and prediction.

DALL-E isn't claiming to be an AI either, and has a simple flow-through architecture. It basically just learns a text embedding, maps it to an image embedding, and then decodes that to the image. To me it's more surprising that something so simple works as well as it does, rather than disappointing that it only works for fairly simple types of compositional requests. It certainly will do its best to render things it was never trained on, but you can't expect it to do very well with things like "two cats wrestling", since it has no knowledge of cats' anatomy, 3-D structure, or how their joints move. What you get is about what you'd expect given what the model consists of. Again, it's a pretty simple flow-through text-to-image model, not an AI.
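Very roughly, that flow-through pipeline has the shape sketched below - every stage here is a random stand-in purely to show the data flow (the real stages are large trained networks, and the decoder is generative rather than a linear map):

```python
# Toy sketch of the flow-through shape: caption -> text embedding -> image embedding
# -> decoded "image". Every stage here is a random stand-in; in the real model each is
# a large trained network, and the decoder is generative rather than a linear map.
import numpy as np

rng = np.random.default_rng(0)
D_TXT, D_IMG, H, W = 512, 512, 64, 64                    # made-up sizes

def encode_text(caption: str) -> np.ndarray:             # stand-in text encoder
    return np.random.default_rng(abs(hash(caption)) % 2**32).normal(size=D_TXT)

prior_W   = rng.normal(size=(D_IMG, D_TXT)) / np.sqrt(D_TXT)   # stand-in "prior"
decoder_W = rng.normal(size=(H * W, D_IMG)) / np.sqrt(D_IMG)   # stand-in decoder

image_emb = prior_W @ encode_text("two cats wrestling")
image = (decoder_W @ image_emb).reshape(H, W)
print(image.shape)                                       # (64, 64)
```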

For any model to begin to meet your expectations of something "intelligent", it's going to have to be designed with that goal in the first place, and that is still in the future. So GPT-3 is perhaps a taste of what is to come... if a dumb language model is capable of writing code(!!!), then imagine what a model that is actually designed to be intelligent should be capable of ...

2

harharveryfunny t1_irslrit wrote

GPT-3 isn't a layered architecture - the proper name for it is a "transformer". It's highly structured. Nowadays there are many different architectures in use.

The early focus on layers was because there are simple things, such as learning an XOR function, that a single-layer network can't do, but originally (50 years ago) no-one knew how a multi-layer network could be trained. The big breakthrough therefore came when, c. 1980, the back-propagation algorithm was invented, which solved the multi-layer "credit assignment" training problem (and also works on any network shape - graphs, etc).
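As an illustration of what back-propagation unlocked, here's a toy numpy sketch of a two-layer network learning XOR - something a single-layer perceptron provably cannot represent (the hidden size, learning rate and step count are arbitrary choices):

```python
# A toy numpy sketch: a two-layer network (one hidden layer) trained with plain
# back-propagation learns XOR, which no single-layer perceptron can represent.
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)            # XOR truth table

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)              # input -> hidden
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)              # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(20000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass (squared-error loss): propagate the error through both layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Small updates, accumulated over many passes through the data
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())    # should be close to [0, 1, 1, 0]
```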

The "modern neural net era" really dates to the ImageNet image recognition competition in 2012 (only 10 years ago!) when a multi-layer neural net beat older non-ANN image recognition approaches by a significant margin. For a number of years after that the main focus of researchers was performing better on this ImageNet challenge with ever more elaborate and deeper multi-layer networks.

Today, ImageNet is really a solved problem, and the focus of neural nets has shifted to other applications such as language translation, speech recognition, generative language models (GPT-3) and recent text-to-image and text-to-video networks. These newer applications are more demanding and have required more sophisticated architectures to be developed.

Note that our brain actually has a lot of structure to it - it's not just one giant graph where any neuron could connect to any other neuron. For example, our visual cortex actually has what is roughly a layered architecture (V1-V5), which is why multi-layer nets have done so well in vision applications.

2

harharveryfunny t1_irrfflr wrote

Well, any scientific or engineering field is going to progress from simple discoveries and techniques to more complex ones, and the same applies to artificial neural networks.

If you look at the history of ANNs, they were originally limited to a single layer ("Perceptron") until the discovery of how multi-layer networks (much more powerful!) could be trained via back-propagation and SGD which is what has led to the amazing capabilities we have today.

The history of ANNs is full of sceptics who couldn't see the promise of where the technology was heading, but rather than being sceptical I think today's amazing capabilities should make you optimistic that they will continue to become more powerful as new techniques continue to be discovered.

2

harharveryfunny t1_irn8jzr wrote

The way a neural net learns is by comparing the neural net's current output, for a given input, to the correct/preferred output, then "back-propagating" this difference (error) information backwards through the net to incrementally update all the weights (that represent what it has learned).

During training you know the correct/preferred output for any input since this is provided by your training data, which consists of (input, output) pairs. For each training pair, and corresponding output error, the network's weights are only updated a *little* bit, since you want to take ALL the training samples into account. You don't want to make the net totally correct for one sample at the expense of being wrong for the others, so the way this is handled is by repeating all the training samples multiple times with small updates until the net has been tweaked to minimize errors for ALL of them.

If we're talking specifically about a language model like GPT-3, then the training data consists of sentences and the training goal is "predict next word" based on what it has seen so far. For example, if one training sample was the sentence "the cat sat on the mat", then after having seen "the cat" the correct output is "sat", and if the net has seen "the cat sat on the", then the correct output is "mat".
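To spell that out, the training pairs for that one sentence look like this (a toy illustration, ignoring tokenization and the fact that real models train on huge corpora, not single sentences):

```python
# Toy illustration: how "predict next word" training pairs fall out of one sentence.
sentence = "the cat sat on the mat".split()

training_pairs = [
    (sentence[:i], sentence[i])      # (context seen so far, correct next word)
    for i in range(1, len(sentence))
]

for context, target in training_pairs:
    print(f"{' '.join(context)!r:>24} -> {target}")
# 'the'                 -> cat
# 'the cat'             -> sat
# ...
# 'the cat sat on the'  -> mat
```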

So, with that background, there are two problems to having GPT-3 learn continuously, not just during training:

  1. When you are (after training) using GPT-3 to generate text, you have no idea what word it *should* output next! You start with an initial "prompt" (sentence) and use GPT-3 to "predict next word"; then, if you want another word of output, you take that first generated word, feed it back in, and have GPT-3 predict the *next* word, and so on (see the sketch of this loop after this list). This isn't like training, where you already know the whole sentence ahead of time - when actually using the model you are just generating one word at a time, and have no idea what *should* come next. Since you have no idea what is correct, you can't derive an error to update the model.

  2. Even if you somehow could come up with an error for each word that GPT-3 generates, how much should you update the weights? Just as during training, you don't want a big update that makes the net correct for the current input but wrong for all other inputs; yet unlike training, you can't handle this by updating a little and then re-presenting the entire training set (plus whatever else you've fed into GPT-3 since then) to make updates for those too. This is what another reply is referring to as the "catastrophic forgetting" problem - how would you, after training, continue learning (i.e. continue updating the model's weights) without disrupting everything it has already learned?
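Here's a minimal sketch of the generation loop from point 1 - `predict_next` is a hypothetical stand-in for the trained model:

```python
# Minimal sketch of the generation loop in point 1: the model only ever predicts one
# next word, and its own output is fed back in to keep going. `predict_next` is a
# hypothetical stand-in for the trained model.
def generate(predict_next, prompt_tokens, max_new_tokens=20, stop_token="<eos>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)   # model sees everything generated so far
        if next_token == stop_token:
            break
        tokens.append(next_token)           # feed the prediction back in as input
    return tokens
# Note there is no "correct" continuation available here, hence no error signal to learn from.
```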

The reason our brains *can* learn continuously is because they are "solving" a different problem. Our brain is also learning to "predict next thing", but in this case the "thing" it is predicting is current sensory inputs/etc - the never-ending stream of reality we are exposed to. So, our brain can predict what comes next, and there always will be an actual next "thing" that is happening/being experienced for it to compare to. If our brain was wrong ("surprised") by what actually happened vs what was predicted, then it can use that to update itself to do better next time.

It's not totally clear how our brains handle the "catastrophic forgetting" problem, but it certainly indicates it is using weights in a bit of a different way to our artificial neural networks. It may be related to the idea of "sparsity".

1

harharveryfunny t1_ir7b7aw wrote

If model load time is the limiting factor, then ONNX runtime speed may be irrelevant. You may need to load the model once and reuse it, rather than loading it each time.
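For example, with ONNX Runtime the usual pattern is to create the session once at startup and reuse it for every request - a minimal sketch (the model path and input handling are placeholders for your setup):

```python
# Minimal sketch: create the ONNX Runtime session once (at startup) and reuse it.
# The model path and input shape are placeholders for your own setup.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")     # pay the model-load cost once
input_name = session.get_inputs()[0].name

def predict(batch: np.ndarray):
    # Each call reuses the already-loaded session, so only inference time is paid here.
    return session.run(None, {input_name: batch})[0]
```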

There's a new runtime (a TensorRT competitor) called TemplateAI available from Facebook that does support CPU and is meant to be very fast, but I don't believe they support ONNX yet, and anyway you're not going to get a 50x speed-up just by switching to a faster runtime on the same hardware.

Another alternative might be to run it in the cloud rather than locally.

3

harharveryfunny t1_ir76pn9 wrote

Well, the gist of it is that they first transform the minimal-multiplications matmul problem into the decomposition of a 3-D tensor into a minimal number of rank-1 factors, then use RL to perform this decomposition by making it a stepwise process, with the reward favouring a minimal number of steps.
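To make the connection concrete: a rank-R decomposition of the matmul tensor corresponds to an algorithm that does the multiplication with R scalar multiplications. The classic example is Strassen's rank-7 scheme for 2x2 matrices (vs the naive 8), which is easy to sanity-check:

```python
# Strassen's algorithm for 2x2 matrices: a rank-7 decomposition of the matmul tensor,
# i.e. 7 scalar multiplications instead of the naive 8. AlphaTensor searches for
# decompositions like this (for larger sizes) via its RL formulation.
import numpy as np

def strassen_2x2(A, B):
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

A, B = np.random.rand(2, 2), np.random.rand(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)   # same result, one fewer multiplication
```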

That said, I don't understand *why* they are doing it this way.

  1. Why solve the indirect decomposition problem, not just directly search for factors of the matmul itself ?

  2. Why use RL rather than some other solution space search method like an evolutionary algorithm? Brute force checking of all solutions is off the table since the search space is massive.

32

harharveryfunny t1_iqrodvi wrote

What you have drawn there is a recurrent model, where there is a feedback loop. The only way that can work is if what you are feeding back into a layer/node comes from the *previous* input/time-step, otherwise you've got an infinite loop.

You can certainly build something like that if you want (the 2nd "layer" being an LSTM or RNN), however it is total overkill if all you are trying to build is a binary classifier. Depending on your input, a single fully-connected layer may work; else add more layers.
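For example, a minimal PyTorch sketch of the simpler feed-forward approach (the input size, hidden width and dummy data are placeholders for whatever your actual problem needs):

```python
# Minimal PyTorch sketch of a small feed-forward binary classifier. The input size,
# hidden width and dummy data are placeholders for whatever your actual data looks like.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32),   # input features -> hidden layer (add more layers if needed)
    nn.ReLU(),
    nn.Linear(32, 1),    # single logit for binary classification
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 16)                     # dummy batch of 8 examples
y = torch.randint(0, 2, (8, 1)).float()    # dummy binary labels
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```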

1