
farmingvillein t1_j004cnd wrote

Yes, it could be a function of RL, or it could be simply how they are sampling from the distribution.

If this is something you truly want to investigate, I'd start by running the same tests with "vanilla" GPT (possibly also avoiding the InstructGPT variant, if you are concerned about RL distortion).

As a bonus, most of the relevant sampling knobs are exposed, so you can make it more or less conservative in how widely it samples from the distribution (this, potentially, is the bigger driver of what you are seeing).
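For reference, something like the following is what I mean (a minimal sketch against the pre-1.0 `openai` Python client; the model names, prompt, and parameter values are just illustrative assumptions):

```python
# Minimal sketch: compare base GPT-3 vs. an instruct-tuned variant, and vary how
# conservatively it samples. Model names and values are illustrative, not prescriptive.
import openai

openai.api_key = "YOUR_KEY"  # placeholder

for model in ("davinci", "text-davinci-003"):   # "vanilla" GPT-3 vs. InstructGPT-style
    for temperature in (0.2, 0.7, 1.0):         # lower = more conservative sampling
        resp = openai.Completion.create(
            model=model,
            prompt="Write one sentence about transformers.",
            max_tokens=64,
            temperature=temperature,
            top_p=1.0,                          # nucleus-sampling knob, if you prefer that lever
        )
        print(model, temperature, resp["choices"][0]["text"].strip())
```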

3

farmingvillein t1_izvq3i8 wrote

> I dont see anything about the input being that.

Again, this has absolutely nothing to do with the discussion here, which is about memory outside of the prompt.

Again, how could you possibly claim this is relevant to the discussion? Only an exceptionally deep lack of conceptual understanding could cause you to make that connection.

4

farmingvillein t1_izvnwdh wrote

...the whole twitter thread, and my direct link to OpenAI, are about the upper bound. The 822 number is irrelevant (given that OpenAI itself tells us that the window is much longer), and the fact that you pulled it tells me that you literally don't understand how transformers or the broader technology works, and that you have zero interest in learning. Are you a Markov chain?

2

farmingvillein t1_izvh1t9 wrote

> How do you figure BlenderBot does that?

BlenderBot paper specifically states that it is a combination of your standard transformer context window and explicit summarization operations.
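Roughly, that approach has the following shape (a hedged sketch, not BlenderBot's actual code; the turn split and the `summarize` stub are assumptions for illustration):

```python
# Sketch of a "summarize older turns, keep recent turns verbatim" memory loop.
# The split point and summarize() stub are illustrative assumptions, not the
# actual BlenderBot (or ChatGPT) implementation.

def summarize(text):
    # Stand-in for a dedicated summarizer model (or an LLM call).
    raise NotImplementedError

def build_prompt(turns, max_recent_turns=10):
    recent = turns[-max_recent_turns:]   # stays verbatim inside the context window
    older = turns[:-max_recent_turns]    # gets compressed into a running summary
    memory = summarize("\n".join(older)) if older else ""
    header = f"Conversation summary so far: {memory}\n\n" if memory else ""
    return header + "\n".join(recent)
```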

> What qualifies as a technique?

Whatever would be needed to replicate the underlying model/system.

It could just be a vanilla transformer n^2 context window, but this seems unlikely--see below.

> Source?

GPT-3's (most recent iteration) context window is 2048 tokens; ChatGPT's is supposedly ~double (https://help.openai.com/en/articles/6787051-does-chatgpt-remember-what-happened-earlier-in-the-conversation).

This, on its own, would suggest some additional optimizations, as n^2 attention against a context window of (presumably) ~4096 tokens gets very expensive and is generally unrealistic.

(More generally, it would be surprising to see a scale-up to a window of that size, given the extensive research already extant on scaling up context windows, while breaking the n^2 bottleneck.)
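Back-of-the-envelope, just for the attention score matrices (head count and dtype here are illustrative assumptions):

```python
# Rough cost of the n^2 attention score matrices alone, per layer.
# Doubling the context window (2048 -> 4096) quadruples this term.
def score_matrix_mib(seq_len, n_heads=16, bytes_per_elem=2):
    return n_heads * seq_len * seq_len * bytes_per_elem / 2**20

for n in (2048, 4096):
    print(f"{n} tokens: ~{score_matrix_mib(n):.0f} MiB of attention scores per layer")
```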

Further, investigation suggests that the "official" story here is either simply incorrect or missing key additional techniques; i.e., in certain experimental contexts, ChatGPT seems to have a window that operates beyond the "official" spec (upwards of another 2x): e.g., see https://twitter.com/goodside/status/1598882343586238464

Like all things, it could be that the answer is simply "more hardware"--but, right now, we don't know for sure, and there have been copious research papers on dealing with this scaling issue more elegantly. At best, we can say that we don't know, and the probabilistic leaning would be that something more sophisticated is going on.

5

farmingvillein t1_izvcehu wrote

> A) The paper tells you all the ingredients.

Maybe, maybe not--expert consensus is probably not. BlenderBot, e.g., uses different techniques to achieve long-term conversational memory. Not clear what techniques ChatGPT is using.

> B) "apparently" means that it isnt a known effect.

General consensus is that there is either a really long context window going on or (more likely) some sort of additional long-term compression technique.

> D) Clearly nobody wants to put in the work to read the blog less the paper

Neither of these addresses the apparent long-term conversational memory improvements observed with ChatGPT--unless it turns out to just be a longer context window (which seems unlikely).

Everyone is tea-leaf reading, if/until OpenAI opens the kimono up, but your opinion is directly contrary to the expert consensus.

5

farmingvillein t1_izi021q wrote

True, but no one has really come up with a better methodology.

The best you can do is train on smaller data + make sure that you can tell yourself a story about how the new technique will still help when data is scaled up (and then hope that you are right).

(The latter is certainly an argument for staying at least semi-current with the literature, as it will help you build an intuition for what might scale up and what probably won't.)

2

farmingvillein t1_ixrvyv1 wrote

I tried a few iterations and the results were...unimpressive...to say the least.

Is this configured to do fewer iterative passes (to save $$$), for example? Totally understand if so, given that it is a public/free interface...just trying to rationalize why things are so meh.

2

farmingvillein t1_ixncbjn wrote

Check out https://twitter.com/cxbln/status/1595652302123454464:

> 🎉Introducing RoentGen, a generative vision-language foundation model based on #StableDiffusion, fine-tuned on a large chest x-ray and radiology report dataset, and controllable through text prompts!

(Not your full problem, but you may find it helpful!)

More generally, you could probably use the same image-to-text techniques that get used to validate a stablediffusion model.

Or, for a really quick-and-dirty solution, you could try using a model like theirs to generate training data (image, text pairs) and train an image => text model (which they do a variant of in the paper).
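E.g., something along these lines with HF diffusers (a sketch only; the checkpoint path and prompts are placeholders, since I'm not assuming their fine-tuned weights are available to you):

```python
# Hedged sketch: use a domain-fine-tuned Stable Diffusion checkpoint (RoentGen-style)
# to mint synthetic (image, report-text) pairs for training an image => text model.
# The checkpoint path and report texts below are placeholders/assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "/path/to/roentgen-style-checkpoint",  # placeholder: local fine-tuned weights
    torch_dtype=torch.float16,
).to("cuda")

report_texts = [
    "Right-sided pleural effusion.",
    "No acute cardiopulmonary abnormality.",
]

pairs = []
for i, text in enumerate(report_texts):
    image = pipe(text, num_inference_steps=50).images[0]
    image.save(f"synthetic_{i}.png")
    pairs.append((f"synthetic_{i}.png", text))  # training data for the image => text model
```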

1

farmingvillein t1_ixgd88a wrote

Yeah, understood, but that wasn't really what was going on here (unless you take a really expansive definition).

They were basically doing a ton of hand-calibration of a very large # of models, to achieve the desired end-goal performance--if you read the supplementary materials, you'll see that they did a lot of very fiddly work to select model output thresholds, build training data, etc.

On the one hand, I don't want to sound overly critical of a pretty cool end-product.

On the other, it really looks a lot more like a "product", in the same way that any gaming AI would be, than a singular (or close to it) AI system which is learning to play the game.

8