farmingvillein

farmingvillein t1_j8frv87 wrote

> not to use language models to interact with the world (which seems trivial to me, sorry),

The best argument here is that "true" intelligence requires "embedded" agents, i.e., agents that can interact with our (or, at least, "a") world (to learn).

Obviously, no one actually knows what will make AGI work, if anything...but it isn't a unique/fringe view OP is suggesting.

1

farmingvillein t1_j7ibgcn wrote

> wrong information from these models is pretty rare

This is not borne out at all by the literature. What are you basing this on?

There are still significant problems--everything from source material being ambiguous ("President Obama today said", "President Trump today said"--who is the U.S. President?) to problems that require chains of logic, where the model happily hallucinates once one link in the chain breaks down.

Retrieval models are conceptually very cool, and seem very promising, but statements like "pretty rare" and "don't have that issue" are nonsense--at least on the basis of published SOTA.

Statements like

> I don't think it needs to be 100% resolved for it to be a viable replacement for a search engine.

are fine--but this is a qualitative value judgment, not something grounded in current published SOTA.

Obviously, if you are sitting at Google Brain and privy to next-gen unpublished solutions, of course my hat is off to you.

12

farmingvillein t1_j7i567e wrote

This is an interesting choice--on the one hand, understandable, on the other, if it looks worse than chatgpt, they are going to get pretty slammed in the press.

Maaaybe they don't immediately care, in that what they are trying to do is head off Microsoft offering something really slick/compelling in Bing. Presumably, then, this is a gamble that Microsoft won't invest in incorporating a "full" chatgpt in their search.

8

farmingvillein t1_j6nxa0i wrote

> If you generate 10 complete implementations, you have 10 programs. If you generate 10 implementations of four subfunctions, you have 10,000 programs. By decomposing problems combinatorially, you call the language model less

Yup, agreed--this was my positive reference to "the big idea". Decomposition is almost certainly very key to any path forward in scaling up automated program generation in complexity, and the paper is a good example of that.
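To make the arithmetic in the quote concrete, here is a minimal sketch. The subfunction names and the string placeholders are made up for illustration; they are not from the paper. It only shows the counting: 10 candidates for each of four subfunctions means 40 samples from the model but 10,000 possible combined programs.

```python
from itertools import product

# Hypothetical setup: four subfunctions, 10 sampled implementations each.
# The names and the string placeholders are illustrative assumptions only.
SUBFUNCTIONS = ("parse", "transform", "validate", "render")
SAMPLES_PER_SUBFUNCTION = 10

candidates = {
    name: [f"{name}_v{i}" for i in range(SAMPLES_PER_SUBFUNCTION)]
    for name in SUBFUNCTIONS
}

# Every way of picking one implementation per subfunction is a distinct program.
programs = list(product(*candidates.values()))

# Number of samples actually requested from the model.
num_model_samples = sum(len(impls) for impls in candidates.values())

print(len(programs), num_model_samples)
```

Whether all of those combinations are worth testing is a separate question; the sketch is only about how fast the candidate space grows relative to the number of model calls.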

> Parsel is intentionally basically indented natural language w/ unit tests. There's minimal extra syntax for efficiency and generality.

I question whether the extra formal syntax is needed at all. My guess is that, were this properly ablated, it probably would not be. LLMs are--in my personal experience, and this is obviously borne out thematically--quite flexible to different ways of representing, say, unit inputs and outputs. Permitting users to specify in a more arbitrary manner--whether in natural language, pseudocode, or extant programming languages--seems highly likely to work equally well, with some light coercion (i.e., training/prompting). Further, natural language allows test cases to be specified in a more general way ("unit tests: each day returns the next day in the week, Sunday=>Monday, ..., Saturday=>Sunday") that LLMs are well-suited to work with. Given LLMs' ability to pick up on context and apply it, as well, there is a good chance that freer-form descriptions of test cases would drive improved performance.

If you want to call that further research--"it was easier to demonstrate the value of hierarchical decomposition with a DSL"--that's fine and understood, but I would call it out as a(n understandable) limitation of the paper and an opportunity for future research.
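For illustration, here is roughly what the weekday spec above expands to once it is made concrete. This is a sketch under my own assumptions: the table is my hand expansion of the free-form description, and both candidate functions are hypothetical stand-ins for model output, not anything from the paper.

```python
# Explicit expansion of the free-form spec:
# "each day returns the next day in the week, Sunday=>Monday, ..., Saturday=>Sunday"
EXPECTED = {
    "Sunday": "Monday",
    "Monday": "Tuesday",
    "Tuesday": "Wednesday",
    "Wednesday": "Thursday",
    "Thursday": "Friday",
    "Friday": "Saturday",
    "Saturday": "Sunday",
}

DAYS = list(EXPECTED)  # insertion order: Sunday first, Saturday last


def next_day(day):
    """Hypothetical candidate that handles the Saturday wraparound."""
    return DAYS[(DAYS.index(day) + 1) % len(DAYS)]


def next_day_no_wrap(day):
    """Hypothetical buggy candidate: forgets to wrap around after Saturday."""
    i = DAYS.index(day)
    return DAYS[i + 1] if i + 1 < len(DAYS) else day


def passes_spec(fn):
    """True only if the candidate matches every expanded case."""
    return all(fn(day) == nxt for day, nxt in EXPECTED.items())
```

The one-line description covers all seven cases, including the wraparound that the buggy candidate misses, which is the sense in which a freer-form spec can be more general than a couple of hand-written input/output pairs.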

4

farmingvillein t1_j6n4hqy wrote

> wait bro the key benefit is the the hierarchical description

agreed

> I think that the improvements your suggesting pretty much describe the paper itself

Allow users to work in actual unstructured language, or an extant programming language, and I'd agree.

1

farmingvillein t1_j6jgv48 wrote

And this isn't a good thing, it is a necessary thing--we do it because someone bundled some logic together and you need to interact with it.

None of this addresses whether or why something like Parsel is necessary as an intermediate step. The authors do very little to justify the necessity of an intermediate representation; there is no meaningful analysis of why it apparently performs better, nor an ablation analysis to try to close the gaps.

The key benefits--like enforced test cases--could, hypothetically, very easily be enforced in something like Python, or many other languages.

And given the massive volumes of training data we have for these other languages, there are a lot of good reasons to think that we should be able to see equal or better behavior than with a wholly manufactured pseudocode (effectively) language.

The paper would have been much more convincing and interesting if, e.g., they started with something like python and progressively added the restrictions that apparently helped Parsel provide higher quality results.
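As a rough sketch of what "enforced test cases in Python" could look like: the decorator below refuses to define a function unless it comes with at least one test case and passes all of them. The name `requires_tests` and the case format are my own assumptions for illustration; this is not how Parsel works internally, just one way the same constraint could be imposed in an existing language.

```python
from typing import Any, Callable, Sequence, Tuple

# A case is (positional args, expected result). This format is an assumption.
Case = Tuple[Tuple[Any, ...], Any]


def requires_tests(cases: Sequence[Case]) -> Callable:
    """Reject a function at definition time unless it passes every case."""
    if not cases:
        raise ValueError("at least one test case is required")

    def decorator(fn: Callable) -> Callable:
        for args, expected in cases:
            actual = fn(*args)
            if actual != expected:
                raise AssertionError(
                    f"{fn.__name__}{args} returned {actual!r}, expected {expected!r}"
                )
        return fn  # only functions that pass all cases are kept

    return decorator


@requires_tests([((2, 3), 5), ((-1, 1), 0)])
def add(a, b):
    return a + b
```

A generated candidate that fails any case raises at definition time, so it can be discarded and resampled, and the whole thing stays in a language the model has already seen a great deal of.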

0

farmingvillein t1_j6jdazy wrote

This is, at best, a distinction without a difference.

The authors literally describe it as "language".

It gets "compiled".

It generates a "Parsel program".

It holds a distinct learning curve such that a user can be an "expert".

The point here is that it is a unique specification that needs to be separately learned--it asks the user to learn, in essence, a domain-specific language. Or, if you prefer, a domain-specific specification; the point stands either way.

4

farmingvillein t1_j6iwb5v wrote

I like the big idea, and it is almost certainly indicative of one of the key tools to improve automated programming.

That said, I wish they had avoided the urge to build an intermediate programming language. This is likely unnecessary and is the type of semi-convoluted solution that you only come up with in an academic research lab (or out of true, deep product need--but I think that is highly unlikely to be the case).

My guess is that the same basic result in the paper could have been shown by using Python or Rust or similar as the root language, with a little work (time that you could have obtained by swapping out effort spent on the Harry Potter language development).

They do note:

> We generate 16 Python implementations per high-level plan on 100 randomly sampled problems and find that the performance drops to 6%.

But it isn't well-discussed (unless I skimmed too quickly) as to why a separate language is truly needed. They discuss advantages of Parsel, but there doesn't appear to be a deep ablation on why it is really necessary or where its supposed performance benefits come from, or how those could be enforced in other languages.

There is a bunch of discussion in the appendix, but IMO none of it is very convincing. E.g., Parsel enforces certain conventions around testing and validation...great, let's do that in Python or Rust or similar. Or--leveraging the value of LLMs--through a more natural language interface.

Yes, there is benefit to bridging these gaps in a "universal" manner...but, as per https://xkcd.com/927/, a new programming language is rarely the right solution.

20

farmingvillein t1_j5utusn wrote

You're probably right, but has anyone built an updated set of benchmarks to compare chatgpt with Google's publicly released numbers? (Maybe yes? Maybe I'm out of the loop?) Chatgpt is sufficiently different than gpt3.5 that I think we'd need to rerun benchmarks to compare.

(And, of course, even if we did, there are open questions of potential data leakage--always a concern, but maybe an extra concern here, since it is unclear whether OpenAI would have prioritized that issue in chatgpt build out. Certainly would have been low on my list, personally.)

1

farmingvillein t1_j3xui1m wrote

No, you can edit your original post and place it in there:

> OpenAI must be super confident about the generality of their AI and Microsoft product integration.

<-- add your link here.

> During weekdays, if you'd like to share a link, place it in a self-post and provide some context.

4

farmingvillein t1_j2awxls wrote

Yes, and the old one was named relatively sanely:

> LAnguage Modeling Broadened to Account for Discourse Aspects

Whereas the new Google paper is a horror show in naming:

> We develop a hybrid LAnguage Model augmented BAckwarD chAining technique, dubbed LAMBADA

7

farmingvillein t1_j0ifmkt wrote

This also would probably be a good way to gather data on where the model may not be working.

If a relatively recent systematic review is giving a different result than a contemporaneous and/or older set of papers, it is probably (would need to verify this empirically) more likely that something is being processed incorrectly.

(Reviews obviously also aren't perfect--but my guess is that you'd find that they are pretty robust indicators of something being off.)

1

farmingvillein t1_j0fh5lg wrote

> If our words came out that way, people would know what you were going to say without even having to say it.

Even if this were true, this would not be correct in any sort of general sense, since every person/agent has its own unique set of (incompletely observable) context that seeds any output.

1