farmingvillein t1_j6iwb5v wrote
I like the big idea, and it is almost certainly indicative of one of the key tools to improve automated programming.
That said, I wish they had avoided the urge to build an intermediate programming language. This is likely unnecessary and is the type of semi-convoluted solution that you only come up with in an academic research lab (or out of true, deep product need--but I think that is highly unlikely to be the case).
My guess is that the same basic result in the paper could have been shown by using Python or Rust or similar as the root language, with a little work (time that could have been freed up by skipping the Harry Potter language development).
They do note:
> We generate 16 Python implementations per high-level plan on 100 randomly sampled problems and find that the performance drops to 6%.
But it isn't well-discussed (unless I skimmed too quickly) as to why a separate language is truly needed. They discuss advantages of Parsel, but there doesn't appear to be a deep ablation on why it is really necessary or where its supposed performance benefits come from, or how those could be enforced in other languages.
There is a bunch of discussion in the appendix, but IMO none of it is very convincing. E.g., Parsel enforces certain conventions around testing and validation...great, let's do that in Python or Rust or similar. Or--leveraging the value of LLMs--through a more natural language interface.
Yes, there is benefit to bridging these gaps in a "universal" manner...but, as per https://xkcd.com/927/, a new programming language is rarely the right solution.
ezelikman t1_j6lx0vm wrote
Hi, author here!
There are a few ways to interpret this question.
The first is, "why generate a bunch of composable small functions - why not generate complete Python/Lean/etc. implementations directly from the high-level sketch?" If you generate 10 complete implementations, you have 10 programs. If you generate 10 implementations of four subfunctions, you have 10,000 programs. By decomposing problems combinatorially, you call the language model less. You can see the benefits in Fig. 6 and our direct compilation ablation. There's also the context window: a hundred 500-token functions from Parsel is a 50,000-token program. You won't get that with Codex alone.
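To make the combinatorial point concrete, here is a minimal sketch, not Parsel's actual compiler. The subfunction names (`parse`, `square`, `total`), the hand-written candidate lambdas standing in for language model samples, and the helpers `compose`, `passes`, and `search` are all hypothetical. It shows that two samples for each of three subfunctions already give 2^3 = 8 candidate programs, in the same way that 10 samples for each of four subfunctions give 10^4 = 10,000.

```python
from itertools import product

# Hypothetical candidates standing in for LLM samples of each subfunction.
# The first candidate in each list is deliberately wrong.
candidates = {
    "parse": [lambda s: list(s), lambda s: [int(x) for x in s.split()]],
    "square": [lambda x: x + x, lambda x: x * x],
    "total": [lambda xs: max(xs), lambda xs: sum(xs)],
}

def compose(parse, square, total):
    """Build one full program (sum of squares of a string of ints)."""
    return lambda s: total([square(x) for x in parse(s)])

# Input/output constraints that a selected program must satisfy.
tests = [("1 2 3", 14), ("4", 16)]

def passes(program):
    """True if the program satisfies every test; crashes count as failures."""
    try:
        return all(program(inp) == out for inp, out in tests)
    except Exception:
        return False

def search(candidates):
    """Try every combination of subfunction candidates until one passes."""
    tried = 0
    for combo in product(*candidates.values()):
        tried += 1
        if passes(compose(*combo)):
            return combo, tried
    return None, tried

# Number of distinct programs obtainable from the sampled subfunctions.
n_combinations = 1
for impls in candidates.values():
    n_combinations *= len(impls)

combo, tried = search(candidates)
print(f"{n_combinations} programs from {sum(map(len, candidates.values()))} samples")
```

Six samples cover eight programs here. The gap between samples drawn and programs covered grows exponentially with the number of subfunctions, which is the sense in which decomposition lets you call the model less.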
Another interpretation is, "why do you need to expose an intermediate language when you can use a more abstract intermediate representation?" You suggest "leveraging the value of LLMs--through a more natural language interface." That's the goal. Parsel is intentionally basically indented natural language w/ unit tests. There's minimal extra syntax for efficiency and generality - ideally, people who've never used Python can understand and write Parsel. The "expert" details here aren't syntax: most people are unfamiliar with the nuances of writing natural language that automatically compiles to code, like the value of comprehensive unit tests.
Another is, "why design a new language instead of writing this as, e.g., a Python library?" My response is we did this too. Internally, Parsel is in Python, and a "Function" class already exists - you can find it on GitHub. Still, you need a process to generate implementations and select one satisfying the constraints, which we call the compiler.
Hope this answers your question!
farmingvillein t1_j6nxa0i wrote
> If you generate 10 complete implementations, you have 10 programs. If you generate 10 implementations of four subfunctions, you have 10,000 programs. By decomposing problems combinatorially, you call the language model less
Yup, agreed--this was my positive reference to "the big idea". Decomposition is almost certainly very key to any path forward in scaling up automated program generation in complexity, and the paper is a good example of that.
> Parsel is intentionally basically indented natural language w/ unit tests. There's minimal extra syntax for efficiency and generality.
I question whether the extra formal syntax is needed, at all. My guess is, were this properly ablated, it probably would not be. LLMs are--in my personal experience, and this is obviously borne out thematically--quite flexible to different ways of representing, say, unit inputs and outputs. Permitting users to specify in a more arbitrary manner--whether in natural language, pseudocode, or extant programming languages--seems highly likely to work equally well, with some light coercion (i.e., training/prompting). Further, natural language allows test cases to be specified in a more general way ("unit tests: each day returns the next day in the week, Sunday=>Monday, ..., Saturday=>Sunday") that LLMs are well-suited to work with. Given LLMs' ability to pick up on context and apply it, as well, there is a good chance that freer-form descriptions of test cases would drive improved performance.
If you want to call that further research--"it was easier to demonstrate the value of hierarchical decomposition with a DSL"--that's fine and understood, but I would call it out as a(n understandable) limitation of the paper and an opportunity for future research.
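For illustration, the day-of-week test described in natural language above expands to seven concrete input/output pairs. The sketch below writes them out by hand in plain Python; the names `DAYS`, `next_day`, and `expected` are hypothetical, and whether an LLM would reliably perform this expansion itself is the open question, not something this snippet shows.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

def next_day(day: str) -> str:
    """Return the day of the week that follows `day`, wrapping at Saturday."""
    return DAYS[(DAYS.index(day) + 1) % len(DAYS)]

# The natural-language spec "Sunday=>Monday, ..., Saturday=>Sunday",
# spelled out as explicit test cases.
expected = {
    "Sunday": "Monday",
    "Monday": "Tuesday",
    "Tuesday": "Wednesday",
    "Wednesday": "Thursday",
    "Thursday": "Friday",
    "Friday": "Saturday",
    "Saturday": "Sunday",
}

for day, following in expected.items():
    assert next_day(day) == following
```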
[deleted] t1_j6j9yun wrote
[deleted]
farmingvillein t1_j6jdazy wrote
This is, at best, a distinction without a difference.
The authors literally describe it as "language".
It gets "compiled".
It generates a "Parsel program".
It holds a distinct learning curve such that a user can be an "expert".
The point here is that it is a unique specification that needs to be separately learned--it asks the user to learn, in essence, a domain-specific language. Or, if you prefer, a domain-specific specification; the point stands either way.
theunixman t1_j6jff5n wrote
We have to learn APIs all the time, and basically they're all DSLs that just don't admit they are so they're even harder.
farmingvillein t1_j6jgv48 wrote
And this isn't a good thing, it is a necessary thing--we do it because someone bundled some logic together and you need to interact with it.
None of this addresses whether or why something like Parsel is necessary as an intermediate step. The authors do very little to justify the necessity of an intermediate representation; there is no meaningful analysis of why it apparently performs better, nor an ablation analysis to try to close the gaps.
The key benefits--like enforced test cases--could, hypothetically, very easily be enforced in something like Python, or many other languages.
And given the massive volumes of training data we have for these other languages, there are a lot of good reasons to think that we should be able to see equal or better behavior than with a wholly manufactured pseudocode (effectively) language.
The paper would have been much more convincing and interesting if, e.g., they started with something like python and progressively added the restrictions that apparently helped Parsel provide higher quality results.
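As a rough sketch of what "enforced test cases in Python" could look like, the decorator below refuses to define a function unless it ships with at least one test case and passes all of them. The name `with_tests` and its behavior are hypothetical, invented here for illustration. This is not how Parsel works and not an existing library API.

```python
def with_tests(cases):
    """Hypothetical decorator: each case is (args_tuple, expected_result).

    Raises at definition time if no cases are given or if any case fails.
    """
    if not cases:
        raise ValueError("at least one test case is required")

    def decorator(fn):
        for args, expected in cases:
            actual = fn(*args)
            if actual != expected:
                raise AssertionError(
                    f"{fn.__name__}{args!r} returned {actual!r}, "
                    f"expected {expected!r}"
                )
        return fn  # the function is only exposed if every case passed

    return decorator

@with_tests([((2, 3), 5), ((0, 0), 0)])
def add(a, b):
    return a + b
```

A generated candidate that fails its cases never becomes callable, so a search over candidates can catch the `AssertionError` and move on to the next sample.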
abcdchop t1_j6m17n8 wrote
wait bro the key benefit is the hierarchical description -- the "language" is just a format for explaining the hierarchical description of the problem in natural language, I think that the improvements you're suggesting pretty much describe the paper itself
farmingvillein t1_j6n4hqy wrote
> wait bro the key benefit is the hierarchical description
agreed
> I think that the improvements you're suggesting pretty much describe the paper itself
Allow users to work in actual unstructured language, or an extant programming language, and I'd agree.
theunixman t1_j6jhf69 wrote
Right, turning it into an actual DSL would be much better, and then you'd have better semantics for the library. But honestly I'm bored talking about aesthetics already, peace.
[deleted] t1_j6jnws8 wrote
[deleted]