Submitted by Chemont t3_109z8om in MachineLearning
I recently came across "Confident Adaptive Language Modeling," which allows Transformers to exit early during inference and skip the remaining model layers if a token is easy to predict. Is there any research on basically doing the opposite and allowing Transformers to spend more compute on tokens that are very hard to predict?
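The early-exit idea can be sketched in a few lines: after each layer, project the hidden state to a token distribution and stop as soon as the model is confident enough. This is a hypothetical simplification, not the actual CALM implementation (which uses trained per-layer exit classifiers and calibrated thresholds); the `layers`, `classifier`, and `threshold` names here are illustrative.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector
    e = np.exp(x - x.max())
    return e / e.sum()

def early_exit_forward(hidden, layers, classifier, threshold=0.9):
    """Run Transformer layers one at a time; after each layer, project
    the hidden state to a token distribution and exit early once the
    top probability exceeds the confidence threshold.

    Returns the final distribution and the number of layers actually used.
    (Hypothetical sketch of confidence-based early exit, not CALM's
    trained exit classifiers.)"""
    probs = softmax(classifier(hidden))
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        probs = softmax(classifier(hidden))
        if probs.max() >= threshold:
            return probs, i + 1  # confident: skip the remaining layers
    return probs, len(layers)

# Toy demo: each "layer" sharpens the logits, so confidence grows
# and the loop exits before running all four layers.
layers = [lambda h: 2.0 * h] * 4
classifier = lambda h: h  # identity projection for the toy example
probs, n_used = early_exit_forward(np.array([1.0, 0.0, 0.0]), layers, classifier)
```

An "anti-early-exit" variant in the spirit of the question would invert the check: keep applying extra refinement steps (or loop over a block of layers) while confidence stays *below* the threshold, spending more compute only on hard tokens.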
rehrev t1_j414ibv wrote
What does early stopping during inference mean, though?