farmingvillein t1_jbkwkgl wrote
Reply to comment by ThePerson654321 in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
> most extraordinary claim I got stuck up on was "infinite" ctx_len.
All RNNs have that capability, on paper. But the question is how well the model actually remembers and utilizes things that happened a long time ago (e.g., things that happened beyond the window that a transformer has). In simpler RNN models, the answer is usually "not very well".
Which doesn't mean that there can't be real upside here--just that it is not a clear slam-dunk, and that it has not been well-studied/ablated. And obviously there has been a lot of work in extending transformer windows, too.
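To make the "on paper" point concrete: a vanilla RNN's entire memory of the past is whatever fits in one fixed-size hidden state. A minimal illustrative sketch (sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Illustrative only: "infinite context" just means the recurrence never stops,
# not that old tokens survive. Everything the GRU knows about the past is
# compressed into a single fixed-size hidden state.
rnn = nn.GRU(input_size=64, hidden_size=256, batch_first=True)

long_seq = torch.randn(1, 10_000, 64)   # 10k "token embeddings"
_, h_n = rnn(long_seq)
print(h_n.shape)                         # torch.Size([1, 1, 256]) -- 10k tokens squeezed into 256 floats
```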
farmingvillein t1_jbk819k wrote
Reply to comment by ThePerson654321 in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
I think it is more likely people have seen it, but dismissed it as a bit quixotic, because the RWKV project has made little effort to iterate in an "academic" fashion (i.e., with rigorous, clear testing, benchmarks, goals, comparisons, etc.). It has obviously done pieces of this, but hasn't been sufficiently well-defined as to make it easy for others to iterate on top of it, from a research POV.
This means that anyone else picking up the architecture is going to have to go through the effort to create the whole necessary research baseline. Presumably this will happen, at some point (heck, maybe someone is doing it right now), but it creates a large impediment to further iteration/innovation.
farmingvillein t1_jbk6nut wrote
Reply to [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
> Based on my comprehension of this model, it appears to offer a distinct set of advantages relative to transformers
What advantages are you referring to, very specifically?
There are theoretical advantages--but it can be a lot of work to prove out that those matter.
There are (potentially) empirical, observed advantages--but there don't seem to be (yet) any claims that are so strong as to suggest a paradigm shift (like Transformers were).
Keep in mind that there is a lot of infrastructure built up to support transformers in an industrial context, which means that even if RWKV shows some small advantage, the advantage may not hold up in practice, because of all the extreme optimizations that have been built to serve larger organizations (in speed of inference, training, etc.).
The most likely adoption path here would be if multiple papers showed, at smaller scale, consistent advantages for RWKV. No one has done this yet--and the performance metrics provided on the github (https://github.com/BlinkDL/RWKV-LM) certainly don't make such an unequivocal claim on performance.
And providing a rigorous side-by-side comparison with transformers is actually really, really hard--apples to apples comparisons are notoriously tricky, and you of course have to be really cautious about thinking about what "tips and tricks" you allow both architectures to leverage.
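For a flavor of what the bare-minimum comparison even looks like--same tokenizer, same held-out batches, matched parameter and token budgets--here is a rough sketch, assuming HuggingFace-style causal LM interfaces; all names are placeholders:

```python
import math
import torch

def heldout_ppl(model, batches, device="cuda"):
    """Perplexity on shared held-out batches (same tokenizer, same context length for every model)."""
    model.eval().to(device)
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for input_ids in batches:                               # each: (batch, seq_len) token ids
            input_ids = input_ids.to(device)
            out = model(input_ids=input_ids, labels=input_ids)  # HF causal LMs return .loss
            total_loss += out.loss.item() * input_ids.numel()
            total_tokens += input_ids.numel()
    return math.exp(total_loss / total_tokens)

# Only meaningful if both checkpoints share data, token budget, param count, and tokenizer:
# ppl_transformer = heldout_ppl(transformer_lm, heldout_batches)
# ppl_rwkv        = heldout_ppl(rwkv_lm, heldout_batches)
```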
Lastly, and this is a fuzzier but, IMO, relevant point--
The biggest guys are crossing into a point where evaluation is suddenly hard again.
By that, what I mean is that there is broad consensus that our current public evaluation metrics don't do a great job of helping us understand how well these models perform on "more interesting" generative tasks. I think you'll probably see some major improvements around eval/benchmark management in the next year or so (and certainly, internally, the big guys have invested a lot here)--but for now, it is harder to pick up a new architecture/model and understand its capabilities on the "more interesting" tasks that your GPT-4s and Bards of the world are trying to demonstrate. This makes it harder to prove and vet progress on smaller models, which of course makes scaling up riskier.
farmingvillein t1_jbk47jg wrote
Reply to comment by LetterRip in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
> I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time.
>
> So prior to a month ago they didn't know it existed or happened to meet their use case.
How does #2 follow from #1?
RWKV has been on reddit for quite a while, and a high number of researchers frequent/lurk on reddit, including Deepmind researchers, so the idea that they had no idea that RWKV exists seems specious.
Unless you mean that you emailed them and they literally told you that they didn't know about this. In which case...good on you!
farmingvillein t1_jbk3esu wrote
Reply to comment by blarg7459 in [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
Yes, which was arguably the key claim of the LLaMa paper.
farmingvillein t1_jbk2uyw wrote
Reply to comment by currentscurrents in [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
> Nobody ever does this though because of diminishing returns.
Extending the LLaMa concept, I would love to see someone like Meta run the experiment where they do take their 1.4T (or w/e) tokens, and run training to convergence...on the largest model that will converge (subject to reasonable LR decay policies) in a "reasonable" time frame.
Meaning, if they trained, say, a 1M param LLM...presumably it would hit convergence (get saturated) pretty quickly. And what about 10M, 100M, etc.?
I.e., how much more can we squeeze out of a relatively-tiny model? Probably it doesn't end up super interesting from a purely generative POV, but it might look like--e.g.--Roberta+.
With a model that is so small, the cost to run this test probably(?) wouldn't be that high.
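Operationally, "train to convergence" here just means: keep feeding tokens until validation loss stops improving for that model size, then move up a size. A toy sketch, with a stand-in model and stand-in synthetic data--only the stopping logic is the point:

```python
import torch
import torch.nn as nn

# Toy version of the "saturate a tiny model, then scale up" loop.
torch.manual_seed(0)
vocab = 1000
tiny_lm = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))  # "1M-param" stand-in

opt = torch.optim.AdamW(tiny_lm.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

def batch():                                   # fake "next token = current + 1" data
    x = torch.randint(0, vocab, (32,))
    return x, (x + 1) % vocab

best_val, bad_evals, patience = float("inf"), 0, 5
for step in range(100_000):
    x, y = batch()
    loss = loss_fn(tiny_lm(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            vx, vy = batch()
            val = loss_fn(tiny_lm(vx), vy).item()
        if val < best_val - 1e-3:
            best_val, bad_evals = val, 0
        else:
            bad_evals += 1
        if bad_evals >= patience:              # "saturated" at this size -- try 10M, 100M, ...
            break
```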
farmingvillein t1_jbk1pv7 wrote
Reply to [D] What is the best way to fine tune a LLM with your own data and build a custom text classifier? by pgalgali
> What is the best way to build a custom text classifier leveraging your own data?
"Best" is subjective, but if you are truly new, check out huggingfaces--it will probably be "easiest" (and still high quality), which is what you need as a beginner.
> Also what is the best starting LLM for this purpose- smaller model like Roberta or larger ones like GPT?
Really depends on how much training hardware you have, and how important it is to be "the best".
Roberta is probably going to be the best starting point, from an effort:return perspective.
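A rough sketch of what the Hugging Face + Roberta route looks like--file names, column names, and hyperparameters are placeholders for your own data (the CSVs need "text" and "label" columns for this to work as written):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Fine-tune RoBERTa as a classifier; everything below is illustrative.
tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
ds = ds.map(lambda b: tok(b["text"], truncation=True, padding="max_length", max_length=256),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
)
trainer.train()
print(trainer.evaluate())
```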
The above all said--
The other thing I'd encourage you to do is to start by just exploring text classification without doing any custom training. Simply take a couple of LLMs off the shelf (gpt-turbo and FLAN-T5-XXL being obvious ones), experiment with how to prompt them well, and evaluate results from there.
This will probably be even faster than training something custom, and will give you a good baseline--even if the cost is higher than you want to pay in production, it will help you understand what behavior can look like, and the inference dollars you pay will likely be a fraction of any production training/inference costs.
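A minimal sketch of that no-training baseline--prompt wording, label set, and model size are all placeholders, with FLAN-T5-large standing in for the XXL variant:

```python
from sklearn.metrics import f1_score
from transformers import pipeline

# Zero-shot classification by prompting an instruction-tuned model; no custom training.
clf = pipeline("text2text-generation", model="google/flan-t5-large")

def classify(text):
    prompt = (f"Classify the sentiment of this review as positive or negative.\n"
              f"Review: {text}\nAnswer:")
    out = clf(prompt, max_new_tokens=5)[0]["generated_text"].strip().lower()
    return out if out in ("positive", "negative") else "negative"  # crude fallback

texts = ["Loved it, would buy again.", "Broke after two days."]
gold = ["positive", "negative"]
preds = [classify(t) for t in texts]
print(f1_score(gold, preds, pos_label="positive"))
```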
If, e.g., you get 60% F1 with a "raw" LLM, then you can/should expect Roberta (assuming you have decent training data) to land somewhere around that (an extremely back-of-envelope estimate; reality can be quite different, of course). If you then go and train a Roberta model and get, say, 30%, you probably did something wrong--or the classification task requires a ton of nuance that is actually really hard, and you really should consider baselining on LLMs.
Good luck!
The biggest takeaway you should have, as a beginner:
- Figure out what lets you get every step of results fastest, and prioritize that. Experimentation is still very much key in this field.
farmingvillein t1_jb18evq wrote
Reply to comment by Toast119 in [R] [N] Dropout Reduces Underfitting - Liu et al. by radi-cho
Yes. In the first two lines of the abstract:
> Introduced by Hinton et al. in 2012, dropout has stood the test of time as a regularizer for preventing overfitting in neural networks. In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
farmingvillein t1_japqcq1 wrote
Reply to comment by Timdegreat in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
> But wouldn't the ChatGPT embeddings still be better? Given that they're cheap, why not use the better option?
Usually, to get the best embeddings, you need to train them somewhat differently than you do a "normal" LLM. So ChatGPT may not(?) be "best" right now, for that application.
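If you just need embeddings today, a model trained specifically for that purpose (e.g., with contrastive, sentence-level objectives) is usually the better starting point than hidden states pulled out of a general-purpose chat model. A quick sketch, with the model name as just one common example, not a recommendation:

```python
from sentence_transformers import SentenceTransformer, util

# A model trained specifically to produce embeddings, rather than a generic causal LM.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([
    "cheap embeddings",
    "inexpensive vector representations",
    "a recipe for tomato soup",
])
print(util.cos_sim(emb[0], emb[1]))  # semantically close pair -> higher similarity
print(util.cos_sim(emb[0], emb[2]))  # unrelated pair -> lower similarity
```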
farmingvillein t1_jajw0yj wrote
Reply to comment by badabummbadabing in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
> The training costs lie in the low millions (10M was the cited number for GPT3), which is a joke compared to the startup costs of many, many industries. So while this won't be something that anyone can train, I think it's more likely that there will be a few big players (rather than a single one) going forward.
Yeah, I think there are two big additional unknowns here:
- How hard is it to optimize inference costs? If--for sake of argument--for $100M you can drop your inference unit costs by 10x, that could end up being a very large and very hidden barrier to entry.
- How much will SOTA LLMs really cost to train in, say, 1-2-3 years? And how much will SOTA matter?
The current generation will, presumably, get cheaper and easier to train.
But if it turns out that, say, multimodal training at scale is critical to leveling up performance across all modes, that could jack up training costs really, really quickly--e.g., think the costs to suck down and train against a large subset of public video. Potentially layer in synthetic data from agents exploring worlds (basically, videogames...), as well.
Now, it could be that the incremental gains to, say, language are not that high--in which case the LLM (at least as these models exist right now) business probably heavily commoditizes over the next few years.
farmingvillein t1_jajtmly wrote
Reply to comment by VertexMachine in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
> Plus if it's a price war... with Google.. that would be stupid
If it is a price war strategy...my guess is that they're not worried about Google.
Or, put another way, if it is Google versus OpenAI, openai is pretty happy about the resulting duopoly. Crushing everyone else in the womb, though, would be valuable.
farmingvillein t1_jadt897 wrote
Reply to comment by deliciously_methodic in [R] Microsoft introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot) by MysteryInc152
FWIW, I was trying to make a more subtle point than OP's response--see my other reply.
farmingvillein t1_jadqg1l wrote
Reply to comment by MysteryInc152 in [R] Microsoft introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot) by MysteryInc152
You're missing the point here, or I wasn't clear--the question isn't whether performance will improve with more params (and, potentially, more data); no doubt there.
The question is whether a model trained at scale on text & images will outperform a model trained at scale solely on text, in the text-only domain (or similarly, the image-only).
To date, all* of the public research (and Kosmos is no different) on multimodal models has shown, at best, multimodal models performing roughly equal to unimodal variants in unimodal domains. And often they are a shade worse (like Kosmos).
(*=unless you count code+natural language.)
The holy grail, of course, is that the two help one another, so that your multimodal variant outperforms the unimodal variants on unimodal tasks. GPT-* gets better at talking to you because it has ingested all of the Youtube videos in the world, e.g.
If you can demonstrate that (and it certainly makes intuitive human sense that this could/should be true), then of course there is a giant truckload of image (including video!) and audio data you can slam into your text models to make text-based scenarios better (and similarly for images, etc.). (And it also more plausibly suggests that massive amounts of synthetic world exploration data could be accretive, too...)
There is a bunch of research (https://arxiv.org/abs/2301.03728 being one of the most exciting) suggesting that this can occur, with enough data/params, but no one has publicly demonstrated it. (And it'd surprise no one, probably, if this was part of GPT-4's or Gato-2's mix.)
farmingvillein t1_jacq4fn wrote
Reply to [R] Microsoft introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot) by MysteryInc152
The language-only performance was pretty meh, comparing the versions with and without images. We'll have to see whether scaling up helps here (other research suggests yes?... but we still need to see proof).
farmingvillein t1_j90m0ab wrote
Reply to comment by adt in [D] Compare open source LLMs by President_Xi_
> For models, see my up-to-date list of models:
Which tab is germane to OP's request?
> but I am specifically refering to performance after finetuning.
So far as I can tell, there is nothing here that is responsive to OP's query. But there is a lot here--perhaps I read too quickly.
farmingvillein t1_j8s7ygo wrote
Reply to comment by gwern in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
This...is pretty astounding. Just have the grace to admit you were wrong, and move on.
> Telling someone to read the Related Works section of every one of a dozen papers in the Related Works section of a paper is a ridiculous thing to suggest
Then how can you possibly say:
> I don't think the Related Works section of that paper provides any useful references.
?
This is hardcore trolling. You can, and frequently do, do better than this.
You are literally pushing posts that are factually incorrect, and that you either know are factually incorrect, or are too lazy to validate either way.
This is the type of thing which blows up post quality in this sub.
> Giving someone a random reference and telling them to manually crawl the literature is not helpful.
This...is ridiculous. This is--traditionally--a very academic-friendly sub. This is how research works. "Here is where you can start a literature review on a bundle of related papers" is an extremely classic response which is generally considered helpful for complex and nuanced questions.
And the underlying issue is actually very complex, as evidenced in part by the fact that your references do not actually answer the question. "Go read related works" can be obnoxious when there are one or two specific papers that do answer the question--but that is not the case here.
> In contrast, the two references I provided directly bore on the question
No they did not. They did not touch at all upon Transformers versus RNNs, which was the question. You've chosen to cherry-pick one slice of the problem and declare victory.
> It's not a strawman.
You don't seem to understand what a strawman is. Strawman:
> an intentionally misrepresented proposition that is set up because it is easier to defeat than an opponent's real argument.
I was not making this argument. You were making this argument. QED, this is a strawman.
farmingvillein t1_j8qj1u7 wrote
Reply to comment by bo_peng in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
> RWKV is the exception. When you look at loss against token position, it is comparable with transformers.
Can you link to what you are referring to? If I missed it in the OP post, my apologies.
farmingvillein t1_j8qipd4 wrote
Reply to comment by gwern in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Let's think step by step:
You:
> I don't think the Related Works section of that paper provides any useful references.
Your own response to the question that was posed:
> https://arxiv.org/abs/1805.04623
>
> https://arxiv.org/abs/1702.04521
There is no possible way that you actually read the Related Works section you dismissed, given that the papers you cited are already covered in the same references you dismissed.
E.g., "Sharp Nearby, Fuzzy Far Away" is directly discussed in the cited "Transformer-XL":
> Empirically, previous work has found that LSTM language models use 200 context words on average (Khandelwal et al., 2018), indicating room for further improvement
> Simply comparing RNNs with and RNNs without memory doesn't tell you anything about how fast the memory fades out and that it never winds up being bigger than a Transformer
I never said this, so I'm not sure what your argument is.
> we know perfectly well that Transformers make excellent use of context windows larger than 50 or 200 tokens (as my two references show)
Neither of the papers you link to (assuming you are talking about your own comment at https://www.reddit.com/r/MachineLearning/comments/1135aew/r_rwkv4_14b_release_and_chatrwkv_a_surprisingly/j8pg3g7/) make any reference to Transformers.
If your claim is that the papers indicated that RNNs have a small window (sure) and that Transformers have a longer one, you're once again arguing (as you seem to be throughout your post) against a strawman. Re-read what I actually wrote:
> in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.
My statement here is an empirical one around performance--which, among other things, is why I reference Dai et al, who (among others!) do a fairly extensive breakdown of empirical performance differences of RNNs- versus transformer-type architectures against long text sequences.
The whole point is that an OP said that RNNs were attractive because of the theoretical infinite context--but my response was that 1) we don't really see that in practice, when we try to measure it directly (as both of our sources point out), and 2) we don't see evidence of superior long-distance behavior when testing against real-world(ish) data sets that should theoretically reward that. And that both of these points are encapsulated if you follow the reference I shared (or, as I noted, most reasonable "long-distance transformer" papers).
(As with all things research...someone may come out with a small modification tomorrow that invalidates everything above--but, for now, it represents the broad public (i.e., non-private) understanding of architecture behaviors.)
farmingvillein t1_j8pni5v wrote
Reply to comment by gwern in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Neither of these offer a comparative look against transformers, although they are certainly a useful look against the limitations of your basic RNN/LSTM.
farmingvillein t1_j8piz80 wrote
Reply to comment by gwern in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Not clear to me what you are looking for here.
> It simply provides doodads people claim help memory without papers showing that the memory doesn't work.
The very first reference I pulled, Graves 2014, specifically compares w/ and w/o memory.
Or Dai et al, which tries to compare against various RNN-style baselines with similar parameters.
Perhaps we're talking past each other?
farmingvillein t1_j8p7qa8 wrote
Reply to comment by maizeq in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Any of the papers that address building NLP for long contexts will tend to have a relevant related works section. E.g., https://arxiv.org/pdf/2109.00301.pdf.
(The one qualifier here is that, at "modern" scale, RNNs have not really been well-tested (since people tend to just use...transformers). So, maaaybe they are actually simply superior. Evidence so far says "doubtful", however (at least for more vanilla implementations).)
farmingvillein t1_j8p7lci wrote
Reply to comment by csreid in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
Neither really work for super long contexts, so it is kind of a moot point.
Both--empirically--end up with bolt-on approaches to enhance memory over very long contexts, so it isn't really clear (a priori) that the RNN has a true advantage here.
farmingvillein t1_j8p269l wrote
Reply to comment by MysteryInc152 in [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model by bo_peng
> I hope more catch on because the lack of a limited context length is a game changer.
I'd be cautious about concluding this, without more testing.
RNNs, in some theoretical sense, support infinite context more easily than N^2 transformers; in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.
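The measurement I have in mind is roughly "loss as a function of token position" over long documents: if average loss stops improving past some position, the model isn't really using the older context. A sketch, assuming a HuggingFace-style causal LM interface:

```python
import torch

def loss_by_position(model, batches, seq_len=4096, bucket=256, device="cuda"):
    """Average next-token loss per position bucket over long sequences. A curve that
    goes flat past some position suggests the extra context isn't actually helping."""
    model.eval().to(device)
    n_buckets = seq_len // bucket
    sums, counts = torch.zeros(n_buckets), torch.zeros(n_buckets)
    ce = torch.nn.CrossEntropyLoss(reduction="none")
    with torch.no_grad():
        for ids in batches:                                          # (batch, seq_len) token ids
            ids = ids.to(device)
            logits = model(input_ids=ids).logits                     # HF-style causal LM assumed
            losses = ce(logits[:, :-1].transpose(1, 2), ids[:, 1:])  # (batch, seq_len - 1)
            for b in range(n_buckets):
                chunk = losses[:, b * bucket : (b + 1) * bucket]
                sums[b] += chunk.sum().cpu()
                counts[b] += chunk.numel()
    return sums / counts   # compare this curve across architectures on the same data
```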
farmingvillein t1_j8ftdg9 wrote
Reply to [D] Is a non-SOTA paper still good to publish if it has an interesting method that does have strong improvements over baselines (read text for more context)? Are there good examples of this kind of work being published? by orangelord234
Some helpful gut checks:
- Do you have reason to believe that your method will scale (with parameters and data)? Maybe (probably) you can't actually test things at Google scale--but if you have good theoretical reasons to believe that your method would be accretive at scale, that is a major +.
Yes, getting things to run really well at small scale can be of (sometimes extreme!) value--but you're simply going to see less interest from reviewers on its own. There have been a bazillion hacky ML methods that turn out to be entirely irrelevant once you scale up substantially, and people are wary of such papers/discussions.
If you've got to go down this path, then make sure to position it explicitly as hyper-optimizing small-scale models (like for mobile).
- Do you have good reasons to believe that the "top" paper plus your method would further boost SOTA? Even better, can you test it to confirm?
If your method is--at its theoretical core--simply a twist on a subset of the methods used in that SOTA paper, then you're going to see much less paper interest, unless you can promise significant improvements in simplicity/efficiency.
> But this "SOTA" paper uses some methods that just don't seem practical for applications at all.
- Can you demonstrate the superiority of your method on some of these other applications? So that you can, e.g., create an SOTA in some sort of subset? That can be helpful.
farmingvillein t1_jbkx0co wrote
Reply to comment by LetterRip in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
I don't understand the relevance here--tape-RNNs != RWKV, unless I misunderstand the RWKV architecture (certainly possible).