currentscurrents t1_j6m3ik5 wrote

We could make models with trillions of parameters, but we wouldn't have enough data to train them. Multimodality definitely allows some interesting things, but all existing multimodal models still require billions of training examples.

More efficient architectures must be possible - evolution has probably discovered one of them.

1

currentscurrents t1_j6jbokk wrote

I think hallucination comes from the next-word-prediction task these models were trained on. No matter how good a model gets, it can never predict the irreducible-entropy part of the sentence - the ~1.5 bits per word (or whatever the real figure is) that carry the actual information content. The best it can do is guess.

This is exactly what hallucination looks like: the sentence structure is all right, but the information is wrong. Unfortunately, the information is also the part of the sentence that matters most.
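A rough way to put that in symbols (just the standard cross-entropy decomposition, my notation):

```latex
% Expected next-token loss of a model q against the true distribution p:
\mathbb{E}_{x \sim p}\bigl[-\log q(x)\bigr]
  \;=\; \underbrace{H(p)}_{\text{irreducible entropy}}
  \;+\; \underbrace{D_{\mathrm{KL}}(p \,\Vert\, q)}_{\text{model error}}
% Training can only shrink the KL term. Even a perfect model (q = p) is left
% with H(p), so it still has to pick among plausible continuations, i.e. guess.
```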

4

currentscurrents t1_j6btqta wrote

Frankly though, there's got to be a way to do it with less data. The typical human brain has heard maybe a million words of English and seen about 8,000 hours of video per year of life (and that's counting dreams as generative training data somehow - closer to 6,000 hours if you only count the waking world).
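Rough sanity check on those hour figures (back-of-envelope, nothing more):

```python
# Hours of visual experience per year of life, give or take.
hours_per_year = 24 * 365         # 8760 -- roughly the "8,000 hours" above, if dreams count
waking_hours_per_year = 16 * 365  # 5840 -- assuming ~16 waking hours a day
print(hours_per_year, waking_hours_per_year)
```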

We need something beyond transformers. They were a great breakthrough in 2017, but we're not going to get to AGI just by scaling them up.

6

currentscurrents OP t1_j67lie8 wrote

I'm messing around with it to see if I can scale it to a non-toy problem, and maybe adapt it to one of the major architectures like CNNs or transformers. I'm not sitting on a ton of compute, though - it's just me and my RTX 3060.

A variant paper, Predictive Forward-Forward, claims performance equal to backprop. They operate the model in a generative mode to create the negative data.
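For anyone curious what the core of forward-forward looks like, here's a minimal sketch of a single layer (my own toy code, not Hinton's or the PFF authors'; it assumes the usual "goodness = sum of squared activations" objective, with made-up names and hyperparameters):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One forward-forward layer: trained purely locally, no backprop between layers."""

    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Length-normalize the input so only its direction carries information,
        # forcing this layer to compute something new rather than copy goodness.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        # "Goodness" = sum of squared activations. Push it above the threshold
        # for positive (real) data and below it for negative (fake) data.
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()   # gradients stay inside this layer
        self.opt.step()
        # Hand detached activations to the next layer, so nothing backprops through.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```

Layers get trained greedily, one after another, on the detached activations of the layer below - which is exactly what makes it interesting for the distributed-training discussion elsewhere in this thread.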

6

currentscurrents OP t1_j674tf3 wrote

They have some downsides, though. HOGWILD! requires a single shared memory space, and Horovod requires every machine to hold a copy of the entire model.

A truly local training method would mean the model could be as big as all the machines put together. An order-of-magnitude increase in model size could outweigh the weaker performance of forward-forward learning.

No idea how you'd handle machines coming and going; you'd have to dynamically resize the network somehow. There are still other unsolved problems to clear before we could have a GPT@home.
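To make the "truly local" point concrete, here's a toy sketch of the communication pattern it would allow (made-up code, not any real system): each machine owns one layer, trains it against a purely local objective, and only ever ships detached activations onward, so no machine needs the whole model and no gradients cross the network.

```python
import torch
import torch.nn as nn

class Worker:
    """Stand-in for one machine: holds a single layer and a local objective."""

    def __init__(self, d_in, d_out):
        self.layer = nn.Linear(d_in, d_out)
        self.opt = torch.optim.SGD(self.layer.parameters(), lr=1e-2)

    def process(self, x):
        h = torch.relu(self.layer(x))
        # Placeholder local loss (forward-forward would put its goodness loss here);
        # the point is only that it needs no signal from downstream machines.
        loss = (h.pow(2).mean() - 1.0).pow(2)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return h.detach()  # the only thing that ever goes over the wire

# Imagine each Worker living on a different volunteer's machine.
workers = [Worker(784, 512), Worker(512, 512), Worker(512, 10)]
x = torch.randn(64, 784)
for w in workers:   # one pass of activations through the "cluster"
    x = w.process(x)
```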

15

currentscurrents t1_j657n2z wrote

>The so called NPUs. Which are simplified GPUs optimized only for inference (forward passes). Such an algorithm would enable them to learn using only forward passes, hence without requiring backpropagation.

More importantly, you could build even simpler chips that physically implement a neural network out of analog circuits instead of emulating one with digital math.

This would use orders of magnitude less power, and also let you fit a larger network on the same amount of die space.

1

currentscurrents OP t1_j62auto wrote

Yes, but I don't want to create too much optimism; meta-learning also looked like a promising lead when Schmidhuber wrote his thesis on it back in 1987.

Honestly, I'm not sure much has changed since then beyond the amount of compute. Transformers are reportedly equivalent to 1990s meta-learning networks, except that they run better on GPUs - and GPUs have become powerful enough to run them at very large scale.

25

currentscurrents OP t1_j627rd0 wrote

Meh, transformers have been around for like 5 years and nobody figured this out until now.

I think this mostly speaks to how hard it is to figure out what neural networks are doing. Complexity doesn't matter to the training process (or to any other optimization process), so the algorithms these networks end up implementing can be arbitrarily complex.

(or, in practice, as complex as the model size and dataset size allow)

23

currentscurrents OP t1_j608oz5 wrote

TL;DR:

  • In-context learning (ICL) is the ability of language models to "learn from example" to perform new tasks just based on prompting. These researchers are studying the mechanism behind ICL.

  • They show that the attention layers allow transformers to implement a gradient descent optimization process at inference time. This mechanism produces very similar results to explicit optimization through fine-tuning, but was itself learned by optimization through gradient descent.

  • Based on this finding they apply momentum, a technique known to improve optimizers, to transformer attention layers. This produces a small-but-consistent improvement in performance on all tested tasks. They suggest that there are more improvements to be made by explicitly biasing transformers towards meta-optimization.

This reminds me of some meta-learning architectures that try to intentionally include gradient descent as part of inference (https://arxiv.org/abs/1909.04630) - the difference here is that LLMs somehow learned this technique during training. The implication is pretty impressive: at enough scale, meta-learning just emerges by itself because it's a good solution to the problem.
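For the "attention is secretly doing gradient descent" part: with the softmax stripped off (the linear-attention relaxation these analyses typically work with), the core identity is easy to check yourself. A toy numpy sketch, with made-up shapes and variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5                    # feature dim, number of in-context demonstrations

K = rng.normal(size=(n, d))    # keys computed from the demonstration tokens
V = rng.normal(size=(n, d))    # values computed from the demonstration tokens
q = rng.normal(size=d)         # query vector for the test token
W0 = rng.normal(size=(d, d))   # stand-in for the "zero-shot" weights

# Path 1: linear attention over the demonstrations, added to the zero-shot output.
attn_out = W0 @ q + sum(V[i] * (K[i] @ q) for i in range(n))

# Path 2: the same computation written as a weight update -- each demonstration
# contributes an outer-product "meta-gradient" v_i k_i^T to an implicit linear layer.
delta_W = sum(np.outer(V[i], K[i]) for i in range(n))
gd_out = (W0 + delta_W) @ q

print(np.allclose(attn_out, gd_out))   # True: attention == applying the accumulated updates
```

The interesting claim in the paper isn't just that the algebra works out, but that trained LLMs actually behave according to this picture on real ICL tasks.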

Other researchers are looking into ICL as well, here's another recent paper on the topic: https://arxiv.org/abs/2211.15661

94

currentscurrents t1_j5b6jf2 wrote

Interesting! I think it's good to remember that the important part of neural networks is the optimization-based learning process - you can run optimization on things other than neural networks. Like how Plenoxels got a ~100x speedup over NeRF by running the optimization on a structure more naturally suited to the problem: a 3D voxel grid.
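As a toy illustration of "run optimization on something that isn't a network" (nothing to do with the actual Plenoxels code - just fitting a raw voxel grid by gradient descent):

```python
import torch

# The "model" is just a dense 16x16x16 grid of values: no layers, no weight matrices.
target = torch.rand(16, 16, 16)                      # pretend ground-truth volume
grid = torch.zeros(16, 16, 16, requires_grad=True)   # the parameters ARE the grid
opt = torch.optim.Adam([grid], lr=0.1)

for step in range(200):
    idx = torch.randint(0, 16, (256, 3))             # random sample coordinates
    pred = grid[idx[:, 0], idx[:, 1], idx[:, 2]]
    true = target[idx[:, 0], idx[:, 1], idx[:, 2]]
    loss = ((pred - true) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())  # much lower than the initial ~0.33: same optimizer machinery, no neural network
```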

I do wonder how well TMs scale to less toy-like tasks, though. MNIST is pretty easy in 2023, and I think you could solve the BBC Sports dataset just by looking for keywords.

23

currentscurrents t1_j573tug wrote

Nvidia announced upscaling support in Chrome at CES 2023.

>The new feature will work within the Chrome and Edge browsers, and also requires an Nvidia RTX 30-series or 40-series GPU to function. Nvidia didn't specify what exactly is required from those two GPU generations to get the new upscaling feature working, nor if there's any sort of performance impact, but at least this isn't a 40-series only feature.

It's interesting, though, that it's working on your GTX 1660 Ti. Maybe Chrome implements a simpler upscaler as a fallback for older GPUs?

Check your chrome://flags for anything that looks related.

30

currentscurrents t1_j525hto wrote

Retrieval language models do have some downsides. Keeping a copy of the training data around is suboptimal for a few reasons:

  • Training data is huge. Retro's retrieval database is 1.75 trillion tokens. This isn't a very efficient way of storing knowledge, since a lot of the text is irrelevant or redundant.

  • Training data is still a mix of knowledge and language. You haven't achieved separation of the two types of information, so it doesn't help you perform logic on ideas and concepts.

  • Most training data is copyrighted. It's currently legal to train a model on copyrighted data, but distributing a copy of the training data with the model puts you on much less firm ground.

Ideally, I think you want to condense the knowledge from the training data down into a structured representation, perhaps a knowledge graph. Knowledge graphs are easy to perform logic on and can be edited by humans, and there's already an entire subfield studying them.
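A toy illustration of why that representation is appealing (made-up facts, plain Python, not any real knowledge-graph library):

```python
# A knowledge graph is just (subject, relation, object) triples: trivially
# editable by hand and easy to run simple logical queries over.
triples = {
    ("Retro", "is_a", "retrieval_language_model"),
    ("Retro", "retrieval_db_size", "1.75T_tokens"),
    ("retrieval_language_model", "is_a", "language_model"),
}

def query(subject=None, relation=None, obj=None):
    """Return all triples matching the pattern (None = wildcard)."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)]

def is_a(entity, category):
    """Transitive 'is_a' reasoning over the graph."""
    for (_, _, parent) in query(subject=entity, relation="is_a"):
        if parent == category or is_a(parent, category):
            return True
    return False

print(is_a("Retro", "language_model"))  # True, via retrieval_language_model
```

Editing a fact is just adding or removing a triple, which is a lot more tractable than editing a weight matrix.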

19

currentscurrents t1_j4rcc3e wrote

Interesting! I hadn't heard of RWKV before.

Getting rid of attention seems like a good way to speed up training (since attention's cost grows quadratically with sequence length), but how can it work so well without it?

Also, aren't RNNs usually slower to train than transformers because they can't be parallelized across the sequence?

10