
currentscurrents t1_j2uwlrh wrote

I think interpretability will help us build better models too. For example, in this paper they deeply analyzed a model trained on a toy problem: addition mod 113.

They found that it was actually working by doing a Discrete Fourier Transform to turn the numbers into sine waves. Sine waves are great for gradient descent because they're easily differentiable (unlike modular addition over the integers, which is not), and if you choose the right frequency they repeat every 113 numbers. The network then did a bunch of additions and multiplications on these sine waves that gave the same result as modular addition.

This lets you answer an important question: why didn't the network generalize to bases other than 113? Well, the frequency of the sine waves was hardcoded into the network, so it couldn't work for any other base.

This opens the possibility of doing neural network surgery and changing the frequency so it works with any base.
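
A quick sketch of that point in Python (hypothetical values, not the trained network's actual weights): the only thing tying the circuit to base 113 is the frequency, so swapping in w = 2πk/n retargets it to any base n.

```python
import numpy as np

# Hypothetical illustration, not the trained network's weights: the
# frequency w = 2*pi*k/n is the only thing that hardcodes the base n.
def make_frequency(n, k=1):
    """A frequency whose sine/cosine repeat every n integers."""
    return 2 * np.pi * k / n

w = make_frequency(113)  # the base the network was trained on
x = 5
# cos(w*x) has period 113: shifting x by 113 leaves it unchanged.
print(np.isclose(np.cos(w * x), np.cos(w * (x + 113))))  # True

# "Surgery" would amount to swapping in a frequency for another base:
w2 = make_frequency(97)
print(np.isclose(np.cos(w2 * x), np.cos(w2 * (x + 97))))  # True
```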

34

currentscurrents t1_j2trd40 wrote

If only it could run on a card that doesn't cost as much as a car.

I wonder if we will eventually hit a wall where more compute is required for further improvement, and we can only wait for GPU manufacturers. Similar to how they could never have created these language models in the 80s, no matter how clever their algorithms - they just didn't have enough compute power, memory, or the internet to use as a dataset.

5

currentscurrents OP t1_j2hdsvv wrote

Someone else posted this example, which is kind of what I was interested in. They trained a neural network to do a toy problem, addition mod 113, and then were able to determine the algorithm it used to compute it.

>The algorithm learned to do modular addition can be fully reverse engineered. The algorithm is roughly:

>Map inputs x, y → cos(wx), cos(wy), sin(wx), sin(wy) with a Discrete Fourier Transform, for some frequency w.

>Multiply and rearrange to get cos(w(x+y)) = cos(wx)cos(wy) − sin(wx)sin(wy) and sin(w(x+y)) = cos(wx)sin(wy) + sin(wx)cos(wy)

>By choosing a frequency w = 2πk/n we get period dividing n, so this is a function of x + y (mod n)

>Map to the output logits z with cos(w(x+y))cos(wz) + sin(w(x+y))sin(wz) = cos(w(x+y−z)) - this has the highest logit at z ≡ x + y (mod n), so softmax gives the right answer.

>To emphasise, this algorithm was purely learned by gradient descent! I did not predict or understand this algorithm in advance and did nothing to encourage the model to learn this way of doing modular addition. I only discovered it by reverse engineering the weights.

This is a very different way to do modular addition, but it makes sense for the network. Sine/cosine functions are waves that repeat with a period set by their frequency, so if you choose the right frequency you can implement the non-differentiable modular addition function using only differentiable operations.

Extracting this algorithm is useful for generalization; the original network only worked mod 113, but with the algorithm in hand we can plug in any frequency and hence any base. Of course this is a toy example and there are much faster ways to do modular addition, but maybe the approach could work for more complex problems too.
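
Here's what that looks like as a minimal NumPy sketch (assuming a single clean frequency k = 1; the real network spreads the computation over several learned frequencies rather than implementing it this cleanly):

```python
import numpy as np

def modular_add(x, y, n, k=1):
    """Compute (x + y) mod n using only differentiable operations."""
    w = 2 * np.pi * k / n                    # frequency with period n
    # Map inputs to sine/cosine features (the "DFT" step).
    cx, sx = np.cos(w * x), np.sin(w * x)
    cy, sy = np.cos(w * y), np.sin(w * y)
    # Angle-addition identities give cos/sin of w*(x+y).
    c_sum = cx * cy - sx * sy                # cos(w(x+y))
    s_sum = cx * sy + sx * cy                # sin(w(x+y))
    # Logit for each candidate z is cos(w(x+y-z)), maximal at z = (x+y) mod n.
    z = np.arange(n)
    logits = c_sum * np.cos(w * z) + s_sum * np.sin(w * z)
    return int(np.argmax(logits))

print(modular_add(50, 70, 113))  # 7, i.e. (50 + 70) % 113
print(modular_add(50, 70, 10))   # 0 -- the same code works for any base
```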

6

currentscurrents OP t1_j2h21zn wrote

>For example, a sort function is necessarily general over the entire domain of the entries for which it is valid, whereas a neural network will only approximate a function over the subfield of the domain for which it was trained, all bets are off elsewhere; it doesn't generalize.

But you can teach neural networks to do things like solve arbitrary mazes. Isn't that pretty algorithmic?

1

currentscurrents OP t1_j2gdumb wrote

By classical algorithm, I mean something that doesn't use a neural network. Traditional programming and neural networks are two very different ways to solve problems, but they can solve many of the same problems.

That sounds like a translation problem, which neural networks are good at. Just like in translation, it would have to understand the higher-level idea behind the implementation.

It's like text-to-code, but network-to-code instead.

3

currentscurrents OP t1_j2g9mvy wrote

Thanks, that is the question I'm trying to ask! I know explainability is a bit of a dead-end field right now, so it's a hard problem.

An approximate or incomprehensible algorithm could still be useful if it's faster or uses less memory. But I think to accomplish that you would need to convert it into higher-level ideas; otherwise you're just emulating the network.

Luckily, converting things into higher-level ideas is something neural networks are capable of, so it doesn't seem fundamentally impossible.

3

currentscurrents OP t1_j2g5x92 wrote

Thanks for the link, that's good to know about!

But maybe I should have titled this differently. I'm interested in taking a network that solves a problem networks are good at, and converting it into a code representation as a way to speed it up. Like translating between the two different forms of computation.

1

currentscurrents t1_j2f996k wrote

Attention maps can be a type of explanation.

They tell you what the model was looking at when it generated a word or identified an image, but not why it focused on those bits or why it made the decision it did. You can get some useful information by looking at them, but not everything you need to explain the model.
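
For concreteness, here's where the map comes from in standard scaled dot-product attention (a toy sketch, not any particular model's implementation):

```python
import numpy as np

def attention_map(Q, K):
    """Softmax of scaled dot products: each row is one query's
    distribution over the keys, i.e. what that token 'looked at'."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
print(attention_map(Q, K).round(2))  # shape (4, 6), rows sum to 1
```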

10

currentscurrents t1_j2csenb wrote

The number of layers is a hyperparameter, and people run hyperparameter optimization to find good values for it.
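
For example, here's a minimal scikit-learn sketch of searching over depth (toy data and a small grid; real searches use smarter methods than exhaustive grids):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Treat the number of hidden layers as a hyperparameter and
# keep whichever depth cross-validates best.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
search = GridSearchCV(
    MLPClassifier(max_iter=2000, random_state=0),
    param_grid={"hidden_layer_sizes": [(32,), (32, 32), (32, 32, 32)]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_)  # the depth/width that scored best
```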

Model size does seem to obey a real scaling law. It's possible that we will come up with better algorithms that work on smaller models, but it's also possible that neural networks need to be big to be useful. With billions of neurons and an even larger number of connections/parameters, the human brain is certainly a very large network.

3

currentscurrents t1_j2cm36p wrote

TL;DR: they want to take another language model (Google’s PaLM) and do Reinforcement Learning from Human Feedback (RLHF) on it, like OpenAI did for ChatGPT.

At this point they haven't actually done it yet, since they need both compute power and human volunteers to do the training:

>Human volunteers will be employed to rank those responses from best to worst, using the rankings to create a reward model that takes the original model’s responses and sorts them in order of preference, filtering for the top answers to a given prompt.

>However, the process of aligning this model with what users want to accomplish with ChatGPT is both costly and time-consuming, as PaLM has a massive 540 billion parameters. Note that the cost of developing a text-generating model with only 1.5 billion parameters can reach up to $1.6 million.

Since it has 540b parameters, you will still need a GPU cluster to run it.
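
For reference, the core of that reward-model step is usually just a pairwise ranking loss over the human rankings. A minimal PyTorch sketch of the InstructGPT-style loss (whether this project uses exactly this is an assumption, and `reward_model` is a hypothetical stand-in for a scalar-reward head on the language model):

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_model, preferred, rejected):
    """Push the reward of the human-preferred response above the rejected one."""
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    return -F.logsigmoid(r_pref - r_rej).mean()

# Toy stand-ins: "responses" as feature vectors, reward as a linear head.
reward_model = torch.nn.Linear(16, 1)
preferred, rejected = torch.randn(8, 16), torch.randn(8, 16)
print(reward_ranking_loss(reward_model, preferred, rejected))
```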

81

currentscurrents t1_j2by81g wrote

So, if I'm understanding right:

  • Backward chaining is an old classical algorithm for logical inference.

  • They've implemented backward chaining using a bunch of language models, so it works well with natural text.

  • Given a knowledge base (these are available as datasets these days), it can decompose a statement and check whether it's logically consistent with that knowledge.

  • The reason they're interested in this is to use it as a training signal to make language models more accurate.

This is effectively an old "expert system" from the 70s built out of neural networks. I wonder what other classical algorithms you can implement with neural networks.
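
For reference, the purely symbolic version is only a few lines; here's a toy propositional sketch (the paper's version swaps each symbolic step for a language model operating on natural-language statements):

```python
# goal -> list of alternative subgoal lists (Horn rules); [] means a fact
rules = {
    "mortal(socrates)": [["man(socrates)"]],
    "man(socrates)": [[]],
}

def prove(goal, depth=10):
    """Backward chaining: a goal holds if all subgoals of some rule hold."""
    if depth == 0:
        return False
    return any(
        all(prove(g, depth - 1) for g in subgoals)
        for subgoals in rules.get(goal, [])
    )

print(prove("mortal(socrates)"))  # True
print(prove("flies(socrates)"))   # False: nothing concludes it
```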

I also wonder if you could use this to create its own knowledge base from internet data. Since the internet is full of contradicting information, you would have to compare new data against existing data somehow and decide which to keep.

8