JustOneAvailableName t1_izbzbaq wrote

Because Schmidhuber claiming that transformers are based on his work was a meme for 3-4 years before he actually made the claim. Like here.

There are hundreds of more relevant papers to cite and read about (linear-scaling) transformers.

2

JustOneAvailableName t1_izbnfki wrote

> What did he claim that he didn't achieve?

Connections to his work are often vague. Yes, his lab tried something in the same, extremely general direction. No, his lab did not show that it actually worked, or which part of that broad direction worked. So I am not gonna cite Fast Weight Programmers when I want to write about transformers. Yes, Fast Weight Programmers also argued that there are more ways to handle variable-sized input than using RNNs. No, I don't think the idea is special at all. The main point of "Attention Is All You Need" was that removing part of the then-mainstream architecture (the recurrence) made models faster (or larger) to train while keeping the quality. It was the timing that made it special: they successfully went against the mainstream and made it work. It was not the idea itself.
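
To make the "removing the recurrence" point concrete, here is a minimal sketch (my own illustration, not from the comment or the paper's code) of scaled dot-product attention in PyTorch; the function and tensor names are hypothetical:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model); seq_len can vary per input,
    # and every position attends to every other in parallel, so no RNN
    # and no step-by-step recurrence is needed.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)              # attention weights per position
    return weights @ v                               # (batch, seq_len, d_model)

x = torch.randn(1, 7, 64)                # a length-7 sequence; any length works
out = scaled_dot_product_attention(x, x, x)
print(out.shape)                         # torch.Size([1, 7, 64])
```

Dropping the recurrence is exactly what lets all positions be computed in one batched matrix multiply, which is where the "faster (or larger) to train" part comes from.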

5

JustOneAvailableName t1_ixid0yr wrote

Schmidhuber would have a much better point if he kept it to quality criticism. His original papers are very often pretty far from the ideas he tries to take credit for. Not that the highly cited papers aren't a special case of something Schmidhuber also wrote a paper about, but in the same vein you can say that every paper is just a special case of a neural network.

27

JustOneAvailableName t1_iqqovjt wrote

You could make an impact with lots of data in a low-resource language, but you can't make an impact without experience in this area.

The 1060 is absolutely useless for any kind of training; it was a low-tier GPU six years ago. The older techniques are fine on a CPU.

2