hijacked_mojo t1_jcpaon7 wrote
Reply to Question on Attention by FunQuarter3511
Keys are produced by multiplying the input by a learned weight matrix, and the dot product between a query and each key determines how much attention that query pays to each position. The weights are then adjusted during backprop to minimize the error, and that in turn modifies the keys.
I have a video that goes through everything step-by-step:
https://www.youtube.com/watch?v=acxqoltilME
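If it helps, here's a minimal single-head sketch of what I'm describing (NumPy, with arbitrary dimensions; the weight matrices are random here just for illustration, whereas in a real model they'd be learned):

```python
import numpy as np

np.random.seed(0)
seq_len, d_model = 4, 8

X = np.random.randn(seq_len, d_model)   # input token embeddings

# The keys, queries, and values are all projections of the input
# through weight matrices -- this is what "keys come from weights" means.
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = X @ W_q   # queries
K = X @ W_k   # keys
V = X @ W_v   # values

# Dot product of each query with every key scores how much
# attention that query pays to each position.
scores = Q @ K.T / np.sqrt(d_model)

# Softmax turns the scores into attention weights that sum to 1 per query.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ V   # weighted sum of the values
print(output.shape)    # (4, 8)
```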
hijacked_mojo t1_jcpstsu wrote
Reply to comment by FunQuarter3511 in Question on Attention by FunQuarter3511
Yes, you have the right idea, but also add this to your mental model: the queries and values are produced by their *own* sets of weights. So it's not only the keys getting modified, but also the queries and values.
In other words, the query, key, and value weights all get adjusted via backprop to minimize the error. So it's entirely possible that on one backprop pass the value weights change a lot (for example) while the key weights barely move.
It's all about giving the network the "freedom" to adjust itself to best minimize the error.
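As a quick illustration (a PyTorch sketch with arbitrary dimensions and a made-up MSE objective, purely to show the mechanics): one backward pass gives each of the three weight matrices its own gradient, so they can get adjusted by very different amounts.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model = 4, 8

x = torch.randn(seq_len, d_model)
target = torch.randn(seq_len, d_model)  # dummy target for the toy loss

# Three independent projections: each has its *own* weights.
to_q = nn.Linear(d_model, d_model, bias=False)
to_k = nn.Linear(d_model, d_model, bias=False)
to_v = nn.Linear(d_model, d_model, bias=False)

Q, K, V = to_q(x), to_k(x), to_v(x)
attn = torch.softmax(Q @ K.T / d_model**0.5, dim=-1)
out = attn @ V

# One backward pass: each weight matrix receives its own gradient.
loss = nn.functional.mse_loss(out, target)
loss.backward()

for name, layer in [("query", to_q), ("key", to_k), ("value", to_v)]:
    print(name, layer.weight.grad.norm().item())
```

Run it and you'll typically see the three gradient norms differ, which is exactly that "freedom": the network decides for itself how much to move each set of weights.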