Submitted by FunQuarter3511 t3_11ugj0f in deeplearning
FunQuarter3511 OP t1_jcpmkyt wrote
Reply to comment by hijacked_mojo in Question on Attention by FunQuarter3511
>I have a video that goes through everything
First off, this video is amazing! You definitely have a new subscriber in me and I will be sharing! Hope you keep making content!!
So I was originally thinking about this like a Python dictionary/hash table, where you have keys and values and you retrieve a value only when the query exactly matches a key.
Rather, what is happening here is that the "loudest" key (the one with the largest dot product with the query) is expected to get the highest weight. This is okay, because the query/key (and value) weight matrices are learned anyway, so during backprop the most important keys will just learn to be louder (and the value weight matrix can learn as well).
In essence, the Python dictionary is just the wrong analogy to be using here. We are not necessarily giving greater weight to query/key pairs that happen to be similar; rather, we want the most important keys to score highly, and the network will learn to make that happen.
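Just to check my understanding, here's a rough numpy sketch of what I think is going on (the shapes and the random weights are placeholders I made up, not anything canonical):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # 4 tokens, embedding dim 8 (toy sizes)

# Stand-in "learned" projections -- random here, trained in a real model
W_q = rng.normal(size=(8, 8))
W_k = rng.normal(size=(8, 8))
W_v = rng.normal(size=(8, 8))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Soft "lookup": every query scores every key by dot product, so nothing
# has to match exactly the way a dict key would -- a key that aligns
# strongly with a query simply gets more of the weight.
scores = Q @ K.T / np.sqrt(K.shape[-1])
weights = softmax(scores)     # each row sums to 1
output = weights @ V          # weighted mix of values, not a single hit
```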
Does that sound right?
hijacked_mojo t1_jcpstsu wrote
Yes, you have the right idea, but also add this to your mental model: the queries and values are each produced by their *own* sets of weights. So it's not only the keys getting modified, but the queries and values too.
In other words, the query, key, and value weights all get adjusted via backprop to minimize the error. So it's entirely possible that on a given backprop pass the value weights get modified a lot (for example) while the key weights barely change.
It's all about giving the network the "freedom" to adjust itself to best minimize the error.
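If it helps, here's a minimal PyTorch sketch (toy shapes and a made-up loss, just to illustrate) showing that the three projections get their *own* gradients:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)                 # 4 tokens, dim 8 (toy sizes)

# Separate learned weights for queries, keys, and values
W_q = torch.randn(8, 8, requires_grad=True)
W_k = torch.randn(8, 8, requires_grad=True)
W_v = torch.randn(8, 8, requires_grad=True)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
attn = torch.softmax(Q @ K.T / 8 ** 0.5, dim=-1)
out = attn @ V

loss = out.pow(2).mean()              # placeholder loss, not a real objective
loss.backward()

# Each weight matrix carries its own gradient, so backprop is free to
# change one a lot while leaving another nearly untouched.
print(W_q.grad.norm(), W_k.grad.norm(), W_v.grad.norm())
```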
FunQuarter3511 OP t1_jcptitk wrote
That makes a ton of sense. Thanks for your help! You are a legend!