[R] Wq can be omitted in single-head attention — submitted by wangyi_fudan (t3_y2w87i) on October 13, 2022 at 11:27 AM in MachineLearning · 17 points · 7 comments
UltimateGPower (t1_is6adxb) wrote on October 13, 2022 at 4:42 PM: Why is it necessary for multiple heads? What this proof shows is that it is enough to transform either the keys or the queries. (3 points)
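For concreteness, here is a minimal numpy sketch (not from the thread; the names and sizes seq_len, d_model, and d_k are illustrative) of the identity behind both the post and the comment: in single-head dot-product attention the unnormalised score matrix is (X Wq)(X Wk)^T = X (Wq Wk^T) X^T, so a single merged matrix applied only to the keys reproduces the same scores and the separate query projection can be dropped.

```python
# Sketch: folding the query projection into the key projection in
# single-head attention leaves the attention scores unchanged.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4                 # illustrative sizes

X  = rng.standard_normal((seq_len, d_model))    # token representations
Wq = rng.standard_normal((d_model, d_k))        # query projection
Wk = rng.standard_normal((d_model, d_k))        # key projection

# Standard scores: project both queries and keys.
scores_separate = (X @ Wq) @ (X @ Wk).T

# Merged variant: drop Wq and project the keys with Wk @ Wq.T instead.
Wk_merged = Wk @ Wq.T                           # (d_model, d_model), rank <= d_k
scores_merged = X @ (X @ Wk_merged).T

print(np.allclose(scores_separate, scores_merged))  # True
```

Folding in the other direction (Wq_merged = Wq @ Wk.T, applied to the queries while leaving the keys unprojected) gives the same scores, which matches the comment's point that transforming either the keys or the queries is enough.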