[D] In vision transformers, why do tokens correspond to spatial locations and not channels? Submitted by stecas t3_zymi6r on December 30, 2022 at 1:06 AM in MachineLearning (6 comments)
Unlikely-Video-663 t1_j284flc wrote on December 30, 2022 at 9:16 AM In CNNs you usually already have long-range dependencies channel-wise, since every convolution mixes all input channels at each spatial position - and imho one of the advantages of ViT is allowing long-range spatial information flow as well. So channel-wise tokenization would not improve upon CNNs.. maybe?
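To make the premise of the question concrete, here is a minimal sketch of the standard ViT-style patch embedding (the class name PatchEmbed and the hyperparameters, 224px images, 16x16 patches, 768-dim tokens, are illustrative defaults, not something from this thread). Each token corresponds to one spatial patch, and all input channels of that patch are folded into the token's feature vector, which is exactly the "tokens = spatial locations, channels = features" setup being asked about:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each to a token.

    One token per spatial patch; the patch's pixels across all C channels are
    flattened into that token's embedding, so channels never become tokens.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel_size == stride == patch_size performs the
        # "flatten patch + linear projection" step in a single op.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                  # x: (B, C, H, W)
        x = self.proj(x)                   # (B, embed_dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
        return x

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768]) -- one token per 16x16 patch
```

Self-attention then operates over those 196 spatial tokens, which is where the long-range spatial information flow mentioned above comes from.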