[D] In vision transformers, why do tokens correspond to spatial locations and not channels?
Submitted by stecas on December 30, 2022 at 1:06 AM in MachineLearning