Submitted by stecas t3_zymi6r in MachineLearning
If the tokens correspond to channels (extracted by some set of conv layers), then this would seem to make the inputs to the transformer much more interpretable. The features that a channel ends up encoding can be studied, whereas a spatial location is just a spatial location.
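To make the two tokenizations concrete, here is a rough NumPy sketch. It assumes a single conv feature map of shape (C, H, W); the function names and the identity-projection attention are hypothetical simplifications for illustration, not any particular model's implementation.

```python
import numpy as np

def spatial_tokens(fmap):
    """(C, H, W) -> (H*W, C): one token per spatial location, as in a standard ViT."""
    c, h, w = fmap.shape
    return fmap.reshape(c, h * w).T

def channel_tokens(fmap):
    """(C, H, W) -> (C, H*W): one token per channel (the idea discussed here)."""
    c, h, w = fmap.shape
    return fmap.reshape(c, h * w)

def attention_weights(tokens):
    """Single-head softmax(T T^T / sqrt(d)) with identity projections (assumption: no learned Q/K)."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))  # C=8 channels on a 4x4 grid

sp = spatial_tokens(fmap)    # 16 tokens of dimension 8
ch = channel_tokens(fmap)    # 8 tokens of dimension 16
sp_attn = attention_weights(sp)  # (16, 16): location-to-location mixing
ch_attn = attention_weights(ch)  # (8, 8): channel-to-channel mixing
```

With channel tokens, each row of the attention matrix says how much one channel attends to every other channel, so the rows can be read against whatever features those channels encode.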
Unlikely-Video-663 t1_j284flc wrote
In CNNs you usually already have long-range dependencies channel-wise, and imho one of the advantages of ViT is that it allows long-range spatial information flow as well.
So channel-wise tokenization would not improve upon CNNs... maybe?