If the tokens corresponded to channels (extracted by some set of conv layers), the inputs to the transformer would seem to be much more interpretable. The features that a channel ends up encoding can be studied, whereas a spatial location is just a spatial location.

Comments

Unlikely-Video-663 t1_j284flc wrote on December 30, 2022 at 9:16 AM

#1,175,665

In CNNs you usually already have long-range dependencies channel-wise, and imho one of the advantages of ViT is allowing long-range spatial information flow as well.

So channel-wise tokenization would not improve upon CNNs... maybe?

tdgros t1_j28bwj3 wrote on December 30, 2022 at 10:58 AM

#1,177,023

I don't see tokens as corresponding to spatial locations? Before you add or concat a spatial embedding, there is nothing spatial at all, since transformers are permutation invariant! It's only when you add a spatial embedding that the tokens get a relation to their position back.

Maybe you'd prefer concatenating the spatial embeddings as opposed to adding them, so you can (mentally) consider the first channels as content-only and the rest as "spatial-related stuff". That's not strictly true after the first transformer layer, but it doesn't change a lot. Arguably concat should be the default operation, but adding just makes for smaller tokens, and it works fine.
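
To make the point above concrete, here is a minimal NumPy sketch (not from the thread) of a single self-attention layer with random stand-in weights. Strictly speaking, what it shows is permutation equivariance: shuffling the input tokens just shuffles the outputs the same way, and a pooled output does not change at all. The sizes, the fixed permutation, and the random position embeddings are all assumptions for illustration.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over a (n_tokens, d) array."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n_tokens, d = 6, 8                                  # illustrative sizes
x = rng.normal(size=(n_tokens, d))                  # content-only tokens
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = np.array([2, 0, 5, 1, 4, 3])                 # a fixed, non-trivial shuffle

# Without position embeddings: shuffled input gives the same outputs, shuffled.
out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)
equivariant = bool(np.allclose(out[perm], out_perm))
pooled_invariant = bool(np.allclose(out.mean(axis=0), out_perm.mean(axis=0)))

# Two ways to attach a position embedding (random stand-in for a learned one).
pos = rng.normal(size=(n_tokens, d))
x_add = x + pos                                     # same width: (n_tokens, d)
x_cat = np.concatenate([x, pos], axis=-1)           # wider tokens: (n_tokens, 2 * d)

# With positions tied to slots, moving content to other slots changes the result.
out_add = self_attention(x + pos, wq, wk, wv)
out_add_perm = self_attention(x[perm] + pos, wq, wk, wv)
position_matters = not np.allclose(out_add[perm], out_add_perm)
```

With concatenation, the first `d` channels of `x_cat` are exactly the content and the last `d` are exactly the position, which is the mental split described above; with addition the two are mixed from the start.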

stecas OP t1_j28khc1 wrote on December 30, 2022 at 12:41 PM

#1,178,537

Replying to tdgros (#1,177,023)

By spatial locations I mean spatial locations in the image, not the ordering of tokens. So instead of an image being worth “16 x 16 words”, it would be worth n words, where n is the number of extracted features in the form of channels.
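
A rough sketch of the two tokenizations being contrasted, assuming a hypothetical conv feature map of shape (C, h, w). The feature map here is random stand-in data, not the output of a real network, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
C, h, w = 32, 14, 14                        # assumed feature-map size
features = rng.normal(size=(C, h, w))       # stand-in for conv layer outputs

# Spatial tokens: one token per location, each holding all C channel values.
spatial_tokens = features.reshape(C, h * w).T   # (h * w, C)

# Channel-wise tokens: one token per channel, each holding its full spatial map.
channel_tokens = features.reshape(C, h * w)     # (C, h * w)
```

The two are transposes of each other: the same numbers, but attention runs across locations in one case and across channels in the other.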

tdgros t1_j28murr wrote on December 30, 2022 at 1:05 PM

#1,178,994

Replying to stecas (#1,178,537)

"An image is worth 16x16 words" means you can cut up an image into Nwords patches that are each 16x16 spatially. Depending on the size of the image, that gets you a different Nwords. Each of those words is originally 16x16x3 for RGB images, and is projected linearly to Ndims dimensions (usually ~1000). So you get Nwords words of dimension Ndims, where Nwords depends on the patch size and the image size, and Ndims is arbitrary.
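
The patch-and-project step described above can be sketched in NumPy as follows. The 224x224 input size, the embedding width of 1024, and the random projection matrix (standing in for a learned one) are assumptions for illustration; `patchify` is a hypothetical helper name.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of patch"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh * gw, patch * patch * c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # assumed RGB input size
n_dims = 1024                          # assumed embedding width (Ndims)

tokens = patchify(image)               # (Nwords, 16 * 16 * 3) = (196, 768)
w_embed = rng.normal(size=(tokens.shape[1], n_dims)) * 0.02  # stand-in for learned weights
embedded = tokens @ w_embed            # (Nwords, Ndims) = (196, 1024)
```

A 224x224 image gives 14 x 14 = 196 words, while a 32x48 image would give only 2 x 3 = 6, which is the dependence of Nwords on image size mentioned above.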

I don't know if your post is a typo, but you're using the same n twice for the number of words and number of channels/dimensions, which doesn't make sense to me. It might be just a different perspective...

stecas OP t1_j28o6xw wrote on December 30, 2022 at 1:18 PM

#1,179,285

Replying to tdgros (#1,178,994)

I just checked the papers. There are 16 x 16 total words. The length of sentences is standardized, i.e. all images have the same representation length when given to the transformer. It’s not that each word corresponds to 16x16x3 pixels.

But you understand my point, right? I’m asking why the images are cut up into words spatially instead of channel-wise.

tdgros t1_j28rhcc wrote on December 30, 2022 at 1:48 PM

#1,180,051

Replying to stecas (#1,179,285)

Ah, I see what you mean. You're right, my way of seeing it is the non-standard one. My point is that transformers don't really care about the original modality or the order or spatial arrangement of their tokens. ViTs are just transformers over sequences of "patches of pixels" (note: where channels are flattened together!). On top of this, there is work to forcefully bring back locality biases (position embeddings, Swin transformers...), which explains why I don't tend to break tokens into different dimensions. You can recompose the sequence into a (H/16)x(W/16)xNdims image, the channels of which can be visualized separately if you want. More often, it's the attention maps themselves that are used for visualization or interpretation, head by head (i.e. the number of channels here really is the number of heads).
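
Both visualizations mentioned above can be sketched in NumPy. This assumes row-major patch ordering, no class token, and random stand-ins for the token values and the query/key weights; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, patch = 224, 224, 16                 # assumed image and patch size
n_dims, n_heads = 64, 4                    # assumed token width and head count
gh, gw = H // patch, W // patch
tokens = rng.normal(size=(gh * gw, n_dims))    # stand-in for transformer tokens

# 1) Recompose the sequence into a (H/16) x (W/16) x Ndims feature image.
feature_image = tokens.reshape(gh, gw, n_dims)
channel_0 = feature_image[..., 0]              # one channel as a (14, 14) map

# 2) Per-head attention maps for one query token.
head_dim = n_dims // n_heads
wq = rng.normal(size=(n_dims, n_dims))         # stand-ins for learned weights
wk = rng.normal(size=(n_dims, n_dims))
q = (tokens @ wq).reshape(-1, n_heads, head_dim).transpose(1, 0, 2)  # (heads, N, head_dim)
k = (tokens @ wk).reshape(-1, n_heads, head_dim).transpose(1, 0, 2)
scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)                # (heads, N, N)
scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)

query_index = 0                                # top-left patch as the query
head_maps = attn[:, query_index, :].reshape(n_heads, gh, gw)  # one map per head
```

Each slice `head_maps[i]` is a 14x14 image showing where head `i` attends from the chosen query patch, so there are as many maps as there are heads.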