I just checked the paper. There are 16 × 16 total words, and the sequence length is standardized, i.e. every image yields the same representation length when fed to the transformer. It's not that each word corresponds to 16×16×3 pixels.
But you understand my point, right? I'm asking why the images are cut up into words spatially instead of channel-wise.
By spatial locations I mean spatial locations in the image, not the ordering of tokens. So instead of an image being worth "16 × 16 words", it would be worth n words, where n is some number of extracted features in the form of channels.
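To make the two alternatives concrete, here is a minimal NumPy sketch of spatial vs. channel-wise tokenization. The sizes are illustrative assumptions (a 256 × 256 RGB input and 16-pixel patches, which gives 16 × 16 = 256 spatial tokens), not something fixed by the thread:

```python
import numpy as np

# Illustrative sizes (assumption): 256x256 RGB image, 16-pixel patches.
H = W = 256
P = 16   # patch side length
C = 3    # channels
img = np.zeros((H, W, C))

# Spatial tokenization: each token is one PxPxC patch, flattened.
n_h, n_w = H // P, W // P                      # 16 x 16 grid of patches
patches = img.reshape(n_h, P, n_w, P, C).transpose(0, 2, 1, 3, 4)
tokens_spatial = patches.reshape(n_h * n_w, P * P * C)
print(tokens_spatial.shape)   # (256, 768): 256 tokens, 768 dims each

# Channel-wise tokenization (the alternative being asked about):
# each token is one full-resolution channel, flattened.
tokens_channel = img.transpose(2, 0, 1).reshape(C, H * W)
print(tokens_channel.shape)   # (3, 65536): 3 tokens, 65536 dims each
```

The sketch shows the trade-off behind the question: spatial patching gives many moderate-dimensional tokens, while channel-wise splitting of a raw image gives only a handful of very high-dimensional ones (a learned feature extractor would be needed to get n useful channel tokens).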