I just checked the paper. There are 16 × 16 total words, and the sequence length is standardized, i.e. every image yields the same representation length when fed to the transformer. It's not that each word corresponds to 16×16×3 pixels.
But you understand my point, right? I'm asking why the images are cut up into words spatially instead of channel-wise.
By spatial locations I mean spatial locations in the image, not the ordering of tokens. So instead of an image being worth "16 × 16 words", it would be worth n words, where n is some number of extracted features in the form of channels.
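To make the two alternatives concrete, here is a minimal NumPy sketch of spatial vs. channel-wise tokenization. The sizes are illustrative assumptions (a 256 × 256 RGB input and 16-pixel patches, which gives 16 × 16 = 256 spatial tokens), not something fixed by the thread:

```python
import numpy as np

# Illustrative sizes (assumption): 256x256 RGB image, 16-pixel patches.
H = W = 256
P = 16   # patch side length
C = 3    # channels
img = np.zeros((H, W, C))

# Spatial tokenization: each token is one PxPxC patch, flattened.
n_h, n_w = H // P, W // P                      # 16 x 16 grid of patches
patches = img.reshape(n_h, P, n_w, P, C).transpose(0, 2, 1, 3, 4)
tokens_spatial = patches.reshape(n_h * n_w, P * P * C)
print(tokens_spatial.shape)   # (256, 768): 256 tokens, 768 dims each

# Channel-wise tokenization (the alternative being asked about):
# each token is one full-resolution channel, flattened.
tokens_channel = img.transpose(2, 0, 1).reshape(C, H * W)
print(tokens_channel.shape)   # (3, 65536): 3 tokens, 65536 dims each
```

The sketch shows the trade-off behind the question: spatial patching gives many moderate-dimensional tokens, while channel-wise splitting of a raw image gives only a handful of very high-dimensional ones (a learned feature extractor would be needed to get n useful channel tokens).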