No_Possibility_7588 OP t1_j556292 wrote

Thank you, you're being very helpful!

The variable manipulation part is clear - if I may ask one question about the theoretical justification: how did you have the insight that the problem might be related to spatial bias? Was it pure intuition?

1

suflaj t1_j557l5p wrote

Initially, I knew that the biggest difference from previous approaches was the base network. Previously it was a ResNet, now a transformer. The transformer is free to rearrange features. Because classification worked well, I knew the problem wasn't the features themselves. In fact, even the less frequent classes were solved perfectly. So I suspected the arrangement of the features, since classification is done by a linear layer, which is also free to permute features however it wants.
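To illustrate that last point, here is a minimal toy sketch (not the actual experiment; all names and numbers are made up for illustration). It shows that a linear unit can absorb any fixed permutation of its inputs by permuting its weights the same way, while a small convolution kernel, which relies on neighbouring positions, gives a different response once the inputs are shuffled.

```python
def linear(weights, features):
    """Output of a single linear unit: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def conv1d_valid(kernel, signal):
    """Plain 1-D 'valid' cross-correlation; the result depends on neighbour order."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical feature vector and weights, chosen only for illustration.
features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
weights = [0.5, -1.0, 0.25, 2.0, -0.5, 1.5]

# A fixed, arbitrary permutation standing in for "rearranged features".
perm = [3, 0, 5, 1, 4, 2]
shuffled_features = [features[i] for i in perm]
shuffled_weights = [weights[i] for i in perm]

# The linear unit gives the same output if its weights are permuted to match.
logit_original = linear(weights, features)
logit_shuffled = linear(shuffled_weights, shuffled_features)

# A local difference kernel sees a smooth ramp before shuffling and noise after;
# no re-ordering of the three kernel weights can undo that in general.
kernel = [-1.0, 0.0, 1.0]
edges_original = conv1d_valid(kernel, features)
edges_shuffled = conv1d_valid(kernel, shuffled_features)

print(logit_original, logit_shuffled)
print(edges_original, edges_shuffled)
```

This is why a classifier head can look perfectly healthy on top of a backbone whose spatial layout is scrambled, while a convolutional detection head on the same backbone struggles.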

Then, after trying out every implemented convolutional detector and getting the same results, I was even more suspicious. What nailed it down was tracking how the features changed. As I trained the detector more and more, I saw that the transformer's pooled features changed as well. But when I froze the transformer network weights and tried a different task, performance didn't change in a statistically significant way.

When I looked at the activations, I saw that in the transformer part they did not correlate spatially with the locations. So, based on how CNNs work, I knew that, unlike ResNet-based feature extractors, the transformer was giving shuffled outputs.
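A rough sketch of what such a check could look like, assuming you have a 2-D activation map and a ground-truth box on the same grid (the helper names `box_mask` and `spatial_alignment` and the toy maps are hypothetical, not what I actually ran): correlate the activations with a binary mask of the box and see whether high activations sit where the object is.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def box_mask(height, width, box):
    """Flattened binary mask: 1.0 inside box (top, left, bottom, right), else 0.0."""
    top, left, bottom, right = box
    return [
        1.0 if top <= r < bottom and left <= c < right else 0.0
        for r in range(height)
        for c in range(width)
    ]


def spatial_alignment(activation_map, box):
    """Correlation between an activation map and the object's box mask."""
    height, width = len(activation_map), len(activation_map[0])
    flat = [value for row in activation_map for value in row]
    return pearson(flat, box_mask(height, width, box))


# Toy 4x4 maps for an object occupying the central 2x2 cells.
box = (1, 1, 3, 3)

# Spatially faithful map: strong activations where the object is.
aligned_map = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]

# Same activation values, but moved to the corners (shuffled layout).
shuffled_map = [
    [0.9, 0.1, 0.1, 0.9],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.9, 0.1, 0.1, 0.9],
]

aligned_score = spatial_alignment(aligned_map, box)
shuffled_score = spatial_alignment(shuffled_map, box)
print(aligned_score, shuffled_score)
```

Both maps contain exactly the same values, so anything that ignores position cannot tell them apart, but the alignment score is high for the first and negative for the second.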

And finally, because I observed double descent, I called upon previous work that hypothesized that the phenomenon might be the restructuring of the network itself. Because I had confirmed that the restructuring happening in the transformer part didn't change the performance, I could hypothesize that the restructuring was likely related to spatial properties. I could not confirm whether it would ever converge or generalize, as the increases were from something like 0.40 mAP50 to 0.50 mAP50, while I was contesting 0.93 mAP50 scores that were still quite flawed despite being state of the art. And outside of the metrics, it was not that obvious that the performance was so much better: even my mentor said "Wow, it works well", until I showed him the messed-up results.

4