Submitted by No_Possibility_7588 t3_10gu68z in deeplearning

I predicted that a certain change to the architecture of my agents would boost their coordination (in the context of multi-agent reinforcement learning). However, when I tested it in the Meetup environment, it did not work: it performs slightly worse than the baseline. This is how the environment works: three agents must collectively choose one of K landmarks and congregate near it. At each time step, each agent receives a reward equal to the change in distance between itself and the landmark closest to all three agents. The goal landmark therefore changes depending on the agents' current positions. When all three agents are adjacent to the same landmark, they receive a bonus of 1 and the episode ends.
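For concreteness, this is roughly how I think of the reward rule (a simplified sketch, not my actual environment code; the adjacency radius and the way distances are tracked when the goal landmark switches are placeholders):

```python
import numpy as np

def meetup_step(agent_pos, landmark_pos, prev_dists, adjacency_radius=1.0):
    """Simplified sketch of one reward step.

    agent_pos:    (3, 2) array of agent positions
    landmark_pos: (K, 2) array of landmark positions
    prev_dists:   (3,) distances to the goal landmark at the previous step
    """
    # The goal landmark is the one closest to all three agents (smallest summed distance)
    dists = np.linalg.norm(agent_pos[:, None, :] - landmark_pos[None, :, :], axis=-1)  # (3, K)
    goal = int(np.argmin(dists.sum(axis=0)))

    # Each agent is rewarded for the change in its distance to that landmark
    cur_dists = dists[:, goal]
    rewards = prev_dists - cur_dists

    # Bonus of 1 and termination when all three agents are adjacent to the same landmark
    done = bool((cur_dists <= adjacency_radius).all())
    if done:
        rewards = rewards + 1.0

    return rewards, cur_dists, done
```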

Scientifically speaking, how can I be rigorous about testing this hypothesis again? A few ideas:

1 Repeat the experiment multiple times with different random seeds to ensure that the results are robust and not influenced by random variation (see the sketch after this list).

2 Vary the parameters of the agent

  • Vary the number of modules used in the policy and test the effect on coordination.
  • Increase the number of agents.

3 Vary the parameters of the environment

  • Change the number of landmarks.
  • Add distractors.

4 Test another environment
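For point 1 in particular, this is roughly the kind of check I have in mind (just a sketch; `train_and_evaluate` is a placeholder for one full training run of a given variant with a given seed):

```python
import numpy as np
from scipy import stats

def compare_over_seeds(train_and_evaluate, n_seeds=10):
    """Compare the baseline and the modified architecture across several seeds.

    `train_and_evaluate(variant, seed)` is a placeholder for one training run
    that returns a scalar score (e.g. mean episodic return at evaluation time).
    """
    baseline = np.array([train_and_evaluate("baseline", seed) for seed in range(n_seeds)])
    modified = np.array([train_and_evaluate("modified", seed) for seed in range(n_seeds)])

    # Report the spread over seeds rather than a single run
    print(f"baseline: {baseline.mean():.3f} ± {baseline.std():.3f}")
    print(f"modified: {modified.mean():.3f} ± {modified.std():.3f}")

    # Welch's t-test as a rough check that the gap isn't just seed noise
    return stats.ttest_ind(modified, baseline, equal_var=False)
```

The number of seeds and the choice of test are debatable, of course; the main point is to report the spread across seeds instead of a single run.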

What do you think?


Comments


suflaj t1_j54vkhl wrote

Well, at minimum you would need to explain why it didn't meet your expectations. You need to elaborate on what grounds you hypothesized what you hypothesized, and then either explain why that basis was wrong or elaborate on what happened that you didn't predict.

I, for example, also assumed that vision transformers would get better performance on the task I had. But when they didn't (they were sometimes outperformed by YOLO v1), I investigated why and laid out proof that it was not human error (aside from my judgement), as well as suggestions on how to proceed next. To do that I reran the experiment many times, changed hyperparameters, and swapped out detectors, all to narrow down that it wasn't inadequate settings but the architecture and the specific model themselves.


No_Possibility_7588 OP t1_j5515hg wrote

Right, thanks for your input! So what you're saying is, you manipulated all sorts of possible confounding variables (hyperparameters, detectors, etc.) and concluded that it had to be related to the model's architecture. Which I guess is similar to what I was suggesting in my post, right? Changing the number of landmarks, changing the number of distractors, etc., and noticing whether the same result holds.


suflaj t1_j5521eb wrote

Well, yes and no. The variable manipulation was just to prove that the implementation wouldn't work. I also had to go deeper into the theoretical reasons why it wouldn't work.

This is something you can't (easily) prove with the implementation (you don't even have a guarantee that the implementation is correct), but you can disprove the hypothesis that it is due to a specific component. I used this counterproof as the basis for arguing that it was not due to the changed components but to the base network. Then I had to compare two different tasks on the same data to show that the poor performance was not tied to the base network or the subcomponents being too weak, but rather to how the information from the base network is used by the subcomponents. Once I showed that a different but similarly difficult task worked, I had proof that it was neither the data nor the modules, but either the task or the information flow. I knew the task was not flawed or too hard, because smaller networks had solved the problem (I was just aiming for better than solved).

Specifically, I argued that transformers don't necessarily have the spatial bias CNNs have, and as such make it harder for convolutional detectors to work with arbitrarily permuted features. I also showed that with sufficiently prolonged training the detectors would get better, but I concluded that at that rate it would be more viable to pretrain everything from scratch, which I didn't have the budget for.

I also confirmed double descent behaviour, which made all of this out of scope for my graduate thesis. Consult with your mentor/colleagues to make sure you're not going out of scope, either.


No_Possibility_7588 OP t1_j556292 wrote

Thank you, you're being very helpful!

The variable manipulation part is clear. If I may ask one question about the theoretical justification: how did you arrive at the insight that the problem might be related to spatial bias? Was it pure intuition?


suflaj t1_j557l5p wrote

Initially I knew that the biggest difference from previous approaches was the base network. Previously it was a ResNet, now a transformer. The transformer is free to rearrange features. Because classification worked well, I knew it wasn't the features themselves; in fact, even the less frequent classes were solved perfectly. So I suspected it was the arrangement of features, since classification is done by a linear layer, which is also free to permute features however it wants.

Then, after trying out every implemented convolutional detector and getting the same results, I was even more suspicious. What nailed it down was tracking how the features changed. Anyway, as I trained the detector more and more, I saw that the transformer's pooled features changed as well. But when I froze the transformer's weights and tried a different task, performance didn't change in a statistically significant way.
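(In PyTorch terms, the freezing step was roughly this kind of thing; the names are illustrative, not my actual code:)

```python
import torch

def freeze_backbone(model: torch.nn.Module, lr: float = 1e-4) -> torch.optim.Optimizer:
    # Freeze the transformer backbone so only the detector head keeps training.
    # Assumes the model exposes its feature extractor as `model.backbone`.
    for param in model.backbone.parameters():
        param.requires_grad = False
    head_params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(head_params, lr=lr)
```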

When I looked at the activations, I saw that in the transformer part they do not correlate spatially with the locations. So, based on how CNNs work, I knew that, as opposed to ResNet-based feature extractors, the transformer gives effectively shuffled outputs.

And finally, because I observed double descent, I called upon previous work that hypothesized that the phenomenon might be the restructuring of the network itself. Because I confirmed that the restructuring happening in the transformer part didn't change the performance, I could hypothesize that the restructuring was likely related to spatial properties. I could not confirm whether it would ever converge or generalize, as the increases were from around 0.40 mAP50 to 0.50 mAP50, while I was contesting 0.93 mAP50 scores that were still quite flawed despite being state of the art. And outside of the metrics it was not that obvious that the performance was so much better; even my mentor said "Wow, it works well", until I showed him the messed-up results.
