Submitted by Ttttrrrroooowwww t3_y9a0j3 in deeplearning
suflaj t1_it4h2vs wrote
I wouldn't use any of them because they don't seem to be worth it, and they're generally unproven on modern, relevant models. If I wanted to minimize variance, I'd just build an ensemble of models.
The best advice I can give you is to disregard older papers; model averaging is a roughly four-year-old idea and doesn't seem to be used much in practice.
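To be clear, by an ensemble I mean something as simple as averaging logits across independently trained models. A minimal sketch (all names here are placeholders, not from any particular library):

```python
# Rough sketch of logit averaging across an ensemble; `models` is a
# list of independently trained torch.nn.Module instances with the
# same output shape.
import torch

@torch.no_grad()
def ensemble_predict(models, x):
    # Averaging logits across members reduces prediction variance.
    logits = torch.stack([m(x) for m in models]).mean(dim=0)
    return logits.softmax(dim=-1)
```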
Lee8846 t1_it5hqba wrote
I wouldn't say so. One cannot judge the value of a specific method by whether it's old or new. For example, in self-supervised learning, as in the work on MoCo, people still use a moving average: it's a nice technique for maintaining the consistency of the key encoder. By the way, EMA actually helps to smooth weight fluctuations in some cases, which may be caused by patterns in the data. In this case, an ensemble of models might not help.
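For reference, the momentum update is roughly the following (a sketch in PyTorch, not MoCo's actual code; the function name is mine):

```python
# Momentum (EMA) update in the spirit of MoCo: the key encoder's weights
# trail the query encoder's, so the keys stay consistent across steps.
# m is the momentum coefficient (typically 0.999).
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        # key <- m * key + (1 - m) * query
        k.mul_(m).add_(q, alpha=1.0 - m)
```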
suflaj t1_it66q7y wrote
While it is true that the age of a method does not determine its value, the older a method is, the more likely it is that its performance gains have since been surpassed by some other method or model.
Specifically, I do not see why I would use weight averaging over a better model or training technique.
> In this case, an ensemble of models might not help.
Because you'd just use a bigger batch size to smooth out those fluctuations instead.
Ttttrrrroooowwww OP t1_it5qizp wrote
Currently my research focuses mostly on the semi-supervised space, where EMA in particular is still relevant. Apparently it's good for reducing confirmation bias arising from the inherent noisiness of pseudo-labels.
While that agrees with your statement and answers my question (that I should use EMA because it's relevant), I also found some codebases that use methods which are never mentioned in the corresponding publications.
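For context, the EMA setup I mean looks roughly like this (a sketch under my own assumptions, in the spirit of Mean Teacher, not code from any of those repositories):

```python
# EMA "teacher" in semi-supervised training: the teacher's weights are an
# EMA of the student's, so its pseudo-labels drift slowly instead of
# chasing the student's latest, possibly noisy, updates. The teacher is
# assumed to start as a deep copy of the student; all names are mine.
import torch
import torch.nn.functional as F

def semi_supervised_step(student, teacher, opt, x_unlabeled, decay=0.999):
    with torch.no_grad():
        pseudo = teacher(x_unlabeled).argmax(dim=-1)  # teacher's labels
    loss = F.cross_entropy(student(x_unlabeled), pseudo)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        # EMA update: the teacher trails the student.
        for s, t in zip(student.parameters(), teacher.parameters()):
            t.mul_(decay).add_(s, alpha=1.0 - decay)
    return loss.item()
```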
suflaj t1_it686mk wrote
This would depend on whether you believe newer noisy data is more important. I would not use it in general, because that's not something you can guarantee on all data; it would have to be theoretically confirmed beforehand, which might be impossible for a given task.
If I wanted to reduce the noisiness of pseudo-labels, I would not want to introduce additional biases on the data itself, so I'd rather do sample selection, which seems to be what the newest papers suggest. Weight averaging introduces biases akin to what weight normalization techniques did, and those were partially abandoned in favour of different approaches, e.g. larger batch sizes, because the alternatives proved more robust and performant in practice as our models grew more different from the ML baselines we based our findings on.
Now, if I weren't aware of papers that came out this year, maybe I wouldn't be saying this. That's why I recommended you stick to newer papers: problems are never really fully solved, and newer solutions tend to make bigger strides than optimizing older ones.
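To be concrete, by sample selection I mean something along these lines (a minimal confidence-thresholding sketch; the 0.95 threshold is illustrative, not from any particular paper):

```python
# Confidence-based sample selection: only keep the pseudo-labelled
# examples the model is already confident about, rather than smoothing
# the weights. The threshold value is illustrative.
import torch

@torch.no_grad()
def select_confident(model, x_unlabeled, threshold=0.95):
    probs = model(x_unlabeled).softmax(dim=-1)
    confidence, pseudo = probs.max(dim=-1)
    keep = confidence >= threshold  # boolean mask over the batch
    return x_unlabeled[keep], pseudo[keep]
```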
Ttttrrrroooowwww OP t1_it6dz0l wrote
Can you point me to the papers you reference?
I've only come across 2019 papers about sample selection (assuming you mean data sampling).
suflaj t1_it6fwta wrote
That is way too old. Here are a few papers:
Ttttrrrroooowwww OP t1_it6n348 wrote
Thanks a lot
I read PARS. It looks very interesting and is somewhat related to pseudo-label entropy minimization. I'm thinking of going in a similar direction, so this is a great tip.