Submitted by Simusid t3_1280rhi in MachineLearning
IntelArtiGen t1_jeguknc wrote
I've used autoencoders on spectrograms, and in theory you don't need an A100 or 80M spectrograms to get some results.
I haven't used ViTMAE specifically, but I've read similar papers. I'm not sure how to interpret the value of the loss. Some tips that are valid for most DL projects: can your model overfit on a smaller version of your dataset (say 1000 spectrograms)? If yes, perhaps your model isn't large / efficient enough to process your whole dataset (though bird songs shouldn't be that hard to learn imo). At the very least, this lets you run more epochs faster and debug some parameters. If your model can't overfit, you may have a problem in your pre/post processing.
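To make the overfit test concrete, here's a minimal sketch. Everything in it is a stand-in and not your actual setup: a tiny linear autoencoder in plain NumPy replaces ViTMAE, random vectors replace the spectrograms, and `overfit_check` is a hypothetical helper name. The only point is the shape of the test: take a small subset, train on it for many steps, and check that the reconstruction loss collapses.

```python
import numpy as np

def overfit_check(data, hidden=32, steps=2000, lr=0.005, seed=0):
    """Train a tiny linear autoencoder on `data` and return (initial, final) loss.

    Stand-in for the real model: the goal is only to see whether the loss
    can be driven close to zero on a small subset.
    """
    rng = np.random.default_rng(seed)
    n, d = data.shape
    w_enc = rng.normal(0.0, 0.1, size=(d, hidden))
    w_dec = rng.normal(0.0, 0.1, size=(hidden, d))
    losses = []
    for _ in range(steps):
        z = data @ w_enc                      # encode
        err = z @ w_dec - data                # reconstruction error
        # Loss: squared error summed over features, averaged over samples.
        losses.append(float(np.mean(np.sum(err ** 2, axis=1))))
        g = 2.0 * err / n                     # dLoss / dReconstruction
        grad_dec = z.T @ g
        grad_enc = data.T @ (g @ w_dec.T)
        w_enc -= lr * grad_enc
        w_dec -= lr * grad_dec
    return losses[0], losses[-1]

# Assumed toy data: 16 flattened "spectrograms" with 64 values each.
rng = np.random.default_rng(42)
subset = rng.normal(size=(16, 64))
initial, final = overfit_check(subset)
print(f"initial loss {initial:.3f}, final loss {final:.6f}")
```

If the final loss stays close to the initial one even on a handful of samples, I'd look at the data pipeline before touching the model size.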
Do ViTMAE models need normalized inputs? Spectrograms can have large values by default, which may not be easy to process and can be hard to normalize. Your input and your output should be in a consistent range of values, and you should use the right layers in your model to make that happen. fp16 training can also interfere with that.
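As an illustration of what I mean by a coherent range, here's one common way to do it, sketched in NumPy: convert the power spectrogram to dB, then standardize to zero mean and unit variance, and keep the statistics so the model output can be mapped back. The function names, the `eps` floor and the per-spectrogram statistics are my assumptions, not something ViTMAE prescribes; statistics computed once over the whole dataset are often the better choice.

```python
import numpy as np

def normalize_spectrogram(power_spec, eps=1e-10):
    """Log-compress a power spectrogram (dB) and standardize it.

    Returns the normalized array plus the (mean, std) needed to invert it.
    `eps` is an assumed floor to avoid log(0).
    """
    log_spec = 10.0 * np.log10(np.maximum(power_spec, eps))
    mean = float(log_spec.mean())
    std = float(log_spec.std()) + 1e-8        # guard against a constant input
    return (log_spec - mean) / std, mean, std

def denormalize_spectrogram(norm_spec, mean, std):
    """Invert normalize_spectrogram: back to dB, then back to power."""
    log_spec = norm_spec * std + mean
    return 10.0 ** (log_spec / 10.0)

# Assumed toy input: power values spanning many orders of magnitude.
rng = np.random.default_rng(0)
spec = rng.uniform(1e-3, 1e6, size=(128, 64))
norm, mean, std = normalize_spectrogram(spec)
restored = denormalize_spectrogram(norm, mean, std)
print(norm.mean(), norm.std(), np.allclose(restored, spec))
```

This also matters for fp16: float16 tops out at 65504, so raw power values like the ones above would overflow to inf, while the normalized values sit comfortably in range.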
ViTMAE isn't specifically for sounds, right? I think there have been multiple attempts to use it for sounds; this paper (https://arxiv.org/pdf/2212.09058v1.pdf) cites other papers:
>Inspired by the success of the recent visual pre-training method MAE [He et al., 2022], MSM-MAE [Niizumi et al., 2022], MaskSpec [Chong et al., 2022], MAE-AST [Baade et al., 2022] and Audio-MAE [Xu et al., 2022] learn the audio representations following the Transformer-based encoder-decoder design and reconstruction pre-training task in MAE
You can look at their results and how they made it work; these papers have probably also published their code.
Be careful with how you process sounds: the pre/post processing is different from that for images, which may cause some problems.
Simusid OP t1_jegwcb6 wrote
All good info, thanks for the tips. I think ML for audio lags far behind imagery and NLP. I'm particularly interested in transients and weak signals.