Recent comments in /f/MachineLearning

KD_A OP t1_jegxas8 wrote

Interesting, and I think I know what you mean. One naive idea is a "top-k tokens" system. For each completion, this system considers the top k highest-probability tokens (conditional on the previous ones) at each completion-token position. It then takes the sum of the average likelihoods across all k^n paths (n = # of completion tokens) for that completion. That would be one way to address the synonym problem. But of course it results in way more computation; see the sketch below.
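
Here's a minimal sketch of that idea, assuming a hypothetical `next_topk` callable (a placeholder, not a real API) that returns the top-k (token, conditional probability) pairs given a token prefix:

    from typing import Callable, List, Tuple

    # Hypothetical LM interface (an assumption, not a real API): given a token
    # prefix, return the k highest-probability (token, conditional_prob) pairs.
    NextTopK = Callable[[Tuple[str, ...]], List[Tuple[str, float]]]

    def sum_avg_likelihoods(next_topk: NextTopK, n: int) -> float:
        """Sum of the average per-token likelihoods over all k^n top-k paths."""
        total = 0.0

        def walk(prefix: Tuple[str, ...], probs: Tuple[float, ...]) -> None:
            nonlocal total
            if len(prefix) == n:
                total += sum(probs) / n           # this path's average likelihood
                return
            for token, p in next_topk(prefix):    # branch over the top k tokens
                walk(prefix + (token,), probs + (p,))

        walk((), ())
        return total

Each `next_topk` call needs a fresh forward pass conditioned on the path so far, which is where the extra computation comes from.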

Edit: actually, thinking a bit more, I think the synonym problem is more-or-less a non-issue for LMs trained to do next-token prediction.

2

IntelArtiGen t1_jeguknc wrote

I've used autoencoders on spectrograms, and in theory you don't need an A100 or 80M spectrograms to get some results.

I haven't used ViTMAE specifically, but I've read similar papers. I'm not sure how to interpret the value of the loss. You can use some tips that are valid for most DL projects. Can your model overfit a smaller version of your dataset (say, 1000 spectrograms)? If yes, perhaps your model isn't large/efficient enough to handle your whole dataset (though bird songs shouldn't be that hard to learn imo). At the very least, this lets you run more epochs faster and debug some parameters. If your model can't overfit, you may have a problem in your pre/post processing. (A sketch of this check is below.)
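
Something like this, where the tiny autoencoder and the random "spectrograms" are dummy stand-ins for your actual model and data:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Dummy stand-ins, just to show the shape of the sanity check.
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))
    specs = torch.randn(1000, 128)                # ~1000 flattened spectrograms
    loader = DataLoader(TensorDataset(specs), batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(100):                      # many passes over a tiny set
        for (batch,) in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch), batch)   # reconstruction loss
            loss.backward()
            optimizer.step()
        if epoch % 10 == 0:
            print(epoch, loss.item())
    # If the loss can't approach zero even here, suspect the model or the
    # pre/post processing rather than the dataset size.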

Do ViTMAE models need normalized inputs? Spectrograms can have large values by default, which may not be easy to process, and they may be hard to normalize. Your input and your output should be in a coherent range of values, and you should use the right layers in your model if you want that to happen. Also, fp16 training can mess with that.
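
For example, one common recipe (my assumption, not something ViTMAE specifically requires) is to log-compress the magnitude spectrogram and then standardize it:

    import numpy as np

    def normalize_spectrogram(spec: np.ndarray, eps: float = 1e-6) -> np.ndarray:
        """Log-compress a magnitude spectrogram, then standardize it."""
        log_spec = np.log(spec + eps)             # tame the huge dynamic range
        return (log_spec - log_spec.mean()) / (log_spec.std() + eps)

If you compare reconstructions in the original scale, remember to invert this at the output end.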

ViTMAE isn't specifically for sounds, right? I think there have been multiple attempts to use it for sounds; this paper (https://arxiv.org/pdf/2212.09058v1.pdf) cites other papers:

>Inspired by the success of the recent visual pre-training method MAE [He et al., 2022], MSM-MAE [Niizumi et al., 2022], MaskSpec [Chong et al., 2022], MAE-AST [Baade et al., 2022] and Audio-MAE [Xu et al., 2022] learn the audio representations following the Transformer-based encoder-decoder design and reconstruction pre-training task in MAE

You can look at their results and how they made it work; these papers probably also published their code.

Be careful with how you process sounds; the pre/post processing is different from that for images, which may cause some problems.

3

Pas7alavista t1_jegu8de wrote

A spanning set describes the entire space: it's a set of vectors that you can combine using addition and scalar multiplication to obtain any other vector in the space (the span is the set of all such combinations). For example, a spanning set for the real plane is {(1,0), (0,1)}. This particular set is also an orthonormal basis, and you can think of each vector as representing one of two orthogonal dimensions; this is because their dot product is 0.

However, any set of two vectors that don't lie on the same line will span the real plane. For example, {(1,1), (0,1)} spans the real plane, but those vectors are not orthogonal. A quick numerical check is below.
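
Here's a quick numpy check of that claim (the target vector is arbitrary, just for illustration):

    import numpy as np

    # Columns are the spanning vectors (1,1) and (0,1): independent, not orthogonal.
    A = np.array([[1.0, 0.0],
                  [1.0, 1.0]])
    target = np.array([3.0, -2.0])        # an arbitrary vector in the plane
    coeffs = np.linalg.solve(A, target)   # solvable because the columns are independent
    print(coeffs)                         # [ 3. -5.]  ->  3*(1,1) + (-5)*(0,1)
    print(A @ coeffs)                     # recovers [ 3. -2.]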

Overall, though, it's always important to be aware of your input space and the features/dimensions you use to represent it. You can easily introduce bias, or just noise, in a number of ways if you aren't thorough. One example would be not normalizing your data.

2

turnip_burrito t1_jegu7uk wrote

Yeah I made the simplification of random vectors myself just to approximate what uncorrelated "features" in an embedding space could be like.

One thing that's relevant for embedding space size is Takens' theorem: https://en.wikipedia.org/wiki/Takens%27s_theorem

If you have an originally D-dimensional system (measured using correlation or information dimension, for example) and you time-delay embed data from the system, you need at most 2*D+1 embedding dimensions (it can be fewer) to ensure no false nearest neighbors.

This sets an upper bound if you use time delays. Now, for a *non-*time-delayed embedding, I don't know the answer. I asked GPT-4 and it said no analytical method presently exists for determining the embedding dimension M ahead of time. But an experimental method does exist that you can perform before training a model: grow the number of embedding dimensions M and calculate the fraction of false nearest neighbors (FNN) each time M grows. Once FNN drops to near zero, you've found a suitable M; see the sketch below.
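
Here's a rough sketch of that procedure for time-delay embeddings (a simplified version of the classic false-nearest-neighbors test; the threshold r and the demo signal are arbitrary choices of mine):

    import numpy as np

    def delay_embed(x: np.ndarray, m: int, tau: int = 1) -> np.ndarray:
        """Time-delay embedding of a 1-D series into m dimensions."""
        n = len(x) - (m - 1) * tau
        return np.stack([x[i * tau : i * tau + n] for i in range(m)], axis=1)

    def fnn_fraction(x: np.ndarray, m: int, tau: int = 1, r: float = 10.0) -> float:
        """Fraction of nearest neighbors in dim m that separate in dim m+1."""
        emb_m1 = delay_embed(x, m + 1, tau)
        emb_m = delay_embed(x, m, tau)[: len(emb_m1)]  # align lengths
        false_count = 0
        for i in range(len(emb_m1)):
            d = np.linalg.norm(emb_m - emb_m[i], axis=1)
            d[i] = np.inf                        # exclude the point itself
            j = int(np.argmin(d))                # nearest neighbor in dim m
            d_m1 = np.linalg.norm(emb_m1[i] - emb_m1[j])
            if d[j] > 0 and d_m1 / d[j] > r:     # neighbor was "false"
                false_count += 1
        return false_count / len(emb_m1)

    # Grow m until the FNN fraction drops near zero; that m is a suitable M.
    x = np.sin(np.linspace(0, 60, 3000)) + 0.01 * np.random.randn(3000)
    for m in range(1, 5):
        print(m, fnn_fraction(x, m))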

One neat part about all this is that if you have some complex D-dimensional manifold or distribution with features that "poke out" in different directions of the embedding space (imagine a wheel hub with spokes), then increasing the embedding space size M will also increase the distance between the spokes. If M gets large enough, all the spokes should be nearly equidistant from each other, but points along a single spoke are also far from each other in most directions, except for just a small subset.

I don't think that making it super large would actually make learning on the data any easier, though. Best to stick close to the minimum embedding dimension M. If you go larger, measurement noise in your data becomes more represented in the embedded distribution. That noise also unfolds as you increase M, which means that if you're only trying to predict the D-dimensional system, you'll have a harder time: you're now predicting a (D+large#)-dimensional system, and the D-dimensional system's distribution gets lost in the larger one.

2

LoaderD t1_jegsuar wrote

> Here we have a world-class complex recommendation

...You know this is Twitter's recommender system, right? All the tweets I interact with are ML-related, from very 'left' people like Jeremy Howard.

My recommender system could legit be:

    if interested_in_finance_or_ML:
        recommend_alt_right_hate_speech_accounts()
        recommend_crypto_scam_ads()

24

KD_A OP t1_jegsqe6 wrote

That's a good criticism. I'd guess that this issue is quite problem-dependent. And I'd hope that an LM is good enough to discriminate between the correct-but-many-synonyms class and the wrong-but-few-synonyms class. (We're using the word synonym, but we really mean "high probability token path given prompt".) It's hard for me to come up with examples where this problem arises in a real classification task. But they may be out there.

2

Simusid OP t1_jegspl8 wrote

ViTMAE isn't a generative model. The intent is to use unlabeled data to train the encoder. After that, the decoder is thrown away. Then (in theory) I would use a relatively small amount of labeled data and the encoder with a new head to do traditional supervised classification; something like the sketch below.
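
A minimal sketch of that last step in PyTorch, assuming a hypothetical `pretrained_encoder` that maps a batch of spectrograms to (batch, embed_dim) features (names and shapes are placeholders, not the actual ViTMAE interface):

    import torch
    from torch import nn

    class SpectrogramClassifier(nn.Module):
        """Frozen pretrained encoder + a fresh linear classification head."""

        def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int):
            super().__init__()
            self.encoder = encoder
            for p in self.encoder.parameters():
                p.requires_grad = False           # freeze the pretrained encoder
            self.head = nn.Linear(embed_dim, num_classes)  # trained on labels

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            with torch.no_grad():                 # encoder is a feature extractor
                feats = self.encoder(x)           # assumed shape: (batch, embed_dim)
            return self.head(feats)

    # e.g. clf = SpectrogramClassifier(pretrained_encoder, embed_dim=768, num_classes=10)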

1

Necessary-Meringue-1 t1_jegshy4 wrote

It's a pretty cool resource to get to look at an enterprise recommendation algorithm like that.

​

An aside: if you want a chuckle, search the term "Elon" in the repo:
https://github.com/twitter/the-algorithm/search?q=elon
https://github.com/twitter/the-algorithm/search?q=elon&type=issues

​

[edit 1]
since it's gone now, here's the backup provided by u/MjrK:
https://i.imgur.com/jxqaByA.png
[edit 2] lol
https://github.com/twitter/the-algorithm/commit/ec83d01dcaebf369444d75ed04b3625a0a645eb9#diff-a58270fa1b8b745cd0bd311bed9cd24c983de80f96e7bd445e16e88b61e492b8L225

100