Simusid
Simusid OP t1_jegspl8 wrote
Reply to comment by Art10001 in [discussion] Anybody Working with VITMAE? by Simusid
ViTMAE isn't a generative model. The intent is to use unlabeled data to pretrain the encoder; after that, the decoder is thrown away. Then (in theory) I would pair the encoder with a new head and a relatively small amount of labeled data to do traditional supervised classification.
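The pretrain-then-replace-the-decoder pattern can be sketched in plain PyTorch. The encoder below is a toy placeholder, not the real ViTMAE architecture, and the label count is made up for illustration:

```python
import torch
import torch.nn as nn

# Placeholder standing in for a pretrained ViTMAE encoder. In practice this
# would be the encoder half of a masked autoencoder trained on unlabeled
# images, with the decoder discarded after pretraining.
encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 256),
    nn.ReLU(),
)

# Attach a fresh classification head and fine-tune on a small labeled set.
num_classes = 4  # assumed label count, purely for illustration
model = nn.Sequential(encoder, nn.Linear(256, num_classes))

x = torch.randn(8, 3, 32, 32)   # a batch of 8 toy images
logits = model(x)
print(logits.shape)             # torch.Size([8, 4])
```

In a real setup you would typically freeze or partially freeze the pretrained encoder and train only (or mostly) the new head.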
Simusid t1_jdnag2i wrote
Reply to comment by RiotSia in [D] Simple Questions Thread by AutoModerator
I’m unable to connect to hamata.so. Can you tell me what kind of analysis you want to do?
Simusid OP t1_jciguq5 wrote
Reply to comment by deliciously_methodic in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
"Words to numbers" is the secret sauce of all these models, including the new GPT-4. Individual words are tokenized (sometimes into "word pieces"), and a vocabulary maps each token to a number. Then the model is trained on pairs of sentences A and B. Sometimes the model is shown a pair where B correctly follows A, and sometimes not. Eventually the model learns to predict what is most likely to come next.
"he went to the bank", "he made a deposit"
B probably follows A
"he went to the bank", "he bought a duck"
Does not.
That is one type of training to learn valid/invalid text. Another is "leave one out" training. In this case the input is a full sentence minus one word (typically).
"he went to the convenience store and bought a gallon of _____"
and the model should learn that the most common answer will probably be "milk"
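The "words to numbers" step can be illustrated with a toy tokenizer. Real tokenizers use learned subword vocabularies, but the mapping idea is the same:

```python
# Toy "words to numbers" mapping: build a vocabulary as words are first seen.
sentence_vocab = {}  # word -> integer id

def tokenize(text):
    ids = []
    for word in text.lower().split():
        if word not in sentence_vocab:
            sentence_vocab[word] = len(sentence_vocab)
        ids.append(sentence_vocab[word])
    return ids

print(tokenize("he went to the bank"))   # [0, 1, 2, 3, 4]
print(tokenize("he made a deposit"))     # [0, 5, 6, 7]  ("he" reuses id 0)
```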
Back to your first question. In 3D your first two embeddings should be closer together because they are similar, and both should be "far" from the third embedding.
Simusid t1_jc0ltpb wrote
Reply to comment by kuraisle in [D] Simple Questions Thread by AutoModerator
I downloaded over 1M and it cost me about $110
Simusid OP t1_jbvrbnu wrote
Reply to comment by lppier2 in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
Well that was pretty much the reason I did this test. And right now I'm leaning toward SentenceTransformers.
Simusid OP t1_jbu5594 wrote
Reply to comment by pyepyepie in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
Actually the curated dataset (ref github in original post) is almost perfectly balanced. And yes, sentence embeddings is probably the SOTA approach today.
I agree that when I say the graphs "seems similar", that is a very qualitative label. However I would not say it "means nothing". At the far extreme if you plot:
import numpy as np
import matplotlib.pyplot as plt
from umap import UMAP
x = UMAP().fit(np.random.random((10000, 75)))
plt.scatter(x.embedding_[:, 0], x.embedding_[:, 1], s=1)
You will get "hot garbage", a big blob. My goal, and my only goal, was to visually see how "blobby" OpenAI was vs ST. And clearly they are visually similar.
Simusid OP t1_jbu3q8m wrote
Reply to comment by polandtown in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
Here is some explanation about UMAP axes and why they should usually be ignored: https://stats.stackexchange.com/questions/527235/how-to-interpret-axis-of-umap
Basically, it's because the projection is nonlinear, so the axes carry no consistent meaning.
Simusid OP t1_jbu2n5w wrote
Reply to comment by utopiah in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
That was not the point at all.
Continuing the cat analogy, I have two different cameras. I take 20,000 pictures of the same cats with both. I have two datasets of 20,000 cats. Is one dataset superior to the other? I will build a model that tries to predict cats and see if the "quality" of one dataset is better than the other.
In this case, the OpenAI dataset appears to be slightly better.
Simusid OP t1_jbu229y wrote
Reply to comment by pyepyepie in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
Regarding the plot, the intent was not to measure anything, nor identify any specific differences. UMAP is an important tool for humans to get a sense of what is going on at a high level. I think if you ever use a UMAP plot for analytic results, you're using it incorrectly.
At a high level I wanted to see if there were very distinct clusters or amorphous overlapping blobs and to see if one embedding was very distinct. I think these UMAPs clearly show good and similar clustering.
Regarding the classification task: again, this is a notional task, not an attempt to solve a concrete problem. The goal was to use nearly identical models with both sets of embeddings to see if there were consistent differences. There were. The OpenAI models marginally outperform the SentenceTransformer models every single time (several hundred runs with various hyperparameters). Whether it's a "carefully chosen" task or not is immaterial. In this case "carefully chosen" means softmax classification accuracy on the 4 labels in the curated dataset.
Simusid OP t1_jbu0bkv wrote
Reply to comment by utopiah in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
>We do but is it what embedding actually provide or rather some kind of distance between items,
A single embedding is a single vector, encoding a single sentence. To identify a relationship between sentences, you need to compare vectors. Typically this is done with cosine distance between the vectors. The expectation is that if you have a collection of sentences that all talk about cats, the vectors that represent them will exist in a related neighborhood in the metric space.
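As a sketch of that comparison, with made-up three-dimensional vectors standing in for real (much higher-dimensional) sentence embeddings:

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance: 0 for identical directions, larger for dissimilar ones."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up vectors standing in for embeddings of three sentences:
# two about cats, one about finance.
cat_1   = np.array([0.9, 0.1, 0.1])
cat_2   = np.array([0.8, 0.2, 0.1])
finance = np.array([0.1, 0.1, 0.9])

# Sentences about the same topic should sit closer in the metric space.
print(cosine_distance(cat_1, cat_2) < cosine_distance(cat_1, finance))  # True
```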
Simusid OP t1_jbtp8wr wrote
Reply to comment by deliciously_methodic in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
Given three sentences:
- Tom went to the bank to make a payment on his mortgage.
- Yesterday my wife went to the credit union and withdrew $500.
- My friend was fishing along the river bank, slipped and fell in the water.
Reading those, you immediately know that the first two are related because they are both about banks/money/finance. You also know that they are unrelated to the third sentence, even though the first and third share the word "bank". A naive, strictly word-based model might incorrectly associate the first and third sentences.
What we want is a model that can represent the "semantic content" or idea behind a sentence in a way that we can make valid mathematical comparisons. We want to create a "metric space". In that space, each sentence will be represented by a vector. Then we use standard math operations to compute the distances between the vectors. In other words, the first two sentences will have vectors that point basically in the same direction, and the third vector will point in a very different direction.
The job of the language models (BERT, RoBERTa, all-mpnet-base-v2, etc.) is to do the best job possible turning sentences into vectors. The output of these models is very high-dimensional, 768 dimensions or more. We cannot visualize that, so we use tools like UMAP, t-SNE, PCA, and eigendecomposition to find the 2 or 3 most important components and then display them as pretty 2D or 3D point clouds.
In short, the embedding is the vector that represents the sentence in a (hopefully) valid metric space.
Simusid OP t1_jbt962j wrote
Reply to comment by jobeta in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
UMAP()
Simusid OP t1_jbt91tb wrote
Reply to comment by ID4gotten in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
My main goal was just to visualize the embeddings to see if they are grossly different. They are not. That is just a qualitative view. My second goal was to use the embeddings with a trivial supervised classifier. The dataset is labeled with four labels, so I made a generic network to see if there was any consistency in the training. And regardless of hyperparameters, the OpenAI embeddings seemed to always outperform the SentenceTransformer embeddings, slightly but consistently.
This was not meant to be rigorous. I did this to get a general feel of the quality of the embeddings, plus to get a little experience with the OpenAI API.
Simusid OP t1_jbt4y5s wrote
Reply to comment by krishnakumar3096 in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
I was lazy and used the model they show in their code example found here https://platform.openai.com/docs/guides/embeddings/what-are-embeddings.
Also on that page, they show that Ada outperforms Davinci (BEIR score) and is cheaper to use.
Simusid OP t1_jbt13iy wrote
Reply to comment by imaginethezmell in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
8K? I'm not sure what you're referring to.
Simusid OP t1_jbsyp5n wrote
Yesterday I set up a paid account at OpenAI. I have been using the free sentence-transformers library and models for many months with good results. I compared the performance of the two by encoding 20K vectors from this repo https://github.com/mhjabreel/CharCnn_Keras. I did no preprocessing or cleanup of the input text. The OpenAI model is text-embedding-ada-002 and the SentenceTransformer model is all-mpnet-base-v2. The plots are simple UMAP(), with all defaults.

I also built a very generic model with 3 dense layers, nothing fancy. I ran each model ten times for the two embeddings, fitting with EarlyStopping and evaluating with held-out data. The average results were HF 89% and OpenAI 91.1%. This is not rigorous or conclusive, but for my purposes I'm happy sticking with SentenceTransformers. If I need to chase decimal points of performance, I will use OpenAI.
Edit - The second graph should be titled "SentenceTransformer" not HuggingFace.
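The classifier comparison can be sketched as follows. This uses scikit-learn's MLPClassifier as a stand-in for the 3-dense-layer network described above, and random vectors (with a per-class offset so the labels are learnable) as stand-ins for the real embeddings:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for real sentence embeddings: 2000 vectors, 4 labels.
# (all-mpnet-base-v2 produces 768-dim vectors; ada-002 produces 1536-dim.)
X = rng.normal(size=(2000, 768))
y = rng.integers(0, 4, size=2000)
X += np.eye(4)[y] @ rng.normal(size=(4, 768))  # shift each class so it is separable

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small dense network with early stopping, roughly analogous to the
# "3 dense layers, nothing fancy" setup described above.
clf = MLPClassifier(hidden_layer_sizes=(256, 64), early_stopping=True,
                    random_state=0, max_iter=200)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy
```

To compare two embedding sources, you would run this with each set of vectors (same architecture, same splits) and compare the held-out scores over repeated runs.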
Simusid t1_j9w9icj wrote
Reply to comment by mooshz in Handle with care, me, digital, 2023 by musketon
Yanno they could have roped off the pieces and put up a pretentious art card about "man's inhumanity" and "spontaneously donated by unknown patron" and charged even MORE for it
Simusid t1_j9rtqga wrote
Reply to Where should I (24M) go to meet other people my age? Dating in this day an age sucks and I'm tired of meeting people online by poomodoom
OK, this one is going to seem crazy. Take an EMT class. They're pretty short, typically 2 nights per week for 3+ hours, plus a few hours on Saturdays for practicing skills. Programs vary and many are shorter than 3 months. The cost is probably still around $1,200, but many towns offer local scholarships (we do). EMTs are in incredible demand; you can work literally anywhere, usually on your own schedule, part time, weekends, nights. I work one shift per week. Yes, it's SHITTY pay, but that's not why you are there.
You will meet lots of people your age. You will meet other EMTs and medics. You will meet firefighters. You will regularly see nurses.
It's honestly fun, for me it was life changing. Guarantee that you will meet new people in ways you cannot imagine!
Simusid t1_j9pj6s9 wrote
Reply to comment by vesselgroans in I would love to take my husband to a nice dinner somewhere new. We live in Providence but will drive about 30 minutes for something good. Recs please! TYIA by Hawks47
How is Traffords now that it has reopened?
Simusid t1_j3sa98w wrote
Can you share your arXiv link?
Simusid t1_j3eq6ey wrote
Reply to comment by Kingman9K in did y'all know that Yosemite National Park is bigger than Rhode Island by New_Analyst3510
I was on a ranch in Montana that was bigger. That amazed me when I was in high school.
Simusid OP t1_jegwcb6 wrote
Reply to comment by IntelArtiGen in [discussion] Anybody Working with VITMAE? by Simusid
All good info, thanks for the tips. I think ML for audio lags far behind imagery and NLP. I'm particularly interested in transients and weak signals.