I would like to take the word embeddings of a text and visualize them on the same plot (to understand them better). The question is how I should pass the text into the pretrained BERT model. At first, I split the text into sentences and passed each one separately, but I'm not sure this gave the right results.

Comments

neuralbeans t1_irw2mza wrote on October 11, 2022 at 2:03 PM

#77,618

If you're talking about the contextual embeddings that BERT is known for, then those change depending on the sentence used, so you need to supply the full sentence.

sonudofsilence OP t1_irw4bmr wrote on October 11, 2022 at 2:15 PM

#77,699

Replying to neuralbeans (#77,618)

Yes, that's why I want to pass "all the text" into BERT: for example, a word in one sentence should have a vector similar to that of the same word (with the same meaning) in another sentence. How can I accomplish that, given that BERT's maximum number of tokens is 512?

neuralbeans t1_irw4jiw wrote on October 11, 2022 at 2:17 PM

#77,709

You're supposed to pass in each sentence separately, as a list of sentences. You do not pass all the sentences as one string.
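
A minimal sketch of what that could look like, assuming the Hugging Face `transformers` library with PyTorch and the `bert-base-uncased` checkpoint (none of these are named in this thread, so treat them as assumptions). `embed_sentences` and `flatten_token_vectors` are hypothetical helper names, not library functions.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def embed_sentences(sentences, model_name="bert-base-uncased"):
    """Encode a list of sentences in one batch.

    Returns (tokens, vectors, mask), each with one entry per sentence.
    Assumes `torch` and `transformers` are installed; they are imported
    lazily so the pure-Python helper below works without them.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Passing a list encodes every sentence independently; shorter ones are
    # padded and anything beyond 512 tokens is truncated.
    enc = tokenizer(
        sentences,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # (batch, seq_len, hidden_size)

    tokens = [tokenizer.convert_ids_to_tokens(ids.tolist()) for ids in enc["input_ids"]]
    return tokens, hidden.tolist(), enc["attention_mask"].tolist()


def flatten_token_vectors(tokens, vectors, mask):
    """Drop padding and special tokens; return (sentence_index, token, vector) rows."""
    rows = []
    for s, (toks, vecs, m) in enumerate(zip(tokens, vectors, mask)):
        for tok, vec, keep in zip(toks, vecs, m):
            if keep and tok not in SPECIAL_TOKENS:
                rows.append((s, tok, vec))
    return rows
```

Something like `flatten_token_vectors(*embed_sentences(["He sat on the river bank.", "She went to the bank."]))` should then give one row per wordpiece across all sentences, which can be projected to 2D (for example with PCA) and drawn on a single plot.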

sonudofsilence OP t1_irw765w wrote on October 11, 2022 at 2:35 PM

#77,888

Replying to neuralbeans (#77,709)

Yes, I know, but this way the embedding of a word will be created based only on the tokens of the sentence in which it is found, right?

ExchangeStrong196 t1_irw93ux wrote on October 11, 2022 at 2:49 PM

#78,026

Replying to sonudofsilence (#77,888)

Yes. To make sure the contextual token embeddings attend to longer text, you need to use a model that accepts longer sequences. Check out Longformer.
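
A rough sketch of how that might look, assuming the Hugging Face `transformers` library with PyTorch and the `allenai/longformer-base-4096` checkpoint (a 4096-token window instead of 512). `embed_long_text` is a hypothetical helper name. `pack_sentences` is also hypothetical and goes beyond what is suggested here: it is one simple way to group consecutive sentences when the text does not fit even in 4096 tokens.

```python
def pack_sentences(token_counts, max_tokens=4096, reserved=2):
    """Greedily group consecutive sentences so each group fits one model window.

    token_counts[i] is the number of tokens in sentence i; `reserved` leaves
    room for the start and end special tokens. Returns lists of sentence indices.
    """
    budget = max_tokens - reserved
    groups, current, used = [], [], 0
    for i, n in enumerate(token_counts):
        if current and used + n > budget:
            groups.append(current)
            current, used = [], 0
        # A single sentence longer than the budget still gets its own group
        # and would be truncated by the tokenizer later.
        current.append(i)
        used += n
    if current:
        groups.append(current)
    return groups


def embed_long_text(text, model_name="allenai/longformer-base-4096"):
    """Return (tokens, vectors) for one long string, truncated at 4096 tokens.

    Assumes `torch` and `transformers` are installed (imported lazily).
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    enc = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # (1, seq_len, hidden_size)

    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return tokens, hidden[0]
```

With this, every token in a group is embedded with the rest of that group as context, rather than only its own sentence. Tokens in different groups still do not see each other.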

sonudofsilence OP t1_irwah22 wrote on October 11, 2022 at 2:58 PM

#78,106

Thanks