--dany-- t1_jedo4gy wrote

How about using the embeddings of the whole post? Then you just have to train a model to predict a trait from one post. A person's overall trait can be the average of the traits predicted from all of their posts. I don't see a point in using an RNN over posts.
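A minimal sketch of this suggestion, assuming the posts are already embedded: `predict_trait` stands in for whatever trained model you use, with placeholder linear weights rather than real learned parameters.

```python
# Sketch: predict a trait score per post from that post's embedding,
# then average over a person's posts to get their overall trait.
# `predict_trait` is a toy stand-in for a trained regressor/classifier.

def predict_trait(embedding, weights):
    """Toy linear predictor: dot(embedding, weights)."""
    return sum(e * w for e, w in zip(embedding, weights))

def person_trait(post_embeddings, weights):
    """Average the per-post predictions into one trait score."""
    scores = [predict_trait(e, weights) for e in post_embeddings]
    return sum(scores) / len(scores)
```

For example, two posts embedded as `[1.0, 0.0]` and `[0.0, 1.0]` with weights `[0.5, 1.5]` score 0.5 and 1.5, averaging to 1.0.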

1

-pkomlytyrg t1_jeeeo36 wrote

I would embed the whole post (BigBird or OpenAI embeddings have really long context lengths), and just feed that vector into an RNN. As long as the post is between one and 9000 tokens, the embedding shape will remain the same
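A rough sketch of what "feed that vector into an RNN" could look like, assuming each post is already a fixed-size embedding. The recurrence below uses a scalar hidden state and placeholder weights purely to keep the example self-contained; a real model would use vector hidden states and learned parameters.

```python
import math

def rnn_over_posts(post_embeddings, w, u):
    """Toy Elman-style recurrence over a user's sequence of posts:
    h_t = tanh(dot(x_t, w) + u * h_{t-1}).
    The final hidden state summarizes the whole post sequence."""
    h = 0.0
    for x in post_embeddings:
        h = math.tanh(sum(xi * wi for xi, wi in zip(x, w)) + u * h)
    return h
```

Because every post embedding has the same shape, the sequence length (number of posts) can vary freely from user to user.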

1

danilo62 OP t1_jeextek wrote

Oh, so those models are able to produce fixed size embeddings of texts? I wasn't aware of that

2

-pkomlytyrg t1_jef4weq wrote

Generally, yes. If you use a model with a long context length (BigBird or OpenAI's ada-002), you'll likely be fine unless the articles you're embedding exceed the token limit. If you're using BERT or another, smaller model, you have to chunk and average; that can produce fixed-size vectors, but you gotta put the work in haha
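A minimal sketch of the chunk-and-average approach for short-context models. `embed_chunk` is a stand-in for a real encoder call (e.g. a BERT forward pass); here it just mean-pools token ids so the example runs on its own.

```python
# Chunk a long token sequence into windows the model can handle,
# embed each window, then average into one fixed-size vector.

def chunk(tokens, max_len=512):
    """Split a token sequence into windows of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def embed_chunk(tokens, dim=4):
    """Placeholder encoder: a real one would return a learned dim-size vector."""
    mean = sum(tokens) / len(tokens)
    return [mean] * dim

def embed_long_text(tokens, max_len=512, dim=4):
    """Embed each chunk, then average element-wise to a fixed-size vector."""
    vecs = [embed_chunk(c, dim) for c in chunk(tokens, max_len)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

The output dimension stays `dim` no matter how long the input is, which is what makes the downstream model's input shape constant.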

1

danilo62 OP t1_jefnror wrote

Yeah, I'm gonna try both options (with BERT and the bigger models), but since I'm working with a big dataset, I'm not sure I'll be able to use the larger models due to the token and request limits. Thanks for the help

1

danilo62 OP t1_jeeygpe wrote

But even then, with the other features (sentiment analysis, tf-idf), how would I feed a vector containing a varying number of tokens along with other types of features? I can't see how you would do this using an RNN. That is, for each post

1

--dany-- t1_jefcm9m wrote

The embedding contains all the information, like sentiment or tf-idf. You just need to train a model to predict the trait from a post embedding, then average over all posts by a person. I didn't suggest using an RNN. Are you sure you were replying to my comment?

1

danilo62 OP t1_jefo88t wrote

Oh, I hadn't realized that you meant that the embedding would contain information about other features, I get it now. I was referencing an RNN since I thought it was the only option due to the variable input size. Thanks

1