Submitted by Vae94 t3_z7rn5o in MachineLearning
eeng_ t1_iy82r1q wrote
This is probably obvious to you, but most of the frames in a long video are redundant and provide little additional information. You could easily extract some key frames (e.g. subtract the previous frame from the current frame and apply a fixed threshold), then run your network only on those key frames, and finally ensemble the key-frame predictions into a single label per video.
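A minimal sketch of the frame-differencing idea above, using NumPy arrays as stand-in frames; the function name and threshold value are illustrative, not from any particular library:

```python
import numpy as np

def extract_key_frames(frames, threshold=10.0):
    """Keep frames whose mean absolute pixel difference from the
    previously kept frame exceeds a fixed threshold."""
    key_indices = [0]  # always keep the first frame
    prev = frames[0].astype(np.float32)
    for i in range(1, len(frames)):
        cur = frames[i].astype(np.float32)
        if np.abs(cur - prev).mean() > threshold:
            key_indices.append(i)
            prev = cur
    return key_indices

# Toy "video": 10 identical 8x8 frames with a scene change at frame 5.
video = np.zeros((10, 8, 8), dtype=np.uint8)
video[5:] = 255  # sudden change
print(extract_key_frames(video, threshold=10.0))  # → [0, 5]
```

Per-video classification would then run the network only on the returned indices and average (or vote over) those predictions.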
Vae94 OP t1_iy8fy1f wrote
Yes. Thanks for sanity check!
I was thinking of first coming up with an algorithm to find outliers and then training the LSTM only on those outliers. For that I'd probably need some meta-algorithm and train both the LSTM and the trimming network at the same time.
Does something like this already exist in the literature?