MazenAmria OP t1_izt68w9 wrote

I'm computing the teacher model's output inside a with torch.no_grad(): block.
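For reference, a minimal sketch of what such a distillation step might look like. The tiny teacher and student models, the temperature T, and the KL-divergence loss are all assumptions made for illustration, not the setup from this thread:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the real models: a wider teacher, a narrower student.
teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10))
teacher.eval()  # affects dropout/batch-norm only; it does not disable autograd

optimizer = torch.optim.SGD(student.parameters(), lr=0.1)
x = torch.randn(8, 32)  # dummy batch
T = 2.0                 # assumed distillation temperature

# Teacher forward pass: no graph is built, so no activations are kept for backward.
with torch.no_grad():
    teacher_logits = teacher(x)

# Student forward pass: tracked by autograd as usual.
student_logits = student(x)

# Soft-target loss: KL divergence between temperature-scaled distributions.
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T

optimizer.zero_grad()
loss.backward()   # gradients flow only into the student
optimizer.step()
```

Only the student is handed to the optimizer, and the teacher's parameters never receive gradients.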

suflaj t1_iztjolh wrote

Then it's strange. Unless your student model is about the same size as the teacher, there is no reason a no_grad teacher plus a student should be as resource-intensive as training the teacher with backprop.

As a rule of thumb, you should be using several times less memory. How much less are you using for the same batch size in your case?
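One rough way to see where the difference comes from is to count the bytes autograd keeps around for the backward pass, with and without no_grad. The sketch below is an illustration under assumptions: the small teacher network and the helper saved_for_backward_bytes are hypothetical, and the count is approximate (it includes saved parameters and may count a tensor more than once):

```python
import torch
import torch.nn as nn

def saved_for_backward_bytes(fn):
    """Sum the sizes of tensors autograd saves for backward while fn runs."""
    total = 0

    def pack(t):
        nonlocal total
        total += t.numel() * t.element_size()
        return t  # store the tensor unchanged

    # The pack hook fires each time an operation saves a tensor for backward.
    with torch.autograd.graph.saved_tensors_hooks(pack, lambda t: t):
        fn()
    return total

# Hypothetical teacher and dummy batch, for illustration only.
teacher = nn.Sequential(
    nn.Linear(64, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)
x = torch.randn(128, 64)

def forward_no_grad():
    with torch.no_grad():
        teacher(x)

with_grad = saved_for_backward_bytes(lambda: teacher(x))
without_grad = saved_for_backward_bytes(forward_no_grad)
print(f"saved with grad: {with_grad} bytes, under no_grad: {without_grad} bytes")
```

On a GPU, comparing torch.cuda.max_memory_allocated() after one training step of each setup at the same batch size would give the actual peak numbers.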