_learn_faster_ OP t1_ja6zovh wrote on February 27, 2023 at 8:34 AM

Reply to comment by machineko in [D] Faster Flan-T5 inference by _learn_faster_

We have GPUs (e.g. A100) but can only use one GPU per request (not multi-GPU). We are also willing to take a small accuracy hit.

Let me know what you think would be best for us.

When you say compression, do you mean things like pruning and distillation?

_learn_faster_ OP t1_j9nuqe3 wrote on February 23, 2023 at 8:35 AM

Reply to comment by guillaumekln in [D] Faster Flan-T5 inference by _learn_faster_

For Flan-T5, does this only work for a translation task?