pommedeterresautee OP t1_ittwob9 wrote
Reply to comment by juliensalinas in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
Thank you, Julien, for your kind message.
We would be very happy to get your feedback in the context of the NLP Cloud SaaS: does it cover some of your needs, what would you expect that is not there yet and not on the roadmap, any pesky bugs, etc.
pommedeterresautee OP t1_ittvxdq wrote
Reply to comment by ginsunuva in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
We are actively looking for new names that could make things even more confusing.
If you have some ideas, please share them with us 🙃
pommedeterresautee OP t1_ittsubj wrote
Reply to comment by ptillet in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
Thanks a lot for *your* work and your message :-)
Regarding the bugs, so far they have been mostly workable. We are following the MLIR rewrite with lots of excitement and trying to prepare ourselves for it.
I really wonder what will happen in the ML community when PyTorch releases TorchDynamo / Inductor and so many people start using Triton in their day-to-day work. Tens of thousands of people, or more, from different backgrounds may start writing kernels...
As they say, what a time to be alive!
pommedeterresautee OP t1_itty9y4 wrote
Reply to comment by ganzzahl in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
Yes we are!
The post links to the T5 notebook; in a quick test, the speedup on T5 was really high (6X), and it's just the beginning. The existing kernels probably already work with most generative language models (GPT-2, etc.); we just need to write the replacement patterns (which search for a PyTorch subgraph and swap in our kernels).
T5 notebook: https://github.com/ELS-RD/kernl/blob/main/tutorial/t5%20e2e.ipynb
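For the impatient, applying the kernels is meant to be close to a one-liner on an encoder model. A minimal sketch, assuming the `optimize_model` entry point shown in the project README (for a generative model like T5, see the notebook above for the full setup):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from kernl.model_optimization import optimize_model  # entry point per the project README

model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Swaps matched PyTorch subgraphs for the fused Triton kernels in place.
optimize_model(model)

inputs = tokenizer("Hello, kernels!", return_tensors="pt").to("cuda")
with torch.inference_mode(), torch.cuda.amp.autocast():  # fp16 inference assumed
    outputs = model(**inputs)
```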
We are currently working on RMSNorm, a kind of simplified LayerNorm used in T5 (the kernel is done and merged; we are now focusing on the replacement pattern).
Quite surprisingly, RMSNorm brings a big, unexpected speedup on top of what we already had! If you want to follow this work: https://github.com/ELS-RD/kernl/pull/107
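For reference, RMSNorm normalizes by the root mean square only, with no mean subtraction and no bias, which is what makes it cheaper than LayerNorm. A minimal eager PyTorch version for illustration (the fused Triton kernel lives in the PR above):

```python
import torch

class RMSNorm(torch.nn.Module):
    """T5-style RMSNorm: scale by the root mean square, no centering, no bias."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unlike LayerNorm, only the second moment is used: x * rsqrt(mean(x^2) + eps)
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)
```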
If you can't wait to use these kernels on your own model, there is a section in the project README that explains how to write replacement patterns; it should be quite easy. A sketch of the idea follows below.
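To give a flavor of what a replacement pattern is, here is a rough sketch of the general idea using vanilla `torch.fx` (not necessarily kernl's exact mechanism; the README is authoritative): you describe the eager subgraph to search for and the call that should replace it.

```python
import torch
import torch.fx

# Hypothetical stand-in for a fused Triton kernel wrapper.
def fused_rms_norm(x, weight, eps):
    return weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

# Pattern: the eager PyTorch subgraph we want to find...
def pattern(x, weight, eps):
    return weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

# ...and the replacement that calls the fused kernel instead.
def replacement(x, weight, eps):
    return fused_rms_norm(x, weight, eps)

def optimize(module: torch.nn.Module) -> torch.fx.GraphModule:
    gm = torch.fx.symbolic_trace(module)                # capture the model as a graph
    torch.fx.replace_pattern(gm, pattern, replacement)  # rewrite matching subgraphs
    return gm
```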