Submitted by Smooth-Earth-9897 t3_11nzinb in MachineLearning
PassingTumbleweed t1_jbri1kj wrote
I won't repeat what other comments said but there are interesting architectures like H-Transformer that have lower asymptotic complexity and scale to longer sequences than the original Transformer. It's also worth noting that in practice the MLP cost may actually dominate the self-attention cost or vice versa, depending on the sequence length and model size.