
PassingTumbleweed t1_jbri1kj wrote

I won't repeat what other comments said, but there are interesting architectures like H-Transformer that have lower asymptotic complexity in sequence length and scale to longer sequences than the original Transformer. It's also worth noting that, in practice, the MLP cost may dominate the self-attention cost or vice versa, depending on the sequence length and model size.
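To give a sense of the trade-off, here's a rough back-of-the-envelope sketch (my own assumed FLOP formulas, not from the comment): the attention score/value products scale roughly with n²·d, while the QKV/output projections and a 4x-expansion MLP scale with n·d², so which term dominates flips as the sequence length n grows relative to the model width d.

```python
# Rough per-layer FLOP estimates for a standard Transformer block.
# These are illustrative approximations; constants and the 4x MLP
# expansion are assumptions, not exact numbers for any specific model.

def attention_flops(n: int, d: int) -> int:
    proj = 4 * 2 * n * d * d    # Q, K, V, and output projections
    scores = 2 * n * n * d      # Q @ K^T
    values = 2 * n * n * d      # softmax(scores) @ V
    return proj + scores + values

def mlp_flops(n: int, d: int, expansion: int = 4) -> int:
    # up-projection and down-projection of the feed-forward block
    return 2 * 2 * n * d * (expansion * d)

for n, d in [(512, 4096), (65536, 4096)]:
    a, m = attention_flops(n, d), mlp_flops(n, d)
    print(f"n={n:>6}, d={d}: attention/MLP FLOP ratio ~ {a / m:.2f}")
```

With these assumptions, at n=512 and d=4096 the MLP is roughly twice the cost of attention, while at n=65536 attention is several times the MLP cost, which is where sub-quadratic attention variants start to pay off.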

2