Submitted by Smooth-Earth-9897 t3_11nzinb in MachineLearning
PassingTumbleweed t1_jbri1kj wrote
I won't repeat what other comments said but there are interesting architectures like H-Transformer that have lower asymptotic complexity and scale to longer sequences than the original Transformer. It's also worth noting that in practice the MLP cost may actually dominate the self-attention cost or vice versa, depending on the sequence length and model size.