Submitted by Shardsmp t3_zil35t in MachineLearning
herokocho t1_izukksc wrote
TPUs are massively better on price/performance at cluster scale in practice, because the faster interconnect leads to higher utilization, but worse on price/performance at single-node scale.
Shardsmp OP t1_izwhsfm wrote
is there any data to back this up?
How do I know exactly where the line is, i.e. at what scale does it become worth it to use a TPU?
herokocho t1_izxnzhd wrote
not aware of any good comparisons out there; this is all anecdata from looking at profiler traces while training diffusion models and noticing I was communication-bottlenecked even on TPUs, so on GPUs it would be much worse.
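for reference, here's a minimal sketch of how you might capture a trace like that with the JAX profiler (`train_step`, `params`, and `batches` are hypothetical placeholders for your own training loop); open the result in TensorBoard's profiler or Perfetto and look at how much of each step goes to collectives like all-reduce:

```python
import jax

# capture a trace for a few steps; view it in TensorBoard (profiler
# plugin) or at ui.perfetto.dev. train_step/params/batches are
# placeholders for your own training loop.
with jax.profiler.trace("/tmp/jax-trace"):
    for batch in batches[:10]:
        params = train_step(params, batch)
    # block until async dispatch finishes so the trace reflects real work
    jax.block_until_ready(params)
```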
it's usually better to use TPUs as soon as you'd otherwise need multiple GPU nodes, and basically always better at v4-128 scale and above (a v4-128 has 2x faster interconnect than anything smaller).
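as a rough way to find the line for your own model, you can do roofline-style arithmetic: compare the estimated per-step gradient all-reduce time against the per-step compute time. all the numbers below are illustrative placeholders, not measurements:

```python
# back-of-envelope check for data-parallel training: if estimated
# all-reduce time is a large fraction of compute time and can't be
# overlapped, faster interconnect (e.g. a TPU pod) starts to pay off.
# all numbers are illustrative placeholders, not measurements.

n_params = 1e9               # model size: 1B parameters
bytes_per_grad = 2           # bf16 gradients
bandwidth_gbs = 100          # per-device interconnect bandwidth, GB/s
compute_time_s = 0.5         # measured per-step compute time

# ring all-reduce moves roughly 2x the gradient bytes per device
allreduce_time_s = (2 * n_params * bytes_per_grad) / (bandwidth_gbs * 1e9)

print(f"all-reduce ~{allreduce_time_s:.3f}s vs compute {compute_time_s:.3f}s")
if allreduce_time_s > 0.5 * compute_time_s:
    print("likely communication-bound unless comms overlap with compute")
```

if the all-reduce term dominates at the bandwidth your GPU cluster gives you, you're in the regime where TPU interconnect wins.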