Viewing a single comment thread. View all comments

BeatLeJuce t1_j4m9pp6 wrote

Overall nice, but the article also uses some expressions without ever explaining them. For example: what is an H100, and what is an A100? Somewhere in the article it says that H100 = RTX 40 cards; somewhere else it says the A100 is an RTX 40 card. Which is which?

Also, what is TF32? It's an expression that appears in a paragraph without explanation.

0

timdettmers t1_j4mfra6 wrote

This is good feedback. Wanted to make another pass this morning to clean references like this up, but did not have the time. Will try to be more clear about this in the next update (later today, probably).

18

JustOneAvailableName t1_j4my6kp wrote

Great article!

You say this about sparsity:

> It does not seem so. Since the granularity of the sparse matrix needs to have 2 zero-valued elements, every 4 elements, the sparse matrices need to be quite structured.

Wouldn't a slightly more structured dropout be a perfect fit?
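To make the 2:4 constraint concrete, here's a rough NumPy sketch (my own illustration, not from the article) of what pruning to that pattern looks like: every group of 4 consecutive weights keeps its 2 largest-magnitude entries and zeroes the other 2.

```python
import numpy as np

def prune_2_to_4(w: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude values in every group of 4 consecutive weights."""
    groups = w.reshape(-1, 4)                        # view weights as groups of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2] # indices of the 2 smallest |values| per group
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)     # zero them out
    return pruned.reshape(w.shape)

w = np.random.randn(4, 8).astype(np.float32)
w_sparse = prune_2_to_4(w)
# every group of 4 consecutive weights now has at least 2 zeros
assert (w_sparse.reshape(-1, 4) == 0).sum(axis=1).min() >= 2
```

A structured dropout would have to produce exactly this kind of per-group pattern for the sparse tensor cores to be able to exploit it.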

3

Freonr2 t1_j4mvhhf wrote

A100 and H100 are data center GPUs. Very expensive, tuned for training large models. They also use on-package HBM memory instead of GDDR on the board for improved memory bandwidth.

A100 is Ampere, the same architecture as the 30xx series, but built for training, with a lot more tensor cores and less focus on CUDA cores. It's most often seen in the SXM form factor in special servers that offer substantially more NVLink bandwidth between GPUs for multi-GPU training (and the special servers the SXM cards go into also have considerable network bandwidth for clustered training). They do make PCIe versions. It does not support FP8. A typical setup is a DGX server with 8x A100. These are a few hundred grand for the whole server, even before the power and network requirements, etc., needed to utilize it.

H100 is Hopper, which is newer than Ampere. I don't believe Hopper was ever made into a consumer part, but it's perhaps closer to Ada (40xx) in features than it is to Ampere (30xx), since it has FP8. It's basically the replacement for the A100, much like the 40xx is the replacement for the 30xx. These are again often sold in HGX server boxes for several hundred grand. Unsure if there is a PCIe version?

Nvidia removed NVLink from the 40xx series, but it's still technically available on 3090s. They're sort of segmenting the market here.

If they decide to release a 4090 with 48GB (or an Ada Titan, or whatever branding they decide on), it could be a monster card if you only need or want a single card, but it may also be $3k+...

12

init__27 OP t1_j4mb2iy wrote

The author is hanging out and collecting feedback on here. I'm sure he'll correct it in an update.

Maybe I'm too deep in the weeds, but if I were in the author's shoes I would also assume that the reader knows of these terms and cards.

7

royalemate357 t1_j4migdx wrote

TF32 is TensorFloat-32, which is a relatively new precision format for newer GPUs. Basically, when doing math, it uses the same number of mantissa bits as FP16 (10 bits) and the same number of exponent bits as normal float32 (8 bits). More on it here: https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
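If you're on PyTorch with an Ampere-or-newer GPU, here's a minimal sketch of how TF32 gets toggled (these are PyTorch's own flags; tensors stay stored as ordinary 32-bit floats, only the tensor-core math changes):

```python
import torch

# Let matmuls and cuDNN convolutions use TF32 tensor cores
# (1 sign + 8 exponent + 10 mantissa bits of significance).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b  # inputs rounded to TF32, products accumulated in FP32
```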

3

BeatLeJuce t1_j4p938f wrote

Thanks for the explanation! Why call it TF32 when it appears to have 19 bits? (IIUC it's bfloat16 with 3 additional mantissa bits?)

1

royalemate357 t1_j4qdfwj wrote

Tbh I don't think it's an especially good name, but I believe the answer to your question is that it actually uses 32 bits to store a TF32 value in memory. It's just that when they pass it into the tensor cores to do matmuls, they temporarily downcast it to this 19-bit precision format.

>Dot product computation, which forms the building block for both matrix multiplies and convolutions, rounds FP32 inputs to TF32, computes the products without loss of precision, then accumulates those products into an FP32 output (Figure 1).

(from https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/)
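If it helps, here's a rough NumPy sketch (my own, not from the NVIDIA post) that emulates that rounding step by masking FP32's 23 mantissa bits down to 10; the real hardware presumably rounds rather than truncates, but the idea is the same:

```python
import numpy as np

def round_to_tf32(x: np.ndarray) -> np.ndarray:
    """Emulate TF32: keep FP32's sign and 8 exponent bits, but only 10 mantissa bits."""
    bits = x.astype(np.float32).view(np.uint32)
    bits &= np.uint32(0xFFFFE000)  # drop the low 13 of FP32's 23 mantissa bits
    return bits.view(np.float32)

x = np.array([np.float32(1.0) + np.float32(1e-5)])
print(x[0], round_to_tf32(x)[0])  # the 1e-5 detail is below TF32's ~2^-10 resolution and is lost
```

So the value still occupies 32 bits in memory; it's only the significand that gets coarsened before the multiply, with the accumulation done back in FP32.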

3