
🧬 Cooperative Matrices

This is mostly for people interested in deep learning or in programming GPUs to the fullest. It should also interest people working in graphics, since DLSS, denoisers and upscaling have become so prevalent. Cooperative matrices are known as tensor cores on Nvidia hardware.

Anyways, a GPU is a complex beast, but this section will specifically talk about one area of the chip that was introduced into mainstream consumer cards starting with the RTX 20-series. Ironically, the GPU started off quite specialized for graphics workloads, as in drawing triangles fast, but was opened up and made increasingly flexible. With the introduction of tensor cores and ray tracing cores it somehow became both more flexible and more specialized at the same time. I won't get into the ray tracing cores in this section, but they're pretty awesome!

Cooperative matrices are basically ALUs made for linear algebra operations on small matrix tiles. They are really good at multiplying one small matrix by another and adding a third, accumulating the result as they go. They can do this at a number of different precisions.
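In equation form, each operation is a small fused multiply-add over tiles. The sizes below follow Nvidia's Volta-era diagram and are just one example; they vary by architecture and data type:

$$
D = A \times B + C
$$

where $A$ and $B$ might be 4×4 FP16 tiles, and $C$ and $D$ are 4×4 tiles held in FP16 or FP32.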

[Image: Nvidia's visualization of the mathematical operation a tensor core can perform. Image credit]

While the tensor cores support matrix-matrix multiplication, they are much more limited in the size of the multiplication. For a general linear operation you might still need loops, but instead of looping over individual elements you would be tiling your matrices, sending a small (e.g. 4x4) tile at a time to the hardware while keeping track of your precisions, such as accumulating in a higher precision than the inputs. You can read more about it here and for Vulkan here.
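To make the tiling idea concrete, here is a minimal NumPy sketch of that pattern (purely conceptual and CPU-side, not actual tensor core code; the tile size, shapes and names are assumptions for the example, and the real CUDA API works on larger warp-level fragments):

```python
import numpy as np

TILE = 4  # illustrative tile size

def tiled_matmul(A, B, C):
    """Compute D = A @ B + C one small tile at a time.

    Inputs are float16; each partial product is accumulated in float32,
    mirroring how tensor cores accumulate at a higher precision.
    Shapes are assumed to be multiples of TILE for brevity.
    """
    M, K = A.shape
    _, N = B.shape
    D = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            # Start the accumulator from the bias tile C, in float32.
            acc = C[i:i+TILE, j:j+TILE].astype(np.float32)
            for k in range(0, K, TILE):
                a = A[i:i+TILE, k:k+TILE].astype(np.float32)
                b = B[k:k+TILE, j:j+TILE].astype(np.float32)
                acc += a @ b  # one tile-sized multiply-accumulate
            D[i:i+TILE, j:j+TILE] = acc
    return D

A = np.random.randn(16, 16).astype(np.float16)
B = np.random.randn(16, 16).astype(np.float16)
C = np.random.randn(16, 16).astype(np.float16)
D = tiled_matmul(A, B, C)
```

On the GPU, the inner multiply-accumulate is what a single cooperative matrix operation replaces; the loops over tiles remain your job.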

If you keep the calculations numerically stable, you can even keep all of your weights in 8-bit floating point during the training of a neural network, while accumulating in 16-bit floating point or greater, which greatly reduces the bandwidth needed for training. For inference you can get another big speedup by going all the way down to 8-bit integers. Remember that integers compress better than floating point numbers, so if you quantize your entire network before inference, you get faster inference and lower power consumption. Cooperative matrices were first available in CUDA, but have since been made available in Vulkan, although their usage there is not that widespread yet, as it is a fairly recent addition.
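As a rough illustration of what quantizing for inference means, here is a small sketch of symmetric per-tensor int8 quantization (the scheme and names are my own choices for the example; production toolchains typically quantize per channel and use calibration data):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: x is approximated by scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(qa, sa, qb, sb):
    """Integer matmul with int32 accumulation, then dequantize to float32."""
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

x = np.random.randn(8, 16).astype(np.float32)   # activations
w = np.random.randn(16, 4).astype(np.float32)   # weights
qx, sx = quantize_int8(x)
qw, sw = quantize_int8(w)
y_approx = int8_matmul(qx, sx, qw, sw)
print(np.abs(y_approx - x @ w).max())           # small quantization error
```

The int8 multiply with int32 accumulation is the shape of operation the integer modes of the tensor cores are built for.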

[Image: Nvidia's visualization of a cooperative matrix. Image credit]

If you are working with a library like PyTorch and running training on a GPU, you might be able to get PyTorch to make use of the tensor cores through automatic mixed precision.
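A minimal sketch of what that looks like, assuming a CUDA-capable GPU; the tiny model and random data here are just stand-ins to keep the example self-contained:

```python
import torch
import torch.nn as nn

# Tiny stand-in model and optimizer, just to make the sketch runnable.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()  # rescales the loss so FP16 gradients don't underflow

for step in range(100):
    inputs = torch.randn(64, 512, device="cuda")
    targets = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    # Ops inside autocast run in lower precision where it is safe,
    # which is what lets the matmuls land on the tensor cores.
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = nn.functional.cross_entropy(outputs, targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()
```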

Additional Reading

You can check out some material on tensor cores and on programming tensor cores in CUDA, or you can check out cooperative matrices and machine learning in Vulkan. There's also some material on Accelerating inference with sparsity, and a series of videos regarding how to use tensor cores for mixed precision training.