Harsh Menon

High-dimensional layout representation for vector distribution

Motivation

Matrix multiplication is one of the key operators in scientific computing. In machine learning specifically, it shows up in many different models (such as LLAMA2) and accounts for a majority of the computation. As a result, several hardware vendors have added specialized units to accelerate matrix multiplication. For example, NVIDIA GPUs have TensorCores while AMD GPUs have MatrixCores.

In order to generate high-performance matrix multiplication code for any of these devices, ML compilers make use of this specialized hardware by emitting the corresponding matrix-multiply-accumulate (MMA) instructions. However, each of these instructions has very specific requirements on how the data needs to be loaded into registers in order to correctly use the hardware. In this post, we go into the details of these requirements and present a general way of representing this register data layout.

Register Data Layout

The figure below shows the register layout of the A matrix for the NVIDIA mma.m16n8k16 operation.

This instruction multiplies a 16 x 16 A matrix and a 16 x 16 B matrix... actually, it multiplies a 16 x 16 A matrix and a 16 x 8 B matrix, adds the result to a 16 x 8 C matrix, and produces a 16 x 8 D matrix. This is shown mathematically below.

D = A \times B + C
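As a concrete sketch of the math (using NumPy, not the actual hardware instruction), the operand shapes for m16n8k16 look like this; the fp16 inputs and fp32 accumulator are one common configuration, not the only one the instruction supports:

```python
import numpy as np

# Operand shapes for mma.m16n8k16: A is 16x16, B is 16x8,
# C and D are 16x8. A plain NumPy sketch of D = A x B + C.
M, N, K = 16, 8, 16
A = np.random.rand(M, K).astype(np.float16)
B = np.random.rand(K, N).astype(np.float16)
C = np.random.rand(M, N).astype(np.float32)

# Accumulate in fp32, mirroring the common fp16-in/fp32-accumulate mode.
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.shape)  # (16, 8)
```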

The figure above shows the 16 x 16 A matrix where each element of the matrix is annotated with the id of the thread that owns it. The way these instructions work is that multiple threads within a warp/wave work together to load the input matrices and then hand off to the specialized hardware to compute the matrix multiplication.
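This ownership pattern can be written down as a small function. The mapping below is a sketch based on the A-fragment layout described in the NVIDIA PTX ISA documentation for fp16 m16n8k16 (consult the PTX ISA for the authoritative tables): lane `(row % 8) * 4 + (col % 8) // 2` of the 32-thread warp owns element `(row, col)` of the 16 x 16 A matrix, so each lane holds 16 * 16 / 32 = 8 elements:

```python
from collections import Counter

def owner_lane(row: int, col: int) -> int:
    # Sketch of the A-fragment thread mapping for mma.m16n8k16 (fp16),
    # following the layout documented in the NVIDIA PTX ISA: the 16x16
    # tile is split into 8x8 quadrants that repeat the same 32-lane
    # pattern, with each lane holding a pair of adjacent fp16 values.
    return (row % 8) * 4 + (col % 8) // 2

# Every one of the 32 lanes should own exactly 8 of the 256 elements.
counts = Counter(owner_lane(r, c) for r in range(16) for c in range(16))
print(len(counts), set(counts.values()))  # 32 {8}
```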