Triton is a programming language designed to simplify the development of highly efficient custom operations for Deep Neural Networks.

You can think of Triton as an alternative to CUDA-C in which compute kernels are single-threaded and automatically parallelized. This is done through a modification of the standard C type system that makes multi-dimensional arrays first-class citizens. For example, the Triton-C code for a vector addition would be:

__global__ void add(TYPE* c, TYPE* a, TYPE* b){
   // each program instance handles one 128-element tile
   int pid = get_program_id(0);
   // blocks of pointers into the tile owned by this instance
   TYPE* pa[128] = a + pid*128 + 0 ... 128;
   TYPE* pb[128] = b + pid*128 + 0 ... 128;
   TYPE* pc[128] = c + pid*128 + 0 ... 128;
   // element-wise addition over the whole block
   *pc = *pa + *pb;
}

The Triton compiler will then look at this code and figure out how to distribute the blocks of pointers p{a,b,c} across different GPU threads so as to maximize memory coalescing, data-reuse and compute efficiency.
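To make the execution model concrete, here is a minimal CPU sketch (in plain Python, with hypothetical helper names) of what "single-threaded and automatically parallelized" means: each program instance, identified by `pid`, runs the sequential kernel body over its own 128-element tile, and launching one instance per tile covers the whole vector. On a real GPU, the Triton compiler decides how each tile is mapped onto threads.

```python
BLOCK = 128  # tile size, matching the 128-element blocks in the kernel above

def add_program(pid, c, a, b):
    # Sequential body of one program instance; corresponds to
    # pc = c + pid*128 + 0 ... 128 and *pc = *pa + *pb in the Triton-C code.
    for i in range(pid * BLOCK, (pid + 1) * BLOCK):
        c[i] = a[i] + b[i]

def launch_add(c, a, b, n):
    # The "grid": one program instance per 128-element tile.
    # Assumes n is a multiple of BLOCK for simplicity.
    for pid in range(n // BLOCK):
        add_program(pid, c, a, b)

n = 4 * BLOCK
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
c = [0.0] * n
launch_add(c, a, b, n)
```

In the sketch the instances run sequentially; on a GPU they run concurrently, and the compiler, not the programmer, chooses the thread-level layout of each tile.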

The benefits of this approach are most pronounced when writing compute-bound operations. In fact, it is possible to write matrix-multiplication code for tensor cores on par with cuBLAS — without knowing anything about GPUs!

Triton also provides a Python API that automatically generates, compiles and caches PyTorch custom-op C code, so that you never have to write any boilerplate code yourself. Check out the tutorials if you want to learn more.