Triton is a programming language designed to simplify the development of highly efficient custom operations for Deep Neural Networks.
You can think of Triton as an alternative to CUDA-C in which compute kernels are single-threaded and automatically parallelized. This is done through a modification of the standard C type system which allows multi-dimensional arrays to be first-class citizens. For example, the Triton-C code for a vector addition would be:
```c
// Triton-C
// launched on a 1D grid of (N / 128) programs, where N is the vector length
__global__ void add(TYPE* c, TYPE* a, TYPE* b){
    int pid = get_program_id(0);
    // blocks of pointers to a contiguous 128-element tile of each operand
    TYPE* pa[128] = a + pid * 128 + 0 ... 128;
    TYPE* pb[128] = b + pid * 128 + 0 ... 128;
    TYPE* pc[128] = c + pid * 128 + 0 ... 128;
    // element-wise tile addition
    *pc = *pa + *pb;
}
```
The Triton compiler then analyzes this code and figures out how to distribute the blocks of pointers pa, pb and pc across GPU threads so as to maximize memory coalescing, data reuse and compute efficiency.
The benefits of this approach are most pronounced when writing compute-bound operations. In fact, it is possible to write matrix-multiplication code that runs on tensor cores at speeds on par with cuBLAS, without knowing anything about GPUs!
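To give a sense of what this looks like, below is a sketch of a tiled matrix-multiplication kernel in Triton-C, adapted from the style of the official tutorials. The tile-size macros TM, TN and TK, the `newaxis` broadcasting syntax and the `@` tile-level product operator follow Triton-C conventions, but the exact kernel here is an illustrative sketch rather than the tuned tutorial code:

```c
// Sketch of a Triton-C matrix multiplication (C = A x B);
// TM, TN and TK are tile-size macros set at compilation time.
// launched on a 2D grid of (M / TM) x (N / TN) programs
__global__ void dot(TYPE* A, TYPE* B, TYPE* C,
                    int M, int N, int K, int lda, int ldb, int ldc){
    // coordinates of this program in the 2D launch grid
    int pm = get_program_id(0);
    int pn = get_program_id(1);
    // row, column and reduction indices handled by this program
    int rm[TM] = pm * TM + 0 ... TM;
    int rn[TN] = pn * TN + 0 ... TN;
    int rk[TK] = 0 ... TK;
    // 2D blocks of pointers to the first tiles of A and B
    TYPE* pa[TM, TK] = A + rm[:, newaxis] * lda + rk[newaxis, :];
    TYPE* pb[TK, TN] = B + rk[:, newaxis] * ldb + rn[newaxis, :];
    // accumulate partial products along the K dimension
    float acc[TM, TN] = 0;
    for(int k = K; k > 0; k -= TK){
        acc += (*pa) @ (*pb);  // tile-level matrix product
        pa += TK;
        pb += TK * ldb;
    }
    // write back the result tile
    TYPE* pc[TM, TN] = C + rm[:, newaxis] * ldc + rn[newaxis, :];
    *pc = acc;
}
```

Each program computes one TM x TN tile of C; the compiler decides how the tile-level loads, stores and the `@` reduction are mapped onto threads and tensor cores.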
Triton also provides a Python API that automatically generates, compiles and caches the C code of custom PyTorch ops, so that you never have to write any boilerplate yourself. Check out the tutorials if you want to learn more.
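For a rough idea of what this looks like in practice, here is a sketch based on the vector-addition example above. The `triton.kernel` entry point, the `defines` macro dictionary and the `grid` callback reflect the API of early Triton releases and should be treated as assumptions; the exact signatures may differ across versions:

```python
import torch
import triton

# Triton-C source for the vector-addition kernel shown above
src = """
__global__ void add(TYPE* c, TYPE* a, TYPE* b){
    int pid = get_program_id(0);
    TYPE* pa[128] = a + pid * 128 + 0 ... 128;
    TYPE* pb[128] = b + pid * 128 + 0 ... 128;
    TYPE* pc[128] = c + pid * 128 + 0 ... 128;
    *pc = *pa + *pb;
}
"""

# Compile (and cache) the kernel; `defines` substitutes the TYPE macro.
kernel = triton.kernel(src, defines={'TYPE': 'float'})

N = 128 * 1024
a = torch.rand(N, device='cuda')
b = torch.rand(N, device='cuda')
c = torch.empty(N, device='cuda')

# Launch one program per 128-element tile of the output.
kernel(c, a, b, grid=lambda opt: [triton.cdiv(N, 128)])
```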