triton.testing.do_bench_cudagraph_proton

triton.testing.do_bench_cudagraph_proton(fn, rep=20, grad_to_none=None, quantiles=None, return_mode='mean')

Benchmark the runtime of kernels invoked by the provided function using the Proton profiler and CUDA graphs.

This function is similar to do_bench_cudagraph, which avoids CPU overhead by replaying a CUDA graph that contains multiple iterations of the provided function. Instead of using CUDA events to measure the total runtime of the graph, it uses the Proton profiler to measure the runtime of each kernel in the graph. This yields finer-grained kernel timings and excludes cache flushes from the measurement.

Compared to do_bench_cudagraph, this function has the following constraints:

- It does not measure GPU operations other than kernels (e.g., memory copies and synchronization).
- It supports only NVIDIA GPUs. AMD GPU support is a TODO.

Parameters:

fn (Callable) – Function to benchmark
rep (int) – Repetition time (in ms)
grad_to_none (torch.Tensor, optional) – Reset the gradient of the provided tensor to None
quantiles (list[float], optional) – Quantiles of the measured runtimes to return. If set, the result is a list of floats.
return_mode (str) – The statistical measure to return. Options are “min”, “max”, “mean”, “median”, or “all”. Default is “mean”.
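
The return_mode options can be pictured as reductions over a list of per-iteration timings. The sketch below is illustrative only: summarize and times_ms are hypothetical names, and this is not Triton's internal implementation. It assumes the timings are already available as plain Python floats in milliseconds.

```python
import statistics


def summarize(times_ms, return_mode="mean"):
    """Reduce per-iteration timings (ms) the way return_mode is described.

    Hypothetical helper for illustration; not part of triton.testing.
    """
    reducers = {
        "min": min,
        "max": max,
        "mean": statistics.fmean,
        "median": statistics.median,
    }
    if return_mode == "all":
        # "all" keeps every measurement instead of reducing to one number.
        return list(times_ms)
    if return_mode not in reducers:
        raise ValueError(f"unsupported return_mode: {return_mode!r}")
    return reducers[return_mode](times_ms)


if __name__ == "__main__":
    samples = [1.0, 2.0, 4.0]
    for mode in ("min", "max", "mean", "median", "all"):
        print(mode, summarize(samples, mode))
```

With "min" the result is less sensitive to one-off slow iterations, while "all" is useful when you want to compute your own statistics.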

Returns:

The runtime(s) in milliseconds: a single float for a scalar return_mode, or a list of floats if quantiles is set or return_mode="all".

Return type:

float | list[float]
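
Example:

A minimal usage sketch based on the signature above. The helper name bench_vector_add and the choice of torch.add as the workload are illustrative assumptions, not part of the API. The sketch assumes PyTorch, an NVIDIA GPU, and a Triton build that provides this function together with Proton; when any of these is missing, the helper returns None instead of benchmarking.

```python
try:
    import torch
    import triton.testing as tt
except ImportError:
    # Allows the sketch to be imported on machines without PyTorch or Triton.
    torch = None
    tt = None


def bench_vector_add(n=1 << 20, return_mode="median"):
    """Return the measured kernel time in ms for x + y, or None if unavailable.

    Hypothetical helper for illustration only.
    """
    bench = getattr(tt, "do_bench_cudagraph_proton", None)
    if torch is None or bench is None or not torch.cuda.is_available():
        return None

    x = torch.randn(n, device="cuda")
    y = torch.randn(n, device="cuda")
    out = torch.empty_like(x)

    # Preallocating `out` keeps the benchmarked function free of per-call
    # allocations, which is the safer choice for CUDA graph capture
    # (an assumption of this sketch, not a documented requirement).
    def fn():
        torch.add(x, y, out=out)

    # rep is the repetition time in ms; return_mode selects the statistic.
    return bench(fn, rep=20, return_mode=return_mode)


if __name__ == "__main__":
    print(bench_vector_add())
```

Because only kernels are measured, a workload dominated by memory copies would report less time here than with do_bench_cudagraph.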