Performance Tuning Tools
This page covers tools for tuning the inference performance of ONNX Runtime models across different Execution Providers and programming languages.
ONNX Go Live (OLive) Tool
The ONNX Go Live (OLive) tool is a Python package that automates the process of accelerating models with ONNX Runtime (ORT).
It contains two parts:
- model conversion to ONNX with correctness checking
- auto performance tuning with ORT
Users can run these two together through a single pipeline or run them independently as needed.
For a quick start with the Microsoft OLive tool, refer to the notebook tutorials and command-line examples.
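As a rough illustration of what OLive's first stage automates, the following hand-rolled sketch exports a PyTorch model to ONNX and checks that ONNX Runtime reproduces the original outputs. The model, file name, input name, and tolerances here are all illustrative placeholders:

```python
import numpy as np
import torch
import onnxruntime as rt

# Hypothetical model; OLive would drive conversion and tuning for you.
model = torch.nn.Linear(4, 2).eval()
dummy = torch.randn(1, 4)

# Step 1: convert to ONNX.
torch.onnx.export(model, dummy, "model.onnx", input_names=["x"], output_names=["y"])

# Step 2: correctness check against the source framework.
expected = model(dummy).detach().numpy()
sess = rt.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
actual = sess.run(None, {"x": dummy.numpy()})[0]
np.testing.assert_allclose(expected, actual, rtol=1e-3, atol=1e-5)
```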
onnxruntime_perf_test.exe tool
The onnxruntime_perf_test.exe tool (available from the build drop) can be used to test various knobs. Find the usage instructions by running onnxruntime_perf_test.exe -h.
You can enable ONNX Runtime latency profiling from Python:

```python
import onnxruntime as rt

sess_options = rt.SessionOptions()
# Write a JSON trace of session activity (operator latencies, threading, etc.).
sess_options.enable_profiling = True
```
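A minimal end-to-end sketch follows: pass the options when creating the session, run inference, and call end_profiling() to finish the trace and obtain its file name. The model path "model.onnx" and input name "x" are placeholders for your own model:

```python
import numpy as np
import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True

# "model.onnx" and the input name "x" are placeholders.
sess = rt.InferenceSession("model.onnx", sess_options, providers=["CPUExecutionProvider"])
sess.run(None, {"x": np.zeros((1, 4), dtype=np.float32)})

# Stops profiling and returns the path of the generated JSON trace file.
profile_file = sess.end_profiling()
print(profile_file)
```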
If you are using the onnxruntime_perf_test.exe tool, you can add -p [profile_file] to enable performance profiling.
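For example, with a hypothetical model path (the exact set of options varies by version, so check onnxruntime_perf_test.exe -h):

```
onnxruntime_perf_test.exe -p ort_profile model.onnx
```

Once the run completes, the tool writes the profiling data to a JSON file whose name is derived from the name given to -p.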
Performance and Profiling Report
In both cases, you will get a JSON file containing detailed performance data (threading, latency of each operator, and so on). This is a standard performance tracing file; to view it in a user-friendly way, open it with chrome://tracing:
- Open Chrome browser
- Type chrome://tracing in the address bar
- Load the generated JSON file
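Beyond visual inspection, the trace is plain JSON, so you can also aggregate it programmatically. A quick sketch (the profile file path is a placeholder) that sums the recorded dur values per event:

```python
import json
from collections import Counter

# Placeholder path; use the file produced by your profiling run.
with open("onnxruntime_profile.json") as f:
    events = json.load(f)

# Sum recorded durations (microseconds) per (category, name).
totals = Counter()
for e in events:
    if "dur" in e:
        totals[(str(e.get("cat")), e["name"])] += e["dur"]

for (cat, name), dur in totals.most_common(10):
    print(f"{cat:>8}  {name:<48}  {dur} us")
```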
Profiling CUDA Kernels
To profile Compute Unified Device Architecture (CUDA) kernels, add the CUPTI library to PATH and use an onnxruntime binary built from source with --enable_cuda_profiling. Performance numbers from the device will then be attached to those from the host.
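For example, on Linux a from-source build and library setup might look like the following; the CUPTI location under extras/CUPTI/lib64 is typical of a default CUDA install, but verify it on your machine:

```
# Add --cuda_home/--cudnn_home if they are not discoverable from your environment.
./build.sh --config Release --use_cuda --enable_cuda_profiling --build_wheel
export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
```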
Single Kernel example:
In the following example, the “Add” operator on the host launched a CUDA kernel on the device named “ort_add_cuda_kernel”, which lasted for 33 microseconds.
- JSON:

```
{"cat":"Node", "name":"Add_1234", "dur":17, ...}
{"cat":"Kernel", "name":"ort_add_cuda_kernel", "dur":33, ...}
```
Multiple Kernels example:
If an operator calls multiple kernels during execution, the performance numbers of all those kernels are listed in calling order:
- JSON:

```
{"cat":"Node", "name":<name of the node>, ...}
{"cat":"Kernel", "name":<name of the kernel called first>, ...}
{"cat":"Kernel", "name":<name of the kernel called next>, ...}
```
Ort_perf_view tool
ONNX Runtime also offers a tool to render the profiling statistics as a summarized view in the browser. It takes the profiling JSON file as input and reports GPU and CPU performance in the form of a treemap.