Model Deployment
1 Performance Testing
[Note] Before formally deploying an AI model, we strongly recommend running a performance test of the model on the chip side to make sure its inference performance meets expectations.
The spacemit-ort/bin/onnxruntime_perf_test tool in the SDK directory can quickly measure the pure inference performance of an AI model on the chip side. The tool is compatible with ONNX models, so you can use it to evaluate both the original ONNX floating-point model and the converted (and/or quantized) ONNX fixed-point model.
1.1 Usage Instructions
$ onnxruntime_perf_test -h
perf_test [options...] model_path [result_file]
Options:
-m [test_mode]: Specifies the test mode. Value could be 'duration' or 'times'.
Provide 'duration' to run the test for a fix duration, and 'times' to repeated for a certain times.
-M: Disable memory pattern.
-A: Disable memory arena
-c [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default:1.
-e [cpu|spacemit]: Specifies the provider 'cpu', 'spacemit'. Default:'cpu'.
-r [repeated_times]: Specifies the repeated times if running in 'times' test mode.Default:1000.
-t [seconds_to_run]: Specifies the seconds to run for 'duration' mode. Default:600.
-p [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
-s: Show statistics result, like P75, P90. If no result_file provided this defaults to on.
-S: Given random seed, to produce the same input data. This defaults to -1(no initialize).
-v: Show verbose information.
-x [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes, A value of 0 means ORT will pick a default. Must >= 0.
-y [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes), A value of 0 means ORT will pick a default. Must >= 0.
-f [free_dimension_override]: Specifies a free dimension by name to override to a specific value for performance optimization. Syntax is [dimension_name:override_value]. override_value must > 0
-F [free_dimension_override]: Specifies a free dimension by denotation to override to a specific value for performance optimization. Syntax is [dimension_denotation:override_value]. override_value must > 0
-P: Use parallel executor instead of sequential executor.
-o [optimization level]: Default is 99 (all). Valid values are 0 (disable), 1 (basic), 2 (extended), 99 (all).
Please see onnxruntime_c_api.h (enum GraphOptimizationLevel) for the full list of all optimization levels.
-u [optimized_model_path]: Specify the optimized model path for saving.
-z: Set denormal as zero. When turning on this option reduces latency dramatically, a model may have denormals.
-T [Set intra op thread affinities]: Specify intra op thread affinity string
[Example]: -T 1, 2; 3, 4; 5, 6 or -T 1 - 2; 3 - 4; 5 - 6
Use semicolon to separate configuration between threads.
E.g. 1, 2; 3, 4; 5, 6 specifies affinities for three threads, the first thread will be attached to the first and second logical processor.
The number of affinities must be equal to intra_op_num_threads - 1
-D [Disable thread spinning]: disable spinning entirely for thread owned by onnxruntime intra - op thread pool.
-H: Maximum value to produce the random input data. This defaults to -1(as std::numeric_limits<T>::max() whenever the value given by this option less than value of '-L').
-L: Minimum value to produce the random input data. This defaults to 0.
-R: Count of random generated input test data. This defaults to 1 and must > 0.
-U: Maximum value to produce the random value of free dimensions which are not overriden. This defaults to 1. Specified value must > 0.
-V: Minimum value to produce the random value of free dimensions which are not overriden. This defaults to 1. Specified value must > 0.
-Z [Force thread to stop spinning between runs]: disallow thread from spinning during runs to reduce cpu usage.
-h: help
1.2 Parameter Description
Parameter | Necessary/Optional | Default Value | Description |
---|---|---|---|
-m | Optional | times | Test mode: run for a fixed duration ('duration', in seconds) or for a fixed number of times ('times'). Note: the upstream onnxruntime_perf_test tool defaults to 'duration' |
-M | Optional | None | Disable memory pattern |
-A | Optional | None | Disable memory arena |
-c | Optional | 1 | Number of parallel inferences (the number of session.run() triggers at the same time) |
-e | Optional | cpu | Provider(s) used for inference, separated by spaces when specifying more than one. Currently available EPs: cpu, spacemit |
-r | Optional | 1000 | Number of model inference tests in the fixed test times mode (for each session) |
-t | Optional | 600 | Model inference test time in the fixed test duration mode (for each session), unit: seconds |
-p | Optional | None | Profiling file path (default: disabled, non-empty: enabled) |
-s | Optional | ON | Print inference time statistics information (if the result file is not specified, it is enabled by default) |
-S | Optional | -1 | Random seed (default -1 means no random initialization of test data, 0 random seed, > 0 user-specified random seed) |
-v | Optional | None | Enable debugging information |
-x | Optional | 0 | Number of parallel threads within a single operator (default 0, i.e., the internal mechanism of onnxruntime) |
-y | Optional | 0 | Number of concurrent execution threads for multiple operators (default 0, i.e., the internal mechanism of onnxruntime) |
-f | Optional | None | Specify the value of the free dimension in the model input by the parameter name (string: see abs_free_dimensions.onnx), format key:value |
-F | Optional | None | Specify the value of the free dimension in the model input by the notation name (string: see abs_free_dimensions.onnx), format key:value |
-P | Optional | None | Enable parallel execution mode |
-o | Optional | 99 | Model optimization level |
-u | Optional | None | Save path of the optimized model |
-z | Optional | None | Same as session_options.AddConfigEntry(kOrtSessionOptionsConfigSetDenormalAsZero, "1") |
-T | Optional | None | Specify the affinity of the threads in the internal thread pool of onnxruntime |
-D | Optional | None | Completely disable spinning for the threads in the onnxruntime intra-op thread pool |
-H | Optional | -1 | Maximum value for randomly generated test data (if less than the minimum value, the maximum value of the corresponding data type is used by default) |
-L | Optional | 0 | Minimum value for randomly generated test data |
-R | Optional | 1 | Number of groups of randomly generated test data |
-U | Optional | 1 | Maximum random data value for free dimensions (usually batch size) |
-V | Optional | 1 | Minimum random data value for free dimensions (usually batch size) |
-Z | Optional | None | Forbid the thread from idling during runs to reduce CPU utilization |
-h, --help | Optional | None | Print the usage instructions |
1.3 Usage Example
Taking the onnxruntime/test/testdata/abs_free_dimensions.onnx model as an example:
1.3.1 Random Test Data
Run the test for a fixed 100 iterations with 10 groups of randomly generated test data, a fixed random seed of 1, a maximum random value of 6, and a minimum random value of 2:
$ MODEL=abs_free_dimensions.onnx
$ ARGS="${MODEL} ${MODEL%.onnx}.txt -m times -r 100 -R 10 -S 1 -H 6 -L 2"
$ onnxruntime_perf_test ${ARGS}
...
Session creation time cost: 0.0455992 s
First inference time cost: 0 ms
Total inference time cost: 0.00371454 s
Total inference requests: 100
Average inference time cost: 0.0371454 ms
Total inference run time: 0.00417042 s
Number of inferences per second: 23978.4
...
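1.3.2 Fixed Test Duration with the SpacemiT EP
To evaluate the on-chip performance with the SpacemiT EP, you can also run the tool in fixed-duration mode. The command below is a minimal sketch that follows the parameter table above; model.q.onnx is only a placeholder for your own (quantized) model, and the reported numbers will vary from board to board:
$ onnxruntime_perf_test -e spacemit -m duration -t 60 -s model.q.onnx result.txt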
2 Application Development
2.1 AI Support Library
2.1.1 Demo Introduction
The current Support Library Demo is located in the bianbu-ai-support directory in the deployment toolkit, and the relevant instructions and examples are as follows:
$ tree -L 3 /opt/spacemit-ai-sdk.v1.1.0/bianbu-ai-support/
/opt/spacemit-ai-sdk.v1.1.0/bianbu-ai-support/
├── bin // Precompiled executable programs
│ ├── classification_demo
│ ├── detection_demo
│ ├── detection_stream_demo
│ ├── detection_video_demo
│ ├── estimation_demo
│ └── tracker_stream_demo
├── demo // Demo cmake project
│ ├── CMakeLists.txt
│ ├── README.md
│ ├── build.sh // Quick compilation (and testing) script
│ ├── dataloader.hpp
│ ├── image_classification_demo.cc
│ ├── object_detection.hpp
│ ├── object_detection_demo.cc
│ ├── object_detection_stream_demo.cc
│ ├── object_detection_video_demo.cc
│ ├── pose_estimation.hpp
│ ├── pose_estimation_demo.cc
│ ├── pose_tracker_stream_demo.cc
│ └── utils
│ ├── cv_helper.hpp
│ ├── json.hpp
│ ├── json_helper.hpp
│ └── win_getopt
├── include // Preprocessing, postprocessing, auxiliary function and other modules
│ └── bianbuai
│ ├── task
│ └── utils
├── lib
│ ├── 3rdparty // Third-party dependency libraries
│ │ └── opencv4
│ ├── libbianbuai.so -> libbianbuai.so.1
│ ├── libbianbuai.so.1 -> libbianbuai.so.1.0.15
│ └── libbianbuai.so.1.0.15
└── share
└── ai-support // Pre-set resource data
├── imgs
├── models
└── videos
16 directories, 24 files
2.1.2 Demo Compilation
Cross-compilation
Cross-compilation is mainly applicable to the PC side (e.g. x86_64 development environment), and the process (example) is as follows:
# Specify the path of the spacemit-ai-sdk
$ SDK=${PATH_TO_SPACEMIT_AI_SDK}  # e.g. /opt/spacemit-ai-sdk.v1.1.0
# Specify the environment variables related to cross-compilation
$ CROSS_TOOL=$SDK/spacemit-gcc/bin/riscv64-unknown-linux-gnu-
$ SYSROOT=$SDK/spacemit-gcc/sysroot
$ BIANBUAI_HOME=$SDK/bianbu-ai-support
$ ORT_HOME=$SDK/spacemit-ort
$ OPENCV_DIR=$SDK/bianbu-ai-support/lib/3rdparty/opencv4/lib/cmake/opencv4
# Create the cmake working directory and compile the demo
$ cd ${BIANBUAI_HOME}/demo
$ mkdir build && pushd build
$ cmake .. -DBIANBUAI_HOME=${BIANBUAI_HOME} -DORT_HOME=${ORT_HOME} -DOpenCV_DIR=${OPENCV_DIR} -DCMAKE_C_COMPILER=${CROSS_TOOL}gcc -DCMAKE_CXX_COMPILER=${CROSS_TOOL}g++ -DCMAKE_SYSROOT=${SYSROOT}
$ make -j4
$ popd
Local Compilation
Local compilation is applicable to the chip side, and the process (example) is as follows:
# Specify the environment variables related to local compilation
$ CROSS_TOOL=
$ SYSROOT=
$ BIANBUAI_HOME=$SDK/bianbu-ai-support  # Specify the version in the latest sdk or the /usr directory
$ ORT_HOME=$SDK/spacemit-ort  # Specify the version in the latest sdk or the /usr directory
$ OPENCV_DIR=  # Specify the version in the latest sdk or automatically find it through find_package
# Create the cmake working directory and compile the demo
$ cd ${BIANBUAI_HOME}/demo
$ mkdir build && pushd build
$ cmake .. -DBIANBUAI_HOME=${BIANBUAI_HOME} -DORT_HOME=${ORT_HOME} -DOpenCV_DIR=${OPENCV_DIR} -DCMAKE_C_COMPILER=${CROSS_TOOL}gcc -DCMAKE_CXX_COMPILER=${CROSS_TOOL}g++ -DCMAKE_SYSROOT=${SYSROOT}
$ make -j4
$ popd
[Note] The above settings have been pre-configured in the demo/build.sh quick compilation script. You can adjust the relevant configuration (e.g. the ORT_HOME variable) by editing demo/build.sh, and then quickly verify the demo compilation with the bash build.sh (cross-compilation) and bash build.sh --native (local compilation) commands.
Quick Compilation
# One-click cross-compilation (e.g. spacemit-ai-sdk.v1.1.0 docker environment)
$ cd /opt/spacemit-ai-sdk.v1.1.0/bianbu-ai-support/demo
$ bash build.sh
2.1.3 Demo Running
- Simulation Configuration
For a cross-compiled demo program, you can use the qemu-riscv64 tool pre-installed in the deployment toolkit to run it on the PC side under emulation. The relevant configuration is as follows:
$ QEMU_CMD="$SDK/spacemit-qemu/bin/qemu-riscv64 -L $SYSROOT"
- Running Example
[Note] For the locally compiled demo program, you do not need to configure any environment variables.
# Create softlink to test resource if necessary
$ ln -sf ${BIANBUAI_HOME}/rootfs/usr/share/ai-support data
# Smoke test with image classification
$ env LD_LIBRARY_PATH=${ORT_HOME}/lib:$LD_LIBRARY_PATH ${QEMU_CMD} \
    build/classification_demo data/models/squeezenet1.1-7.onnx data/labels/synset.txt data/imgs/dog.jpg
# Smoke test with object detection
$ env LD_LIBRARY_PATH=${ORT_HOME}/lib:$LD_LIBRARY_PATH ${QEMU_CMD} \
    build/detection_demo data/models/nanodet-plus-m_320.onnx data/models/coco.txt data/imgs/person.jpg result0.jpg
[Note] The above commands have also been pre-configured in the demo/build.sh quick compilation script. You can run the above example (simulation test in the x86_64 docker environment) with the bash build.sh --test command:
[INFO] Building demos done.
[INFO] Prepare...
[INFO] Smoke test with image classification task
[INFO] Run: bld/classification_demo data/models/squeezenet1.1-7.onnx data/models/synset.txt data/imgs/dog.jpg
open tcm device failed(-1)
Enable spacemit ep now
tcm check param err--->fun:tcmmalloc_sync+line:164
Classify result: n02113023 Pembroke, Pembroke Welsh corgi
[INFO] Smoke test with object detection task...
[INFO] Run: bld/detection_demo data/models/nanodet-plus-m_320.onnx data/models/coco.txt data/imgs/person.jpg result0.jpg
open t
2.1.4 Demo Instructions
- classification_demo
Single-image classification demo: takes the path of an image as input and outputs the predicted category of the image (see the example after the parameter table below).
- Running Method
$ classification_demo
Usage:
classification_demo <model_path> <label_path> <image_path>
classification_demo <config_path> <image_path>
- Parameter Description
Parameter | Required/Optional | Default Value | Remarks |
---|---|---|---|
model_path | Required | None | Model file path |
label_path | Required | None | Label file path |
config_path | Required | None | Configuration file path |
image_path | Required | None | Image file path |
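- Example
Reusing the image-classification resources from the smoke test above (adjust the paths to your own environment):
$ classification_demo data/models/squeezenet1.1-7.onnx data/labels/synset.txt data/imgs/dog.jpg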
- detection_demo
Single-image object detection demo: takes the path of an input image and a save path, outputs the detection box information, and saves the annotated image to the target location (see the example after the parameter table below).
- Running Method
$ detection_demo
Usage:
detection_demo <model_path> <label_path> <image_path> <save_path>
detection_demo <config_path> <image_path> <save_path>
- Parameter Description
Parameter | Required/Optional | Default Value | Remarks |
---|---|---|---|
model_path | Required | None | Model file path |
label_path | Required | None | Label file path |
config_path | Required | None | Configuration file path |
image_path | Required | None | Image file path |
save_path | Required | None | Saved image file path |
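- Example
Reusing the object-detection resources from the smoke test above (adjust the paths to your own environment):
$ detection_demo data/models/nanodet-plus-m_320.onnx data/models/coco.txt data/imgs/person.jpg result0.jpg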
- detection_stream_demo
Video-stream object detection demo: takes a video file or a camera as input and displays the annotated frames in real time.
- Running Method
$ detection_stream_demo
Usage:
detection_stream_demo [-h <resize_height>] [-w <resize_width>] [-f] <model_path> <label_path> <input>
detection_stream_demo [-h <resize_height>] [-w <resize_width>] [-f] <config_path> <input>
- Parameter Description
Parameter | Required/Optional | Default Value | Remarks |
---|---|---|---|
model_path | Required | None | Model file path |
label_path | Required | None | Label file path |
config_path | Required | None | Configuration file path |
input | Required | None | Input content |
-w | Optional | 320 | Resized width |
-h | Optional | 320 | Resized height |
-f | Optional | None | Horizontal flip |
- detection_video_demo
Video object detection demo: takes the path of a video file, outputs detection box information in real time, and saves the annotated video (AVI format) to the target path.
- Running Method
$ detection_video_demo
Usage:
detection_video_demo <model_path> <label_path> <video_path> <save_path>(*.avi)
detection_video_demo <config_path> <video_path> <save_path>(*.avi)
- Parameter Description
Parameter | Required/Optional | Default Value | Remarks |
---|---|---|---|
model_path | Required | None | Model file path |
label_path | Required | None | Label file path |
config_path | Required | None | Configuration file path |
video_path | Required | None | Video file path (mp4, avi) |
save_path | Required | None | Saved video file path |
- estimation_demo
Single-image pose estimation demo: takes the path of an input image and a save path, and saves the image with keypoints drawn to the target location.
- Running Method
$ estimation_demo
Usage:
estimation_demo <detection_model_path> <detection_label_path> <pose_point_model_path> <image_path> <save_path>
estimation_demo <detection_config_path> <pose_point_config_path> <image_path> <save_path>
- Parameter Description
Parameter | Required/Optional | Default Value | Remarks |
---|---|---|---|
detection_model_path | Required | None | Object detection model file path |
detection_label_path | Required | None | Object detection label file path |
pose_point_model_path | Required | None | Pose model file path |
detection_config_path | Required | None | Object detection model configuration file path |
pose_point_config_path | Required | None | Pose model configuration file path |
image_path | Required | None | Image file path |
save_path | Required | None | Saved image file path |
- tracker_stream_demo
Video-stream pose tracking demo: takes a video file or a camera as input and displays the annotated frames in real time.
- Running Method
$ tracker_stream_demo
Usage:
tracker_stream_demo [-h <resize_height>] [-w <resize_width>] [-f] <detection_model_path> <detection_label_path> <pose_point_model_path> <input>
tracker_stream_demo [-h <resize_height>] [-w <resize_width>] [-f] <detection_config_path> <pose_point_config_path> <input>
- Parameter Description
Parameter | Required/Optional | Default Value | Remarks |
---|---|---|---|
detection_model_path | Required | None | Object detection model file path |
detection_label_path | Required | None | Object detection label file path |
pose_point_model_path | Required | None | Pose model file path |
detection_config_path | Required | None | Object detection model configuration file path |
pose_point_config_path | Required | None | Pose model configuration file path |
input | Required | None | Input content |
-w | Optional | 320 | Resized width |
-h | Optional | 320 | Resized height |
-f | Optional | None | Horizontal flip |
2.1.5 Description of Environment Variables
Environment Variable Name | Remarks |
---|---|
SUPPORT_SHOW | (stream demo) -1 means do not display the window |
SUPPORT_SHOWFPS | (stream demo) If set to any non-empty value, the FPS is displayed |
SUPPORT_PROFILING_PROJECTS | Path of the generated profiling file |
SUPPORT_LOG_LEVEL | Log level, range 0 - 4 |
SUPPORT_GRAPH_OPTIMIZATION_LEVEL | Graph optimization level (ort_disable_all, ort_enable_basic, ort_enable_extended, ort_enable_all) |
SUPPORT_OPT_MODEL_PATH | Path for saving the optimized model |
SUPPORT_DISABLE_SPACEMIT_EP | 1 means disable the spacemit EP |
SUPPORT_OPENCV_THREAD_NUM | Number of threads used by OpenCV (>= 4.x) |
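For example, a possible combination of these variables when running a stream demo (the file name stream_profile.json and the <input> placeholder are only illustrative):
$ export SUPPORT_SHOWFPS=1
$ export SUPPORT_LOG_LEVEL=2
$ export SUPPORT_PROFILING_PROJECTS=./stream_profile.json
$ detection_stream_demo data/models/nanodet-plus-m_320.onnx data/models/coco.txt <input>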
2.2 AI Engine
2.2.1 Introduction
SpacemiT-ORT includes the basic inference framework of ONNXRuntime (v1.15.1) and the SpaceMITExecutionProvider acceleration backend (hereinafter referred to as EP), and its usage is almost the same as the public version of ONNXRuntime.
2.2.2 QuickStart
- C & C++
#include <onnxruntime_cxx_api.h>
#include "spacemit_ort_env.h"
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ort-demo");
Ort::SessionOptions session_options;
// Set the number of inference threads
//int64_t num_threads = 2;
//session_options.SetIntraOpNumThreads(num_threads);
std::unordered_map<std::string, std::string> provider_options;
// provider_options["SPACEMIT_EP_DISABLE_OP_TYPE_FILTER"] = "OPA;OPB;OPC"; Disable EP from inferring certain OP types, node.op
// provider_options["SPACEMIT_EP_DISABLE_OP_NAME_FILTER"] = "OPA;OPB;OPC"; Disable EP from inferring certain named OPs, node.name
SessionOptionsSpaceMITEnvInit(session_options, provider_options); // Optional loading of SpaceMIT environment initialization
Ort::Session session(env, net_param_path, session_options);
//...Subsequent steps are consistent with the public version of ORT
- Python
# Install using the whl package
# pip install spacemit_ort-*.whl
# On the riscv64 platform, if a warning is encountered, add --break-system-packages
# The whl package does not automatically install its dependencies; numpy needs to be installed separately
# On the riscv64 platform, install it with: apt install python3-numpy
import onnxruntime as ort
import numpy as np
import spacemit_ort
eps = ort.get_available_providers()  # list the available execution providers
net_param_path = "resnet18.q.onnx"
sess_options = ort.SessionOptions()
# Set the number of threads
# sess_options.intra_op_num_threads = 2
# Set the log level
# sess_options.log_severity_level = 1
# Session with ep
session = ort.InferenceSession(net_param_path, sess_options, providers=["SpaceMITExecutionProvider"])
# Session without ep
# Because there are 2 EPs, it needs to be specified explicitly
ref_session = ort.InferenceSession(net_param_path, sess_options, providers=["CPUExecutionProvider"])
input_tensor = np.ones((1, 3, 224, 224), dtype=np.float32)
input_name = session.get_inputs()[0].name
output_names = [output.name for output in session.get_outputs()]
outputs = session.run(output_names, {input_name: input_tensor})
ref_outputs = ref_session.run(output_names, {input_name: input_tensor})
# The error between outputs and ref_outputs is generally within 1e-5
2.2.3 Custom Operators plugins
Use the method of extending custom operators in the native onnxruntime. For the original text, please refer to https://onnxruntime.ai/docs/reference/operators/add-custom-op.html
#include "onnxruntime_cxx_api.h"
struct CustomKernel {
CustomKernel (const OrtKernelInfo* info);
void Compute(OrtKernelContext* context);
};
struct CustomOp : Ort::CustomOpBase<CustomOp, CustomKernel> {
explicit CustomOp ();
void* CreateKernel(const OrtApi&, const OrtKernelInfo*) const;
const char* GetName() const { return "custom op"; };
const char* GetExecutionProviderType() const { return "CPUExecutionProvider"; };
size_t GetInputTypeCount() const { return 1; };
ONNXTensorElementDataType GetInputType(size_t) const { return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED; };
OrtCustomOpInputOutputCharacteristic GetInputCharacteristic(size_t) const { return OrtCustomOpInputOutputCharacteristic::INPUT_OUTPUT_OPTIONAL; };
size_t GetOutputTypeCount() const { return 1; };
ONNXTensorElementDataType GetOutputType(size_t) const { return ONNX_TENSOR_ELEMENT_DATA_TYPE_UNDEFINED; };
OrtCustomOpInputOutputCharacteristic GetOutputCharacteristic(size_t) const { return OrtCustomOpInputOutputCharacteristic::INPUT_OUTPUT_OPTIONAL; };
};
// Declare the custom operator domain and add the custom operator to the session_options
static const char* c_OpDomain = "user.custom_domain";
Ort::CustomOpDomain domain{c_OpDomain};
static CustomOp custom_op;
domain.Add(&custom_op);
session_options.Add(domain);
2.2.4 Operator Accelerate List
Op Type | Domain | Version | Attributes | Type | Notes | schema |
---|---|---|---|---|---|---|
Conv | ai.onnx | 1, 11 | kernel_shape: limited to two dimensions | T: tensor(float) | tensor(float16) | QLinearConv |
ConvTranspose | ai.onnx | 1, 11 | kernel_shape: limited to two dimensions | T: tensor(float) | tensor(float16) | QLinearConvTranspose |
QlinearMatMul | ai.onnx | 10 | T1: tensor(int8)\nT2: tensor(int8)\nT3: tensor(int8) | Only supports PerTensor quantization, only supports MatMul where B is a constant; weight quantization only supports symmetric quantization | https://onnx.ai/onnx/operators/onnx__QLinearMatMul.html | |
Gemm | ai.onnx | 1, 6, 7, 9, 11, 13 | alpha: limited to 1.0\nbeta: limited to 1.0 | T: tensor(float) | https://onnx.ai/onnx/operators/onnx__Gemm.html | |
QGemm | com.microsoft | 1 | alpha: limited to 1.0\nbeta: limited to 1.0 | T: tensor(float)\nTA: tensor(int8)\nTB: tensor(int8)\nTC: tensor(int8)\nTYZ: tensor(int8)\nTY: tensor(int8) | Only supports PerTensor quantization, only supports constant Gemm; weight quantization only supports symmetric quantization | https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QGemm |
AveragePool | ai.onnx | 1, 7, 10, 11, 19 | kernel_shape: limited to two dimensions\ncount_include_pad: limited to 1 | T: tensor(float) | QLinearAveragePool | com.microsoft |
GlobalAveragePool | ai.onnx | 1 | T: tensor(float) | QLinearGlobalAveragePool | com.microsoft | 1 |
MaxPool | ai.onnx | 11, 12 | kernel_shape: limited to two dimensions | T: tensor(float) | tensor(int8) | |
QuantizeLinear | ai.onnx | 10, 13, 19 | T1: tensor(float)\nT2: tensor(int8) | tensor(int16) | ||
DequantizeLinear | ai.onnx | 10, 13, 19 | T1: tensor(int8) | tensor(int16) | tensor(int32)\nT2: tensor(float) | |
Add | ai.onnx | 1, 6, 7, 13, 14 | T: tensor(float) | QLinearAdd | com.microsoft | 1 |
Sub | ai.onnx | 1, 6, 7, 13, 14 | T: tensor(float) | |||
Mul | ai.onnx | 1, 6, 7, 13, 14 | T: tensor(float) | https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QLinearMul | ||
QLinearMul | com.microsoft | 1 | T: tensor(int8) | |||
Div | ai.onnx | 1, 6, 7, 13, 14 | T: tensor(float) | |||
Sigmoid | ai.onnx | 1, 6, 13 | T: tensor(float) | QLinearSigmoid | com.microsoft | 1 |
HardSigmoid | ai.onnx | 1, 6 | T: tensor(float) | QLinearHardSigmoid | spacemit_ops | 1 |
HardSwish | ai.onnx | 14 | T: tensor(float) | QLinearHardSwish | spacemit_ops | 1 |
LeakyRelu | ai.onnx | 1, 6, 16 | T: tensor(float) | QLinearLeakyRelu | com.microsoft | 1 |
Transpose | ai.onnx | 1, 13 | T: tensor(int8) | tensor(uint8) | ||
Cast | ai.onnx | 1, 6, 9, 13, 19 | T1: tensor(float) | tensor(float16)\nT2: tensor(float) | tensor(float16) | https://onnx.ai/onnx/operators/onnx__Cast.html |
ReduceMean | ai.onnx | 11, 13 | axes: limited to [2, 3] | T: tensor(float) | QLinearReduceMean | com.microsoft |
QLinearGelu | spacemit_ops | 1 | T: tensor(int8) | |||
QLinearLayerNormalization | spacemit_ops | 1 | T: tensor(int8) | |||
LayerNormalization | ai.onnx\ncom.microsoft | 17\n1 | T: tensor(float) | |||
Gelu | com.microsoft | 1 | T: tensor(float) |
2.2.5 Inference Sample
To make it easier for users to get started, we provide corresponding inference samples; they are included in the SDK package under spacemit-ort/samples.
3 Frequently Asked Questions (FAQ)
Everyone is welcome to ask questions
3.1 How to view the profiling information of the model inference?
You can refer to the original instructions. https://onnxruntime.ai/docs/performance/tune-performance/profiling-tools.html
#include <onnxruntime_cxx_api.h>
#include "spacemit_ort_env.h"
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ort-demo");
Ort::SessionOptions session_options;
std::unordered_map<std::string, std::string> provider_options;
std::string profile_path = "ort-demo-profile";
// Enable profiling
session_options.EnableProfiling(profile_path.c_str());
std::string opt_net_path = "ort-demo-opt.onnx";
// Enable saving the optimized ONNX model, which can only be used on the current specific platform
session_options.SetOptimizedModelFilePath(opt_net_path.c_str());
SessionOptionsSpaceMITEnvInit(session_options, provider_options);
Ort::Session session(env, net_param_path, session_options);
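The profiling data is written to a JSON file whose name starts with the given prefix; the file uses the Chrome trace format and can be opened with a chrome://tracing-compatible viewer, as described in the ONNXRuntime documentation linked above.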
3.2 How to save the layer-by-layer results during model running?
The dump function of the ONNX model node output Tensor is controlled by a series of environment variables. Here are the explanations of the commonly used environment variables that may be used.
Environment Variable Name | Meaning | Value |
---|---|---|
ORT_DEBUG_NODE_IO_DUMP_SHAPE_DATA | Print the Shape information of the Tensor at the input and output of the node | 0, 1, default is 0 |
ORT_DEBUG_NODE_IO_DUMP_NODE_PLACEMENT | Print the EP information of the node | 0, 1, default is 0 |
ORT_DEBUG_NODE_IO_DUMP_INPUT_DATA | Dump the data of the input Tensor of the node | 0, 1, default is 0 |
ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA | Dump the data of the output Tensor of the node | 0, 1, default is 0 |
ORT_DEBUG_NODE_IO_NAME_FILTER | Filter the name of the dump node | A string separated by semicolons, default is empty |
ORT_DEBUG_NODE_IO_OP_TYPE_FILTER | Filter the type of the dump node | A string separated by semicolons, default is empty |
ORT_DEBUG_NODE_IO_DUMP_DATA_DESTINATION | Export type of the input and output Tensor of the dump node | The string "stdout" or "files" or "sqlite", generally choose files |
ORT_DEBUG_NODE_IO_OUTPUT_DIR | File storage location of the input and output Tensor of the dump node | String |
ORT_DEBUG_NODE_IO_DUMPING_DATA_TO_FILES_FOR_ALL_NODES_IS_OK | Confirm whether to export all Tensors | 0, 1, default is 0 |
export ORT_DEBUG_NODE_IO_DUMP_SHAPE_DATA=1
export ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA=1
export ORT_DEBUG_NODE_IO_DUMP_DATA_DESTINATION=files
# Specify the directory to export the Tensor file
export ORT_DEBUG_NODE_IO_OUTPUT_DIR=./dump_dir
export ORT_DEBUG_NODE_IO_DUMPING_DATA_TO_FILES_FOR_ALL_NODES_IS_OK=1
export ORT_DEBUG_NODE_IO_DUMP_NODE_PLACEMENT=1
export ORT_DEBUG_NODE_IO_APPEND_RANK_TO_FILE_NAME=1
# export ORT_DEBUG_NODE_IO_OP_TYPE_FILTER="QLinearConv;QLinearGlobalAveragePool"
rm -rf ./dump_dir
mkdir -p ./dump_dir
# Execute the demo or your program to obtain
./run_demo resnet18 resnet18.q.onnx
Console output
QLinearConv node: SpaceMITExecutionProvider_QLinearConv_20
Input 0 Name: PPQ_Operation_141
Shape: {1,7,7,512}
Input 1 Name: ortshared_1_0_1_2_token_254
Shape: {}
Input 2 Name: PPQ_Variable_373
Shape: {}
Input 3 Name: onnx::Conv_250
Shape: {512,512,3,3}
Input 4 Name: PPQ_Variable_375
Shape: {512}
Input 5 Name: PPQ_Variable_376
Shape: {512}
Input 6 Name: ortshared_1_0_1_3_token_255
Shape: {}
Input 7 Name: PPQ_Variable_382
Shape: {}
Input 8 Name: onnx::Conv_251
was missing data type
Placement: SpaceMITExecutionProvider
-----------
Output 0 Name: PPQ_Operation_145
Shape: {1,7,7,512}
Placement: SpaceMITExecutionProvider
-----------
QLinearGlobalAveragePool node: SpaceMITExecutionProvider_QLinearGlobalAveragePool_21
Input 0 Name: PPQ_Operation_147
Shape: {1,7,7,512}
Input 1 Name: ortshared_1_0_1_0_token_252
Shape: {}
Input 2 Name: PPQ_Variable_391
Shape: {}
Input 3 Name: ortshared_1_0_1_1_token_253
Shape: {}
Input 4 Name: PPQ_Variable_394
Shape: {}
Placement: SpaceMITExecutionProvider
-----------
Output 0 Name: PPQ_Operation_149
Shape: {1,1,1,512}
All outputs of the specified node types can then be found in the ./dump_dir directory, stored in TensorProto format.
3.3 How to set multi-threading and thread affinity?
You can refer to the original ONNXRuntime document to set the thread affinity. Due to the particularity of the architecture, threads 0-3 cannot be assigned affinity manually and are managed by the EP itself.
https://onnxruntime.ai/docs/performance/tune-performance/threading.html#set-intra-op-thread-affinity
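For reference, a minimal Python sketch of setting the intra-op thread count and affinities through the session options; the configuration key comes from the public ONNXRuntime API, while the logical-processor ids in the affinity string are only illustrative (the exact format and numbering are described in the linked document):
import onnxruntime as ort
import spacemit_ort

so = ort.SessionOptions()
so.intra_op_num_threads = 4  # 1 main thread + 3 additional intra-op worker threads
# One affinity entry per additional worker thread (intra_op_num_threads - 1 entries)
so.add_session_config_entry("session.intra_op_thread_affinities", "5;6;7")
session = ort.InferenceSession("resnet18.q.onnx", so, providers=["SpaceMITExecutionProvider"])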
3.4 Do you need to pay attention to the Layout memory arrangement of the Tensor?
The inference library completely follows the definition of the Tensor by ONNXRuntime, that is, the memory layout of NCHW is consistent with the shape description.
3.5 Using QLinear operators in models
The ONNX operator set includes some official QLinear operators. These can be used directly when the model shapes are static; in other cases, try to use a quantized model in the QDQ format.
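If you need to produce a QDQ-format quantized model yourself, one possible route is the public onnxruntime quantization tool. The sketch below is only an illustration, with placeholder file and input names and random calibration data; where the SDK provides its own model conversion/quantization toolchain, that remains the recommended path:
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantFormat, QuantType, quantize_static

class RandomDataReader(CalibrationDataReader):
    # Feeds a few random NCHW tensors as calibration data (replace with real data).
    def __init__(self, input_name, num_samples=8):
        self.data = iter(
            [{input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)} for _ in range(num_samples)]
        )

    def get_next(self):
        return next(self.data, None)

quantize_static(
    "resnet18.onnx",            # float model (placeholder name)
    "resnet18.qdq.onnx",        # quantized output in QDQ format
    RandomDataReader("input"),  # replace "input" with the model's actual input name
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)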