[Performance] Inference takes longer when session.Run() is being run on different threads and each thread has its own session #20599
Comments
You don't have to create a session for each thread. Just create one session and call Run() on it from multiple threads using the same session object. Run() is safe to invoke concurrently.
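For illustration, a minimal sketch of the shared-session pattern described above. The model path, input/output names, and tensor shape are hypothetical placeholders, not taken from the issue; substitute your own model's values.

#include <onnxruntime/onnxruntime_cxx_api.h>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared_session");
  Ort::SessionOptions opts;
  // One session, shared by all threads.
  Ort::Session session(env, "model.onnx", opts);

  auto worker = [&session]() {
    // Per-thread input buffer and tensor (shape is a placeholder).
    std::vector<float> data(1 * 3 * 416 * 416, 0.f);
    std::vector<int64_t> shape{1, 3, 416, 416};
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input = Ort::Value::CreateTensor<float>(
        mem, data.data(), data.size(), shape.data(), shape.size());
    const char* in_names[] = {"input"};
    const char* out_names[] = {"output"};
    for (int i = 0; i < 1000; ++i) {
      // Run() may be called concurrently on the same session object.
      session.Run(Ort::RunOptions{nullptr}, in_names, &input, 1, out_names, 1);
    }
  };

  std::thread t1(worker), t2(worker), t3(worker);
  t1.join(); t2.join(); t3.join();
}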
I tried both, and there is performance degradation in both cases. To reproduce the numbers directly, here is the code:

#include <onnxruntime/onnxruntime_cxx_api.h>
#include <onnxruntime/onnxruntime_session_options_config_keys.h>
#include <chrono>
#include <cstring>
#include <iostream>
#include <thread>
#include <vector>

// Element counts, node names, and shapes of the three input feature maps.
std::vector<size_t> tensor_sizes{689520, 172380, 43095};
std::vector<const char*> in_node_names = {"/model/model.77/m.0/Conv_output_0",
                                          "/model/model.77/m.1/Conv_output_0",
                                          "/model/model.77/m.2/Conv_output_0"};
std::vector<const char*> out_node_names = {"output"};
std::vector<std::vector<int64_t>> node_dims = {{1, 255, 52, 52}, {1, 255, 26, 26}, {1, 255, 13, 13}};
std::vector<Ort::Value> inputTensors;
size_t num_input_nodes = 3;
size_t num_output_nodes = 1;
Ort::Session* session;

// Runs 1000 iterations of session->Run() and prints the elapsed time in milliseconds.
void run_inference() {
  std::chrono::milliseconds start =
      std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());
  for (int i = 0; i < 1000; ++i) {
    std::vector<Ort::Value> outputTensors =
        session->Run(Ort::RunOptions{nullptr},
                     in_node_names.data(), inputTensors.data(), num_input_nodes,
                     out_node_names.data(), num_output_nodes);
  }
  std::chrono::milliseconds duration =
      std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()) - start;
  std::cout << duration.count() << std::endl;
}

int main() {
  // Session options: sequential execution, one intra-op and one inter-op thread,
  // spinning disabled, memory pattern/arena disabled, per-session thread pools disabled.
  Ort::SessionOptions sessionOptions;
  sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
  sessionOptions.DisableMemPattern();
  sessionOptions.DisableCpuMemArena();
  sessionOptions.DisableProfiling();
  sessionOptions.DisablePerSessionThreads();
  sessionOptions.AddConfigEntry(kOrtSessionOptionsConfigAllowIntraOpSpinning, "0");
  sessionOptions.AddConfigEntry(kOrtSessionOptionsConfigAllowInterOpSpinning, "0");
  sessionOptions.SetIntraOpNumThreads(1);
  sessionOptions.SetInterOpNumThreads(1);
  sessionOptions.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
  sessionOptions.SetLogSeverityLevel(OrtLoggingLevel::ORT_LOGGING_LEVEL_FATAL);

  // Environment with global thread pools (1 intra-op thread, 1 inter-op thread, spinning off).
  OrtEnv* environment;
  OrtThreadingOptions* envOpts;
  const OrtApi& g_ort = Ort::GetApi();
  g_ort.CreateThreadingOptions(&envOpts);
  g_ort.SetGlobalSpinControl(envOpts, 0);
  g_ort.SetGlobalInterOpNumThreads(envOpts, 1);
  g_ort.SetGlobalIntraOpNumThreads(envOpts, 1);
  g_ort.CreateEnvWithGlobalThreadPools(ORT_LOGGING_LEVEL_FATAL, "ort_logger", envOpts, &environment);

  char* model = strdup("Yolo_test");
  char* model_path = strdup("yolov7-tiny_416_post.onnx");
  Ort::Env* env = new Ort::Env(environment);
  session = new Ort::Session(*env, model_path, sessionOptions);

  // Build one CPU tensor per input node from raw float buffers.
  Ort::MemoryInfo memoryInfo = Ort::MemoryInfo::CreateCpu(
      OrtAllocatorType::OrtArenaAllocator, OrtMemType::OrtMemTypeDefault);
  std::vector<float*> ifmap;
  ifmap.push_back(new float[689520]);
  ifmap.push_back(new float[172380]);
  ifmap.push_back(new float[43095]);
  for (size_t i = 0; i < num_input_nodes; i++) {
    inputTensors.emplace_back(Ort::Value::CreateTensor<float>(
        memoryInfo, ifmap[i], tensor_sizes[i],
        node_dims[i].data(), node_dims[i].size()));
  }

  // Spawn three threads that all call Run() on the same session.
  std::thread t1 = std::thread(&run_inference);
  std::thread t2 = std::thread(&run_inference);
  std::thread t3 = std::thread(&run_inference);
  t3.join();
  t1.join();
  t2.join();
}
Describe the issue
I need to run inference of the same model on multiple threads. For this, I create multiple sessions and call session.Run() in separate threads. The time to run 10000 iterations is as follows:
num-streams=1, duration = 35s/stream
num-streams=2, duration = 37s/stream
num-streams=3, duration = 48s/stream
Going from 2 streams to 3 streams there is a massive difference. The model is very small and runs on the CPU. The CPU has 8 physical cores (16 logical cores), so it should easily handle 3 streams.
The environment and session options used are shown in the code above.
To reproduce
Create a cpp file with the ONNX Runtime environment and multiple sessions using the options above. Run inference on the attached model, with each session.Run() call happening in a different spawned thread. A per-thread-session sketch follows.
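For reference, a minimal sketch of the per-thread-session variant described here (one session per spawned thread). It assumes the same model file as above; the session options are left at defaults for brevity, and the worker body is only outlined.

#include <onnxruntime/onnxruntime_cxx_api.h>
#include <functional>
#include <thread>
#include <vector>

// Each thread owns its own Ort::Session built from the same Ort::Env and options.
void run_stream(Ort::Env& env, const Ort::SessionOptions& opts, const char* model_path) {
  Ort::Session session(env, model_path, opts);  // per-thread session
  // ... build input tensors and call session.Run() in a loop, as in the code above ...
}

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "multi_session");
  Ort::SessionOptions opts;
  const char* model_path = "yolov7-tiny_416_post.onnx";

  // One stream per thread; 3 streams as in the measurements above.
  std::vector<std::thread> streams;
  for (int i = 0; i < 3; ++i)
    streams.emplace_back(run_stream, std::ref(env), std::cref(opts), model_path);
  for (auto& t : streams) t.join();
}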
Urgency
No response
Platform
Linux
OS Version
22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.1
ONNX Runtime API
C++
Architecture
X86
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
yolov7-tiny_416_post.zip
Is this a quantized model?
Yes