[Performance] Inference takes longer when session.Run() is being run on different threads and each thread has its own session #20599
Comments
You don't have to create a session for each thread. Just create one session and call Run() on it from multiple threads using the same session object. Run() is safe to invoke concurrently.
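For illustration, a minimal sketch of the shared-session pattern described above. The model path, input/output names, and tensor shape are hypothetical placeholders, not taken from the issue; substitute your own model's values.

#include <onnxruntime/onnxruntime_cxx_api.h>
#include <thread>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "shared_session");
  Ort::SessionOptions opts;
  // One session, shared by all threads.
  Ort::Session session(env, "model.onnx", opts);

  auto worker = [&session]() {
    // Per-thread input buffer and tensor (shape is a placeholder).
    std::vector<float> data(1 * 3 * 416 * 416, 0.f);
    std::vector<int64_t> shape{1, 3, 416, 416};
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input = Ort::Value::CreateTensor<float>(
        mem, data.data(), data.size(), shape.data(), shape.size());
    const char* in_names[] = {"input"};
    const char* out_names[] = {"output"};
    for (int i = 0; i < 1000; ++i) {
      // Run() may be called concurrently on the same session object.
      session.Run(Ort::RunOptions{nullptr}, in_names, &input, 1, out_names, 1);
    }
  };

  std::thread t1(worker), t2(worker), t3(worker);
  t1.join(); t2.join(); t3.join();
}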
I tried both, and there is performance degradation in both cases. To reproduce the numbers directly, here is the code:

#include <onnxruntime/onnxruntime_cxx_api.h>
#include <onnxruntime/onnxruntime_session_options_config_keys.h>
#include <chrono>
#include <cstring>
#include <iostream>
#include <thread>
#include <vector>

// Element counts, node names, and shapes of the three input feature maps.
std::vector<size_t> tensor_sizes{689520, 172380, 43095};
std::vector<const char*> in_node_names = {"/model/model.77/m.0/Conv_output_0",
                                          "/model/model.77/m.1/Conv_output_0",
                                          "/model/model.77/m.2/Conv_output_0"};
std::vector<const char*> out_node_names = {"output"};
std::vector<std::vector<int64_t>> node_dims = {{1, 255, 52, 52}, {1, 255, 26, 26}, {1, 255, 13, 13}};
std::vector<Ort::Value> inputTensors;
size_t num_input_nodes = 3;
size_t num_output_nodes = 1;
Ort::Session* session;

// Runs 1000 iterations of session->Run() and prints the elapsed time in milliseconds.
void run_inference() {
  std::chrono::milliseconds start =
      std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());
  for (int i = 0; i < 1000; ++i) {
    std::vector<Ort::Value> outputTensors =
        session->Run(Ort::RunOptions{nullptr},
                     in_node_names.data(), inputTensors.data(), num_input_nodes,
                     out_node_names.data(), num_output_nodes);
  }
  std::chrono::milliseconds duration =
      std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()) - start;
  std::cout << duration.count() << std::endl;
}

int main() {
  // Session options: sequential execution, one intra-op and one inter-op thread,
  // spinning disabled, memory pattern/arena disabled, per-session thread pools disabled.
  Ort::SessionOptions sessionOptions;
  sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
  sessionOptions.DisableMemPattern();
  sessionOptions.DisableCpuMemArena();
  sessionOptions.DisableProfiling();
  sessionOptions.DisablePerSessionThreads();
  sessionOptions.AddConfigEntry(kOrtSessionOptionsConfigAllowIntraOpSpinning, "0");
  sessionOptions.AddConfigEntry(kOrtSessionOptionsConfigAllowInterOpSpinning, "0");
  sessionOptions.SetIntraOpNumThreads(1);
  sessionOptions.SetInterOpNumThreads(1);
  sessionOptions.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
  sessionOptions.SetLogSeverityLevel(OrtLoggingLevel::ORT_LOGGING_LEVEL_FATAL);

  // Environment with global thread pools (1 intra-op thread, 1 inter-op thread, spinning off).
  OrtEnv* environment;
  OrtThreadingOptions* envOpts;
  const OrtApi& g_ort = Ort::GetApi();
  g_ort.CreateThreadingOptions(&envOpts);
  g_ort.SetGlobalSpinControl(envOpts, 0);
  g_ort.SetGlobalInterOpNumThreads(envOpts, 1);
  g_ort.SetGlobalIntraOpNumThreads(envOpts, 1);
  g_ort.CreateEnvWithGlobalThreadPools(ORT_LOGGING_LEVEL_FATAL, "ort_logger", envOpts, &environment);

  char* model = strdup("Yolo_test");
  char* model_path = strdup("yolov7-tiny_416_post.onnx");
  Ort::Env* env = new Ort::Env(environment);
  session = new Ort::Session(*env, model_path, sessionOptions);

  // Build one CPU tensor per input node from raw float buffers.
  Ort::MemoryInfo memoryInfo = Ort::MemoryInfo::CreateCpu(
      OrtAllocatorType::OrtArenaAllocator, OrtMemType::OrtMemTypeDefault);
  std::vector<float*> ifmap;
  ifmap.push_back(new float[689520]);
  ifmap.push_back(new float[172380]);
  ifmap.push_back(new float[43095]);
  for (size_t i = 0; i < num_input_nodes; i++) {
    inputTensors.emplace_back(Ort::Value::CreateTensor<float>(
        memoryInfo, ifmap[i], tensor_sizes[i],
        node_dims[i].data(), node_dims[i].size()));
  }

  // Spawn three threads that all call Run() on the same session.
  std::thread t1 = std::thread(&run_inference);
  std::thread t2 = std::thread(&run_inference);
  std::thread t3 = std::thread(&run_inference);
  t3.join();
  t1.join();
  t2.join();
}
Describe the issue
I need to run inference of the same model on multiple threads. For this, I create multiple sessions and call session.Run() in separate threads. The time to run 10000 iterations is as follows:
num-streams=1, duration = 35s/stream
num-streams=2, duration = 37s/stream
num-streams=3, duration = 48s/stream
Going from 2 streams to 3 streams there is a massive difference. The model is very small and runs on the CPU. The CPU has 8 physical cores (16 logical cores), so it should easily handle 3 streams.
The environment and session options used are shown in the code above.
To reproduce
Create a cpp file with the ONNX Runtime environment and multiple sessions using the options above. Run inference on the attached model, with each session.Run() call happening in a different spawned thread. A per-thread-session sketch follows.
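For reference, a minimal sketch of the per-thread-session variant described here (one session per spawned thread). It assumes the same model file as above; the session options are left at defaults for brevity, and the worker body is only outlined.

#include <onnxruntime/onnxruntime_cxx_api.h>
#include <functional>
#include <thread>
#include <vector>

// Each thread owns its own Ort::Session built from the same Ort::Env and options.
void run_stream(Ort::Env& env, const Ort::SessionOptions& opts, const char* model_path) {
  Ort::Session session(env, model_path, opts);  // per-thread session
  // ... build input tensors and call session.Run() in a loop, as in the code above ...
}

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "multi_session");
  Ort::SessionOptions opts;
  const char* model_path = "yolov7-tiny_416_post.onnx";

  // One stream per thread; 3 streams as in the measurements above.
  std::vector<std::thread> streams;
  for (int i = 0; i < 3; ++i)
    streams.emplace_back(run_stream, std::ref(env), std::cref(opts), model_path);
  for (auto& t : streams) t.join();
}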
Urgency
No response
Platform
Linux
OS Version
22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.1
ONNX Runtime API
C++
Architecture
X86
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
yolov7-tiny_416_post.zip
Is this a quantized model?
Yes