
Improvements to the INT8 GEMM portion of the code for Power #20595

Open · wants to merge 13 commits into main

Conversation


@ChipKerchner ChipKerchner commented May 7, 2024

These changes improve the INT8 GEMM portion of the code for Power.

There are 3 main code changes:

  1. Change a function argument to a template parameter so that operations that add/subtract zero are eliminated at compile time. Also reuse a vector that holds the mask instead of rebuilding it each time.
  2. Process 16 columns at a time in MlasGemmQuantCopyPackB8x8; this should reduce potential page faults by a factor of 4 and also be faster.
  3. Unroll MlasQgemmStoreVectorMMA and vectorize other variables.
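Change (1) can be sketched as follows. This is a minimal illustration of the template-parameter technique only; the names `AccumulateRow`, `StoreRow`, and `ZeroMode` are invented for this sketch and do not correspond to the actual MLAS sources.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: hoisting a runtime flag into a template parameter.
// With ZeroMode known at compile time, the compiler instantiates two loop
// bodies and drops the dead branch in each, so no add of zero survives.
template <bool ZeroMode>
static void AccumulateRow(int32_t* C, const int32_t* Acc, size_t N) {
    for (size_t n = 0; n < N; ++n) {
        if constexpr (ZeroMode) {
            C[n] = Acc[n];   // first pass: plain store, no accumulate
        } else {
            C[n] += Acc[n];  // later passes: accumulate into C
        }
    }
}

// The runtime dispatch happens once, outside the hot loop.
void StoreRow(int32_t* C, const int32_t* Acc, size_t N, bool zeroMode) {
    if (zeroMode) {
        AccumulateRow<true>(C, Acc, N);
    } else {
        AccumulateRow<false>(C, Acc, N);
    }
}
```

The same pattern applies to any per-element flag that is invariant across a kernel invocation: pay one branch per call instead of one per element.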

@ChipKerchner ChipKerchner requested a review from a team as a code owner May 7, 2024 13:09
@yuslepukhin
Member

Cc: @chenfucn

@ChipKerchner
Author

Questions about the lint warnings: do all lines have to be less than 120 characters? And how should statements that span multiple lines be written?

@yuslepukhin
Member

Typically, we employ clang-format. I use a visual cue in the IDE, break long lines manually, then run the formatter again.

@ChipKerchner
Author

ChipKerchner commented May 8, 2024

Which style should I be using for clang-format? microsoft?

It seems the formatter wants to change a lot of code that I did not alter.

@yuslepukhin
Member

> Which style should I be using for clang-format? microsoft?
>
> It seems the formatter wants to change a lot of code that I did not alter.

Your editor should pick this up automatically.
https://github.com/microsoft/onnxruntime/blob/main/.clang-format

@yufenglee
Member

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@yufenglee
Member

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline

@yufenglee
Member

/azp run iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 2 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).


@ChipKerchner
Author

I use vim as my editor. I'm not sure it will pick up lint formatting.

@yuslepukhin
Member

> I use vim as my editor. I'm not sure it will pick up lint formatting.

Most of the failures are about extra spaces. Many editors can show non-visible characters.

Contributor

@edgchen1 edgchen1 left a comment


Can you share performance measurements to show how much improvement there is?

Three resolved review comments on onnxruntime/core/mlas/lib/power/qgemm_kernel_power10.cpp (outdated).
@ChipKerchner
Author

I'm seeing about a 2.6-4X improvement for PackB.

@yufenglee
Member

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@yufenglee
Member

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 10 pipeline(s).


@ChipKerchner changed the title from "Improvements to the GEMM portion of the code for Power" to "Improvements to the INT8 GEMM portion of the code for Power" on May 16, 2024
@ChipKerchner
Author

Can we move forward with this PR?

@yuslepukhin
Member

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@yuslepukhin
Member

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 10 pipeline(s).

