
Performance improvement to improve training time #11718

Open · wants to merge 52 commits into main

Conversation

edkazcarlson

@edkazcarlson edkazcarlson commented May 7, 2024

In this PR I clean up some existing code to help improve training times.
The main speedup comes from not running transforms that are passed a 0% chance of applying, but I also introduced jit-compiled methods with the numba library to handle some operations that are run frequently.
If the team doesn't want to use numba (code clutter, licensing, etc.), I'll remove it; while the jit-compiled methods did introduce some speedup, the majority of the gain came from the changes to the image transformations.
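To illustrate the numba pattern, here is a hypothetical sketch (not the PR's actual code): the helper mirrors the shape of the `xywh2xyxy` utility in `ultralytics/utils/ops.py`, and the import fallback is my own addition so the code stays runnable when numba is absent.

```python
import numpy as np

try:
    from numba import njit  # optional: jit-compile hot numeric loops
except ImportError:  # fall back to plain Python if numba is not installed
    def njit(func=None, **kwargs):
        if func is None:
            return lambda f: f
        return func

@njit(cache=True)
def xywh2xyxy(boxes):
    # Convert (cx, cy, w, h) boxes to (x1, y1, x2, y2).
    out = np.empty_like(boxes)
    out[:, 0] = boxes[:, 0] - boxes[:, 2] / 2
    out[:, 1] = boxes[:, 1] - boxes[:, 3] / 2
    out[:, 2] = boxes[:, 0] + boxes[:, 2] / 2
    out[:, 3] = boxes[:, 1] + boxes[:, 3] / 2
    return out
```

The first call pays a one-time compilation cost; subsequent calls run as native code, which is where the per-batch savings come from.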

Tested with the following code:

```python
import time

from ultralytics.models.yolo.detect import DetectionTrainer

start_time = time.time()

modelName = 'yolov8n.yaml'
overrides = {'epochs': 6, 'imgsz': 640, 'data': 'coco.yaml', 'model': modelName, 'batch': 8, 'close_mosaic': 3}
trainer = DetectionTrainer(overrides=overrides)
trainer.train()

end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")
```

[timing screenshot]
Overall, with my changes it took 9754 seconds to train for 6 epochs (3 with mosaic, 3 without), versus 10531 seconds without them.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Optimizations and improvements in YOLOv8 model processing and data augmentation techniques.

📊 Key Changes

  • Modified bounding box scaling to be more flexible and specific.
  • Implemented enhancements in data augmentation through numba for faster execution.
  • Streamlined and optimized various operations (flipping, translation, etc.) using numba.
  • Adjustments in dataset handling for more efficient processing and augmentation application.

🎯 Purpose & Impact

  • Enhanced Performance: The use of numba for just-in-time compilation significantly speeds up data processing, especially in augmentation tasks, leading to reduced training times.
  • Improved Accuracy: By refining how bounding boxes and images are scaled and transformed, the model can potentially achieve better training accuracy.
  • Flexible Data Handling: Changes in the way datasets and images are augmented allow for more complex and varied transformations, which can help the model generalize better over diverse data sets.
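As a rough illustration of the vectorized flipping mentioned above (a sketch with an assumed helper name, `fliplr_boxes`, not the PR's actual function), flipping a batch of xyxy boxes needs no per-box Python loop:

```python
import numpy as np

def fliplr_boxes(boxes, img_w):
    """Horizontally flip an (N, 4) array of xyxy boxes for an image of width img_w."""
    flipped = boxes.copy()
    flipped[:, 0] = img_w - boxes[:, 2]  # new x1 = width - old x2
    flipped[:, 2] = img_w - boxes[:, 0]  # new x2 = width - old x1
    return flipped
```

Operating on whole arrays like this lets NumPy do the work in C, which is the same effect the numba-decorated versions aim for.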

@glenn-jocher
Member

Thanks for this pull request and your efforts to enhance the training performance of the YOLOv8 models! 🚀 The optimizations to the image transformations seem particularly promising as they effectively cut down operational redundancy. Using numba for jit compilation is an intriguing choice, and while it does add external dependency, the speed improvements you've observed might justify its use.

Could you provide some benchmarks comparing the performance with and without the use of numba? This data could help in making a more informed decision about integrating numba more broadly within the codebase.

Here’s a slight modification to the test snippet to include timing for both scenarios:

```python
import time

from ultralytics.models.yolo.detect import DetectionTrainer

def train_model(use_numba):
    # NOTE: `use_numba` only labels the output; toggle your numba code paths
    # (e.g. via a module-level flag) before each call.
    start_time = time.time()

    overrides = {'epochs': 6, 'imgsz': 640, 'data': 'coco.yaml', 'model': 'yolov8n.yaml', 'batch': 8, 'close_mosaic': 3}
    trainer = DetectionTrainer(overrides=overrides)
    trainer.train()

    execution_time = time.time() - start_time
    print(f"Execution time with{'' if use_numba else 'out'} numba: {execution_time} seconds")

train_model(use_numba=True)
train_model(use_numba=False)
```

This modification helps to directly compare the impact of numba on the overall training process. Looking forward to your findings!

@glenn-jocher added the TODO label May 8, 2024
@glenn-jocher
Member

@Laughing-q interesting training speedup PR. I think we'd like to merge this without the numba addition as it may be hardware specific and we'd strongly prefer to avoid adding additional dependencies.


codecov bot commented May 9, 2024

Codecov Report

Attention: Patch coverage is 78.43137%, with 11 lines in your changes missing coverage. Please review.

Project coverage is 70.54%. Comparing base (b87ea6a) to head (4f1f18c).

❗ Current head 4f1f18c differs from pull request most recent head 92c63fe. Consider uploading reports for the commit 92c63fe to get more accurate results

Files Patch % Lines
ultralytics/utils/instance.py 63.63% 8 Missing ⚠️
ultralytics/data/augment.py 94.73% 1 Missing ⚠️
ultralytics/utils/loss.py 0.00% 1 Missing ⚠️
ultralytics/utils/ops.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #11718       +/-   ##
===========================================
+ Coverage   37.29%   70.54%   +33.24%     
===========================================
  Files         122      122               
  Lines       15635    15660       +25     
===========================================
+ Hits         5831    11047     +5216     
+ Misses       9804     4613     -5191     
Flag Coverage Δ
Benchmarks 35.31% <39.21%> (?)
GPU 37.27% <31.37%> (-0.03%) ⬇️
Tests 66.73% <78.43%> (?)

Flags with carried forward coverage won't be shown.

@edkazcarlson
Author

[timing screenshot]
Slightly slower than before when removing numba, but still faster than the current state of main.

@Burhan-Q added the enhancement label May 10, 2024
@glenn-jocher
Member

Thanks for the update! It's great to hear that the performance improved even without numba. If you can share the specific metrics or any additional insights from your latest tests, that would be helpful for finalizing the merge. Let's aim for the best balance of dependency minimization and performance enhancement. 🚀

@edkazcarlson
Author

> Thanks for the update! It's great to hear that the performance improved even without numba. If you can share the specific metrics or any additional insights from your latest tests, that would be helpful for finalizing the merge. Let's aim for the best balance of dependency minimization and performance enhancement. 🚀

In short my main changes currently are:

  1. Taking better advantage of vectorized methods in order to boost perf
  2. Not applying transforms that are passed a 0% probability of running, saving the cost of roughly (# of transforms × images × epochs) calls to random.random()

I don't have any specific metrics outside of wallclock time for the epochs. Is there something specific the team would want?
Thanks :)
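The second point can be sketched roughly as below. This is a hypothetical `Compose`; the real classes in `ultralytics/data/augment.py` differ in detail. Filtering zero-probability transforms out once, at construction time, avoids a per-image `random.random()` call for each of them:

```python
import random

class Compose:
    """Apply a list of (transform, probability) pairs to a sample."""

    def __init__(self, transforms):
        # Drop transforms with p == 0 up front: they can never fire, so we
        # avoid one random.random() call per dropped transform per image.
        self.transforms = [(t, p) for t, p in transforms if p > 0]

    def __call__(self, sample):
        for t, p in self.transforms:
            if p >= 1 or random.random() < p:
                sample = t(sample)
        return sample
```

The filtering happens once per dataloader construction, so the savings scale with images × epochs.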

@glenn-jocher
Member

Thanks for detailing the changes! The approach sounds solid, particularly your method to skip transformations when their probability is zero—definitely a smart optimization. 😊

For metrics, if you could provide us with a comparison in wall-clock time between the main branch and your changes (i.e., how much total time each epoch takes on average) across a few runs, that would be ideal. This will help us quantitatively assess the impact of your improvements.

Keep up the fantastic work! Looking forward to integrating these enhancements. 🌟
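A minimal harness for that comparison might look like this (a generic sketch; `run_epoch` is a stand-in for one training epoch, not an Ultralytics API):

```python
import statistics
import time

def time_epochs(run_epoch, n_epochs=3):
    """Return per-epoch wall-clock times and their mean for one configuration."""
    times = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        run_epoch()
        times.append(time.perf_counter() - start)
    return times, statistics.mean(times)
```

Running it once on main and once on the PR branch, with identical overrides, gives the per-epoch averages requested above.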

@Laughing-q
Member

Laughing-q commented May 13, 2024

@edkazcarlson Hi, thanks for the PR!
I tested your changes locally, but the results I got are almost the same as the main branch. Here are my results:
on main branch: [timing screenshot]
on current PR: [timing screenshot]
And my testing command:

```shell
yolo train detect data=coco.yaml model=yolov8m.yaml batch=64 epochs=4 close_mosaic=2 device=0,1,2,3
```

@edkazcarlson
Author

edkazcarlson commented May 15, 2024

@Laughing-q Thank you for the tests on your hardware. Could you try running this through Python itself? I'm not sure where exactly the yolo command is pointing; could it be resolving to the install you have through pip rather than my branch? (Can you confirm with which?)
I haven't had much of a chance between work and other things to do in-depth tests, but from an initial comparison of my old tests against some new perf tests, the new main seems slower: checking out an old commit (ex: 1365fe9) appears faster than this branch merged with main.
I plan to keep investigating this slowdown in the merged commits, but should we merge this into main for the time being so the branch doesn't go stale? I don't see this PR hurting perf in any way (I can change the title just for better record keeping).
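A quick way to check which install is actually in use from Python (nothing here is Ultralytics-specific; it only inspects the import system and PATH):

```python
import importlib.util
import shutil

# Where would `import ultralytics` resolve from? A site-packages path means
# the pip install, not a local checkout of this branch.
spec = importlib.util.find_spec("ultralytics")
print(spec.origin if spec else "ultralytics not importable")

# Which `yolo` executable is first on PATH (i.e. the one the shell runs)?
print(shutil.which("yolo"))
```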

@glenn-jocher removed the TODO label May 15, 2024
@edkazcarlson changed the title from "Performance improvement to improve training time by up to ~14%" to "Performance improvement to improve training time" May 15, 2024
@edkazcarlson
Author

Also, a side question: I know numba got declined due to hardware-compatibility reasons, but does the team accept Cython improvements?

@glenn-jocher
Member

Hi there! Thanks for your continued contributions and for checking in about Cython. Yes, we're open to considering Cython improvements as they can be a great way to enhance performance while maintaining compatibility across various hardware setups. If you have specific optimizations in mind using Cython, feel free to share them or open a PR. We'd love to take a look! 🚀
