stuck training on NVIDIA H100 #13010

SoraJung · 2024-05-14T06:57:16Z

Search before asking

I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

I am training my custom dataset on a single NVIDIA H100 (80GB HBM3, 81008MiB), but training gets stuck right after the model summary is printed.
The same training works well on NVIDIA GeForce RTX 2080 Ti and RTX 3090.

I don't know why it does not work on H100.
I need your help.

Training command:
`
root@548fdf5867cc:/usr/src/app# python train.py
train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=data/hyps, resume_evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v7.0-312-g1bcd17ee Python-3.10.9 torch-2.0.0 CUDA:0 (NVIDIA H100 80GB HBM3, 81008MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

Dataset not found ⚠️, missing paths ['/usr/src/datasets/coco128/images/train2017']
Downloading https://ultralytics.com/assets/coco128.zip to coco128.zip...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.66M/6.66M [00:01<00:00, 6.83MB/s]
Dataset download success ✅ (3.4s), saved to /usr/src/datasets

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model summary: 214 layers, 7235389 parameters, 7235389 gradients, 16.6 GFLOPs
`

Additional

No response


github-actions · 2024-05-14T06:57:39Z

👋 Hello @SoraJung, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics

glenn-jocher · 2024-05-14T08:32:41Z

@SoraJung hey there! It seems like you're encountering an issue with training on the NVIDIA H100 GPU. Given that it works well on RTX 2080 Ti and RTX 3090, there are a few possibilities to consider:

Driver Compatibility: Ensure that your NVIDIA drivers and CUDA are compatible with the H100. The H100 is a newer and more advanced GPU, which can sometimes need different driver settings or updates compared to older GPUs like the 2080 Ti or 3090.
PyTorch Version: Since you're using PyTorch 2.0.0, check for any known issues with PyTorch that specifically affect new GPU models like the H100. Sometimes updating or rolling back PyTorch can solve these compatibility issues.
CUDA Version: Double-check that you're using a CUDA version that fully supports your hardware. The H100 might require a newer CUDA toolkit, which should also be compatible with your current software and drivers.
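
The three checks above can be gathered in one short script. The following is a minimal sketch, not part of YOLOv5: the helper names `arch_supported` and `report` are hypothetical, and the check is a simplified exact match that ignores PTX forward compatibility. The H100 has compute capability 9.0, so the script looks for `sm_90` among the architectures the installed PyTorch build was compiled for. A missing `sm_90` is only one possible cause of the hang; nothing in this thread confirms it.

```python
"""Report the PyTorch/CUDA details relevant to GPU compatibility."""


def arch_supported(capability, arch_list):
    """Return True if a (major, minor) compute capability appears in arch_list.

    arch_list uses the 'sm_XY' names returned by torch.cuda.get_arch_list().
    Simplified exact-match check (assumption): PTX forward compatibility is ignored.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def report():
    """Print version and device details; requires PyTorch to be installed."""
    import torch  # imported lazily so arch_supported() works without PyTorch

    print("torch:", torch.__version__)
    print("CUDA version PyTorch was built with:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("CUDA is not available to PyTorch")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", capability)
    print("compiled architectures:", arch_list)
    print("architecture listed:", arch_supported(capability, arch_list))


if __name__ == "__main__":
    report()
```

Running this inside the same container as `train.py`, together with the driver version shown by `nvidia-smi`, gives the details needed to compare the H100 setup with the working RTX setups.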

If everything seems to be in order and the issue persists, could you provide any specific error message or output that stops your training? This might give more insight into what's going wrong. Thanks! 🚀
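
When a run hangs silently there is no error message to share, but Python's standard `faulthandler` module can write a stack trace after a timeout, which shows where the process is waiting. The sketch below is illustrative only: `dump_stack_if_stuck` and `slow_step` are hypothetical names, and `slow_step` merely stands in for whatever call appears to hang.

```python
import faulthandler
import tempfile
import time


def dump_stack_if_stuck(work, timeout_s, log_file):
    """Run work(); if it is still running after timeout_s seconds,
    write the stack of every thread to log_file."""
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file)
    try:
        return work()
    finally:
        # Cancel the pending dump if work() finished before the timeout.
        faulthandler.cancel_dump_traceback_later()


def slow_step():
    # Hypothetical stand-in for a call that appears to hang.
    time.sleep(0.5)
    return "done"


# faulthandler needs a real file descriptor, so use a temporary file.
with tempfile.TemporaryFile(mode="w+") as log:
    result = dump_stack_if_stuck(slow_step, timeout_s=0.1, log_file=log)
    log.seek(0)
    trace = log.read()

print(trace)  # the traceback names the function the process was waiting in
```

For an actual training run, one option (an assumption, not something YOLOv5 does itself) is to call `faulthandler.dump_traceback_later(120, repeat=True)` near the top of `train.py`, so the stack is printed to stderr every two minutes while the process is stuck. Posting that trace would show which step follows the model summary.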

SoraJung added the question Further information is requested label May 14, 2024
