
[BUG] deeplake.util.exceptions.ReadSampleFromChunkError #2741

Open

Pumbaa-peng opened this issue Jan 15, 2024 · 4 comments
Labels
bug Something isn't working

Comments

Pumbaa-peng commented Jan 15, 2024

Severity

P0 - Critical breaking issue or missing functionality

Current Behavior

I am using torch.distributed.DistributedSampler(dataset, shuffle=shuffle) to build a dataloader whose dataset is read from Deep Lake, and I load the Deep Lake dataset in the dataset class's __init__(). But when I iterate over the dataloader, I get the following error: deeplake.util.exceptions.ReadSampleFromChunkError: Unable to read sample at index 97 from chunk 'images/chunks/bc4c02f9eec3464e' in tensor images.

Steps to Reproduce

class LoadDeeplakeImagesAndLabels(Dataset):
    # YOLOv5 train_loader/val_loader, loads images and labels for training and validation
    cache_version = 0.6  # dataset labels *.cache version
    rand_interp_methods = [cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC, cv2.INTER_AREA, cv2.INTER_LANCZOS4]

    def __init__(self,
                 path,
                 img_size=640,
                 batch_size=16,
                 augment=False,
                 hyp=None,
                 rect=False,
                 image_weights=False,
                 cache_images=False,
                 single_cls=False,
                 stride=32,
                 pad=0.0,
                 min_items=0,
                 prefix=''):
        self.img_size = img_size
        self.augment = augment
        self.hyp = hyp
        self.image_weights = image_weights
        self.rect = False if image_weights else rect
        self.mosaic = self.augment and not self.rect  # load 4 images at a time into a mosaic (only during training)
        self.mosaic_border = [-img_size // 2, -img_size // 2]
        self.stride = stride
        self.path = path
        self.albumentations = Albumentations(size=img_size) if augment else None

        # print("self.img_size", self.img_size)
        # print("self.augment", self.augment)
        # print("self.image_weights", self.image_weights)
        # print("self.rect", self.rect)
        # print("self.mosaic", self.mosaic)

        # read the dataset
        f = []
        label_path_f = []
        label_f = []
        shape_f = []
        
        username = 'zhanglisheng'
        passwd = '***********'
        if path.endswith("train"):
            dest = f's3://{username}/yolomix-train'
        else:
            dest = f's3://{username}/yolomix-val'

        creds = {
            'aws_access_key_id': username,
            'aws_secret_access_key': passwd,
            'endpoint_url': 'http://172.24.**.**:9000'
        }
        
        # dest = 's3://admin/yolo-mix-train'
        self.dest = dest
        self.creds = creds
        ds = deeplake.load(dest, creds=creds, read_only=True)
        self.ds = ds

        sample_num = 128
        # sample_num = len(ds)

        for i in tqdm(range(sample_num)):
            labels = ds['labels'][i].numpy()
            boxes = ds['boxes'][i].numpy()
            c = np.vstack([labels, boxes.T]).T
            label_f.append(c)
            shapes = ds['shapes'][i].text()
            w = int(shapes.split(':')[0])
            h = int(shapes.split(':')[1])
            shape_f.append([w, h])
            f.append(str(i) + '.jpg')
            label_path_f.append(str(i) + '.txt')

        self.im_files = f
        self.label_files = label_path_f
        self.labels = label_f
        self.shapes = np.array(shape_f)
        list_of_empty_lists = [[] for _ in range(len(self.im_files))]
        self.segments = tuple(list_of_empty_lists)

    def __getitem__(self, index):
        index = self.indices[index]  # linear, shuffled, or image_weights

        hyp = self.hyp
        mosaic = self.mosaic and random.random() < hyp['mosaic']
        if mosaic:
            # Load mosaic
            img, labels = self.load_mosaic(index)
            shapes = None

            # MixUp augmentation
            if random.random() < hyp['mixup']:
                img, labels = mixup(img, labels, *self.load_mosaic(random.randint(0, self.n - 1)))

        else:
            # Load image
            img, (h0, w0), (h, w) = self.load_image(index)

            # Letterbox
            shape = self.batch_shapes[self.batch[index]] if self.rect else self.img_size  # final letterboxed shape
            img, ratio, pad = letterbox(img, shape, auto=False, scaleup=self.augment)
            shapes = (h0, w0), ((h / h0, w / w0), pad)  # for COCO mAP rescaling

            labels = self.labels[index].copy()
            if labels.size:  # normalized xywh to pixel xyxy format
                labels[:, 1:] = xywhn2xyxy(labels[:, 1:], ratio[0] * w, ratio[1] * h, padw=pad[0], padh=pad[1])

            if self.augment:
                img, labels = random_perspective(img,
                                                 labels,
                                                 degrees=hyp['degrees'],
                                                 translate=hyp['translate'],
                                                 scale=hyp['scale'],
                                                 shear=hyp['shear'],
                                                 perspective=hyp['perspective'])

        nl = len(labels)  # number of labels
        if nl:
            labels[:, 1:5] = xyxy2xywhn(labels[:, 1:5], w=img.shape[1], h=img.shape[0], clip=True, eps=1E-3)

        if self.augment:
            # Albumentations
            img, labels = self.albumentations(img, labels)
            nl = len(labels)  # update after albumentations

            # HSV color-space
            augment_hsv(img, hgain=hyp['hsv_h'], sgain=hyp['hsv_s'], vgain=hyp['hsv_v'])

            # Flip up-down
            if random.random() < hyp['flipud']:
                img = np.flipud(img)
                if nl:
                    labels[:, 2] = 1 - labels[:, 2]

            # Flip left-right
            if random.random() < hyp['fliplr']:
                img = np.fliplr(img)
                if nl:
                    labels[:, 1] = 1 - labels[:, 1]

            # Cutouts
            # labels = cutout(img, labels, p=0.5)
            # nl = len(labels)  # update after cutout

        labels_out = torch.zeros((nl, 6))
        if nl:
            labels_out[:, 1:] = torch.from_numpy(labels)

        # Convert
        img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
        img = np.ascontiguousarray(img)

        return torch.from_numpy(img), labels_out, self.im_files[index], shapes

    def load_image(self, i):
        # Loads 1 image from dataset index 'i', returns (im, original hw, resized hw)
        im, f, fn = self.ims[i], self.im_files[i], self.npy_files[i]
        if im is None:  # not cached in RAM
            if fn.exists():  # load npy
                im = np.load(fn)
            else:  # read image
                image = self.ds['images'][i]
                im = image.data()['value']
                # im = cv2.imread(f)  # BGR
                assert im is not None, f'Image Not Found {f}'
            h0, w0 = im.shape[:2]  # orig hw
            r = self.img_size / max(h0, w0)  # ratio
            if r != 1:  # if sizes are not equal
                interp = cv2.INTER_LINEAR if (self.augment or r > 1) else cv2.INTER_AREA
                im = cv2.resize(im, (math.ceil(w0 * r), math.ceil(h0 * r)), interpolation=interp)
            return im, (h0, w0), im.shape[:2]  # im, hw_original, hw_resized
        return self.ims[i], self.im_hw0[i], self.im_hw[i]  # im, hw_original, hw_resized

If I load the Deep Lake dataset in the __init__() function of the dataset class and then access it in __getitem__(), I hit this problem.
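
A pattern that often avoids sharing a storage handle across DataLoader worker processes (a minimal sketch, not a fix confirmed in this thread; the class name and the length parameter are hypothetical) is to open the Deep Lake dataset lazily in each worker instead of in __init__():

import deeplake
from torch.utils.data import Dataset

class LazyDeeplakeDataset(Dataset):
    """Sketch: re-open the Deep Lake dataset once per worker process
    instead of pickling the handle created in the parent process."""

    def __init__(self, dest, creds, length):
        self.dest = dest      # same s3:// path as above
        self.creds = creds    # same creds dict as above
        self.length = length  # hypothetical: number of samples, known up front
        self._ds = None       # opened lazily, once per worker

    @property
    def ds(self):
        if self._ds is None:
            self._ds = deeplake.load(self.dest, creds=self.creds, read_only=True)
        return self._ds

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        # Each worker opens its own connection on first access.
        return self.ds['images'][i].numpy()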

Expected/Desired Behavior

A custom dataloader that reads data from Deep Lake and supports distributed training should work.
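
For reference, the wiring being attempted presumably looks something like this (a minimal sketch; create_dataloader, batch_size, and workers are placeholders, not the exact code from the attached file):

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def create_dataloader(dataset, batch_size=16, shuffle=True, workers=4):
    # Under torch.distributed, each rank gets a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset, shuffle=shuffle) if dist.is_initialized() else None
    return DataLoader(dataset,
                      batch_size=batch_size,
                      shuffle=shuffle and sampler is None,  # sampler and shuffle are mutually exclusive
                      sampler=sampler,
                      num_workers=workers,
                      pin_memory=True)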

Python Version

Python 3.10.9 (main, Mar 8 2023, 10:47:38) [GCC 11.2.0] on linux

OS

No response

IDE

No response

Packages

No response

Additional Context

No response

Possible Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR (Thank you!)
Pumbaa-peng added the bug label on Jan 15, 2024
nvoxland-al (Contributor) commented

There seems to be an issue with the images/chunks/bc4c02f9eec3464e file. If you request specific samples that are not in that file, you don't hit the problem, but iterating over all of them will eventually reach that file and fail.

That chunk lives in your private S3 bucket and was previously created with different code, right? Had it originally been working before that file started throwing an exception, or was it failing from the original load?
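
One way to check whether the chunk itself is unreadable, independent of multi-process loading, is to scan the tensor sample by sample in a single process (a sketch; dest and creds stand in for the values from the report):

import deeplake

dest = 's3://zhanglisheng/yolomix-train'  # as in the report
creds = {'aws_access_key_id': '...', 'aws_secret_access_key': '...',
         'endpoint_url': 'http://...'}   # as in the report

ds = deeplake.load(dest, creds=creds, read_only=True)
bad = []
for i in range(len(ds['images'])):
    try:
        ds['images'][i].numpy()
    except Exception as e:  # e.g. ReadSampleFromChunkError / GetChunkError
        bad.append((i, repr(e)))
print(f'{len(bad)} unreadable samples:', bad[:10])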

Pumbaa-peng (Author) commented Jan 19, 2024

Thank you very much for your response. For the same dataset stored in MinIO (a private S3 bucket), I have no problem accessing it from a single Python thread, which means my data is not corrupted. But when I use torch's distributed dataloader, multi-process access fails:
Starting training for 4 epochs...

  Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size

0%| | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/yolov5-pengwenchuang/train_with_config.py", line 682, in <module>
    main(opt)
  File "/yolov5-pengwenchuang/train_with_config.py", line 572, in main
    train(opt.hyp, opt, device, callbacks)
  File "/yolov5-pengwenchuang/train_with_config.py", line 320, in train
    for i, (imgs, targets, paths, _) in pbar:  # batch -------------------------------------------------------------
  File "/yolov5-pengwenchuang/utils/dataloaders.py", line 178, in __iter__
    yield next(self.iterator)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
deeplake.util.exceptions.GetChunkError: Unable to get chunk 'Caught GetChunkError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/chunk_engine.py", line 596, in get_chunk_from_chunk_id
    chunk = self.get_chunk(chunk_key, partial_chunk_bytes=partial_chunk_bytes)
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/chunk_engine.py", line 578, in get_chunk
    chunk = self.cache.get_deeplake_object(
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/storage/lru_cache.py", line 149, in get_deeplake_object
    obj = expected_class.frombuffer(buff, meta, partial=True)
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/chunk/base_chunk.py", line 244, in frombuffer
    version, shapes, byte_positions, data = deserialize_chunk(buffer, copy=copy)
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/serialize.py", line 234, in deserialize_chunk
    version = str(byts[1 : 1 + len_version], "ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/chunk_engine.py", line 2345, in _numpy
    sample = self.get_single_sample(
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/chunk_engine.py", line 2086, in get_single_sample
    sample = self.get_non_tiled_sample(
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/chunk_engine.py", line 2040, in get_non_tiled_sample
    return self.get_basic_sample(
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/chunk_engine.py", line 2016, in get_basic_sample
    return self.read_basic_sample_from_chunk(
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/chunk_engine.py", line 1988, in read_basic_sample_from_chunk
    chunk = self.get_chunk_from_chunk_id(
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/chunk_engine.py", line 603, in get_chunk_from_chunk_id
    raise GetChunkError(chunk_key) from e
deeplake.util.exceptions.GetChunkError: Unable to get chunk 'images/chunks/9acad36088dd4b07'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/yolov5-pengwenchuang/utils/dataloaders.py", line 967, in __getitem__
    img, labels = self.load_mosaic(index)
  File "/yolov5-pengwenchuang/utils/dataloaders.py", line 1098, in load_mosaic
    img, _, (h, w) = self.load_image(index)
  File "/yolov5-pengwenchuang/utils/dataloaders.py", line 1067, in load_image
    im = self.ds['images'][i].numpy()
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/tensor.py", line 866, in numpy
    ret = self.chunk_engine.numpy(
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/chunk_engine.py", line 1910, in numpy
    return (self._sequence_numpy if self.is_sequence else self._numpy)(
  File "/opt/conda/lib/python3.10/site-packages/deeplake/core/chunk_engine.py", line 2352, in _numpy
    raise GetChunkError(
deeplake.util.exceptions.GetChunkError: Unable to get chunk 'images/chunks/9acad36088dd4b07' while retrieving data at index 97 in tensor images.
'.
Exception in thread Thread-4 (_pin_memory_loop):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in _pin_memory_loop
    do_one_step()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in do_one_step
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 307, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/conda/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/opt/conda/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 508, in Client
    answer_challenge(c, authkey)
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 752, in answer_challenge
    message = connection.recv_bytes(256)  # reject large message
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1702616 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1702617) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_with_config.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-01-19_05:50:35
host : 44401ebcee75
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1702617)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Pumbaa-peng (Author) commented

The attachment contains the code that creates the torch.utils.data.Dataset (LoadDeeplakeImagesAndLabels) and the dataloader (create_dataloader) using Deep Lake:
dataloaders.pdf

Pumbaa-peng (Author) commented

I found that the above problem can be solved by calling deeplake.load() with the access_method parameter set to 'local:4', but this way there is no guarantee that the most recent version of the dataset is used every time.
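
For readers hitting the same issue, that workaround amounts to something like this (a sketch; 'local:4' downloads the dataset with 4 workers and reads from the local copy, which is why it can lag behind the remote version):

import deeplake

dest = 's3://zhanglisheng/yolomix-train'  # as in the report
creds = {'aws_access_key_id': '...', 'aws_secret_access_key': '...',
         'endpoint_url': 'http://...'}   # as in the report

# Download the dataset once (4 download workers) and read from the local copy.
# Note: later loads reuse the local copy, so it may be stale relative to the
# latest version of the remote dataset.
ds = deeplake.load(dest, creds=creds, read_only=True, access_method='local:4')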
