
[BUG] tensordict.pad_sequence silently ignores non-tensor attributes in tensorclasses or TensorDicts #783

Open
egaznep opened this issue May 18, 2024 · 4 comments · Fixed by #784

egaznep commented May 18, 2024

Describe the bug

I have some tensorclasses that store an audio file along with metadata such as speaker ID and utterance ID. I would like to collate these tensorclasses into a batch; however, when I do so, the metadata is silently lost: the metadata from the first tensordict is repeated for every item in the batch, and the user is not warned about this.

To Reproduce

Steps to reproduce the behavior.

from tensordict import pad_sequence, TensorDict
import torch

d1 = TensorDict({'a': torch.tensor([0]), 'b': ['asd']})
d2 = TensorDict({'a': torch.tensor([0]), 'b': ['efg']})

pad_sequence([d1, d2])
TensorDict(
    fields={
        a: Tensor(shape=torch.Size([2, 1]), device=cpu, dtype=torch.int64, is_shared=False),
        b: NonTensorData(data=asd, batch_size=torch.Size([2]), device=None)},
    batch_size=torch.Size([2]),
    device=None,
    is_shared=False)

Expected behavior

I should either get a properly joined tensordict, e.g.,

TensorDict(
    fields={
        a: Tensor(shape=torch.Size([2, 1]), device=cpu, dtype=torch.int64, is_shared=False),
        b: NonTensorData(data=['asd', 'efg'], batch_size=torch.Size([2]), device=None)},  # a list with the same shape as the batch_size
    batch_size=torch.Size([2]),
    device=None,
    is_shared=False)

or tensordict.pad_sequence should warn the user that the metadata is being discarded.
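
In the meantime, a minimal workaround sketch (plain PyTorch, not part of the tensordict API; the name torch_pad_sequence is my own alias): pad the tensor fields with torch.nn.utils.rnn.pad_sequence and carry the non-tensor metadata alongside the batch as an ordinary Python list.

import torch
from torch.nn.utils.rnn import pad_sequence as torch_pad_sequence

samples = [
    {'a': torch.tensor([1, 1]), 'b': 'asd'},
    {'a': torch.tensor([2]), 'b': 'efg'},
]

# Pad the variable-length tensors into one [batch, max_len] tensor.
batch_a = torch_pad_sequence([s['a'] for s in samples], batch_first=True)
# Keep the metadata as a plain list, one entry per batch item.
batch_b = [s['b'] for s in samples]

print(batch_a)  # tensor([[1, 1], [2, 0]])
print(batch_b)  # ['asd', 'efg']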


System info

tensordict-nightly              2024.5.18


Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)
egaznep added the bug (Something isn't working) label May 18, 2024

vmoens commented May 20, 2024

If you pass a list, it will be cast to a numpy ndarray (this is something we can reconsider in the future).
But if you use a plain string, the following code will do what you want, I think (given #784):

from tensordict import pad_sequence, TensorDict
import torch

d1 = TensorDict({'a': torch.tensor([1, 1]), 'b': 'asd'})
d2 = TensorDict({'a': torch.tensor([2]), 'b': 'efg'})

print(d1['b'])                      # 'asd'
print(pad_sequence([d1, d2]))       # 'a' is padded to shape [2, 2]
print(pad_sequence([d1, d2])['b'])  # with #784, both strings are kept instead of only 'asd'

vmoens linked a pull request May 20, 2024 that will close this issue

egaznep commented May 20, 2024


Tested this and indeed, it works! Thank you for the quick and neat fix 🙂 Would this change make its way into the next nightly release?


vmoens commented May 22, 2024

Sorry, I dropped the ball on this :(
The PR is almost ready, but there's a non-trivial issue with Persistent (H5) tensordicts that needs to be solved before merging. I'll do my best to do it today!


egaznep commented May 23, 2024

Hi again,

I noticed that this doesn't work for tensorclasses.

MWE:

import torch
from tensordict import pad_sequence, tensorclass

@tensorclass
class Sample:
    a: torch.Tensor
    b: str

d1 = Sample(a=torch.tensor([1, 1]), b='asd', batch_size=[])
d2 = Sample(a=torch.tensor([2]), b='efg', batch_size=[])
print(pad_sequence([d1, d2])[1].b)  # gives you 'asd' and not 'efg'
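
Until then, a possible workaround sketch (my own suggestion, not an official API): gather the non-tensor field from the inputs by hand and keep it alongside the padded batch.

batch = pad_sequence([d1, d2])     # the tensor field 'a' is padded as expected
batch_b = [s.b for s in (d1, d2)]  # ['asd', 'efg'], collected manually
print(batch_b[1])                  # 'efg'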

vmoens reopened this May 25, 2024