Modifying nested column will have no effect #207

wenleix · 2022-02-18T03:48:06Z

To reproduce:

import torcharrow as ta
import torcharrow.dtypes as dt
dtype = dt.Struct(
    [
        dt.Field("labels", dt.int8),
        dt.Field("dense_features", dt.Struct([dt.Field("int_1", dt.int32), dt.Field("int_2", dt.int32)])),
    ]
)
df = ta.DataFrame(
    [
        (1, (0, 1)),
        (0, (10, 11)),
        (1, (20, 21)),
    ],
    dtype=dtype)

Now df looks like:

>>> df
  index    labels  dense_features
-------  --------  ----------------
      0         1  (0, 1)
      1         0  (10, 11)
      2         1  (20, 21)
dtype: Struct([Field('labels', int8), Field('dense_features', Struct([Field('int_1', int32), Field('int_2', int32)]))]), count: 3, null_count: 0

Trying to change df["dense_features"]["int_1"] has no effect:

>>> df["dense_features"]["int_1"] = df["dense_features"]["int_1"] + 1
>>> df
  index    labels  dense_features
-------  --------  ----------------
      0         1  (0, 1)
      1         0  (10, 11)
      2         1  (20, 21)
dtype: Struct([Field('labels', int8), Field('dense_features', Struct([Field('int_1', int32), Field('int_2', int32)]))]), count: 3, null_count: 0

For now, the workaround is to first get the nested DataFrame out, apply the transformation, and then put it back:

https://github.com/facebookresearch/torcharrow/blob/6d2bca82e65f74193360bd06c5ab4f8c761c5342/torcharrow/test/integration/test_criteo.py#L149-L157

The problem is that DataFrameCpu._set_field_data generates a new RowVector and copies the column vector pointers. For a nested RowVector, it only updates the leaf-level struct and doesn't propagate the change upwards: https://github.com/facebookresearch/torcharrow/blob/6d2bca82e65f74193360bd06c5ab4f8c761c5342/torcharrow/velox_rt/dataframe_cpu.py#L310-L329
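
To make the mechanism concrete, here is a minimal pure-Python sketch of the behavior described above. It is a toy model, not TorchArrow or Velox code: the `RowVector` and `Frame` classes are hypothetical stand-ins, and plain lists stand in for column vectors. It only illustrates why a chained assignment is lost when each wrapper rebuilds its own vector, and why the take-out / modify / put-back workaround succeeds.

```python
# Toy model, NOT TorchArrow code: every name here is hypothetical.

class RowVector:
    """Stand-in for a struct vector that holds its children by reference."""
    def __init__(self, children):
        self.children = dict(children)


class Frame:
    """Stand-in for a DataFrame wrapper around a RowVector."""
    def __init__(self, data):
        self._data = data

    def __getitem__(self, name):
        child = self._data.children[name]
        # A nested struct is wrapped in a fresh Frame on every access.
        return Frame(child) if isinstance(child, RowVector) else child

    def __setitem__(self, name, value):
        children = dict(self._data.children)
        children[name] = value._data if isinstance(value, Frame) else value
        # Rebuild the vector and rebind it on this wrapper only;
        # nothing tells a parent vector about the replacement.
        self._data = RowVector(children)


inner = RowVector({"int_1": [0, 10, 20], "int_2": [1, 11, 21]})
df = Frame(RowVector({"labels": [1, 0, 1], "dense_features": inner}))

# Chained assignment: the update lands on a temporary wrapper and is lost.
df["dense_features"]["int_1"] = [v + 1 for v in df["dense_features"]["int_1"]]
lost = df["dense_features"]["int_1"]

# Workaround: take the nested frame out, modify it, assign it back.
nested = df["dense_features"]
nested["int_1"] = [v + 1 for v in nested["int_1"]]
df["dense_features"] = nested
kept = df["dense_features"]["int_1"]
```

In this model `lost` still holds the original values while `kept` holds the incremented ones, which mirrors the reproduction and the workaround in this issue.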

Creating a new RowVector seems necessary, since assigning a column to a DataFrame may change the child column's type. One idea would be to allow the wrapped RowColumn to change its delegated RowVector (e.g. something like self._data._reset_data(new_delegate)). Basically, DataFrame would be a thin wrapper and everything would live in RowColumn.

For this to work, DataFrame.dtype should always use the underlying Velox Vector's type as the ground truth.
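
One possible reading of this idea, again as a pure-Python toy and not as actual TorchArrow code, is sketched below. Only the name `_reset_data` comes from the suggestion above; the `RowColumn` and `Frame` classes, their attributes, and the dict-based `dtype` are assumptions made for illustration. The sketch assumes the parent holds a reference to the nested `RowColumn` object itself, so swapping the delegate in place is visible from every wrapper, and that `dtype` is always derived from the data currently held.

```python
# Toy model of the proposed direction, NOT TorchArrow code.
# All names except _reset_data (from the suggestion above) are hypothetical.

class RowColumn:
    """Shared, mutable holder for a struct's children."""
    def __init__(self, children):
        self._children = dict(children)

    def _reset_data(self, new_children):
        # Swap the delegate in place so every holder of this RowColumn sees it.
        self._children = dict(new_children)

    @property
    def dtype(self):
        # Derive the type from the data actually held, never from a cached copy.
        return {
            name: child.dtype if isinstance(child, RowColumn)
            else type(child[0]).__name__
            for name, child in self._children.items()
        }


class Frame:
    """Thin wrapper: all state lives in the wrapped RowColumn."""
    def __init__(self, column):
        self._data = column

    @property
    def dtype(self):
        return self._data.dtype

    def __getitem__(self, name):
        child = self._data._children[name]
        # Wrap the same RowColumn object, not a copy.
        return Frame(child) if isinstance(child, RowColumn) else child

    def __setitem__(self, name, value):
        children = dict(self._data._children)
        children[name] = value._data if isinstance(value, Frame) else value
        self._data._reset_data(children)


inner = RowColumn({"int_1": [0, 10, 20], "int_2": [1, 11, 21]})
df = Frame(RowColumn({"labels": [1, 0, 1], "dense_features": inner}))

# Chained assignment now reaches the shared nested column.
df["dense_features"]["int_1"] = [v + 1 for v in df["dense_features"]["int_1"]]
updated = df["dense_features"]["int_1"]

# A change of element type shows up in dtype without extra bookkeeping.
df["dense_features"]["int_2"] = [1.0, 11.0, 21.0]
nested_dtype = df.dtype["dense_features"]
```

The design choice being illustrated is that the wrapper never caches a type or a vector of its own, so a nested update cannot drift out of sync with what the parent reports. Whether the real RowVector can be swapped this way is not settled here.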

wenleix added the "bug" label on Feb 18, 2022