Support categorical data types for Parquet #5706
Comments
Hi ! We could definitely add a type that holds the categories and uses DictionaryType storage. There's a ClassLabel type that is similar, with a 'names' parameter (similar to id2label in deep learning frameworks), that uses an integer array as storage. It can be added in
|
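For reference, here is a minimal pure-Python sketch of the ClassLabel-style mapping described above. The class name `TinyClassLabel` is illustrative only, not part of `datasets`; it just mirrors the `str2int`/`int2str` behavior that `ClassLabel` provides.

```python
# A hypothetical, minimal sketch of what a ClassLabel-style mapping does:
# store an integer per value plus a shared list of category names.
class TinyClassLabel:
    def __init__(self, names):
        self.names = list(names)
        self._str2int = {n: i for i, n in enumerate(self.names)}

    def str2int(self, value: str) -> int:
        # Encode a string label to its integer index for storage.
        return self._str2int[value]

    def int2str(self, index: int) -> str:
        # Decode an integer index back to its string label.
        return self.names[index]

labels = TinyClassLabel(["cat", "dog", "bird"])
assert labels.str2int("dog") == 1
assert labels.int2str(2) == "bird"
```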
@kklemon did you implement this? Otherwise I would like to give it a try |
@mhattingpete no, I haven't had time for this so far. Feel free to work on this. |
#self-assign |
Hi, this is a really useful feature, has this been implemented yet? |
Hey folks -- I'm thinking about trying a PR for this. As far as I can tell, the only sticking point is that auto-generation of features from a pyarrow schema will fail under the current implementation. I see two ways to solve this. Option 1 is to require datasets with categorical types to use pyarrow schema metadata to encode the entire HF feature dictionary; that way categorical types never need to be inferred from the pa type alone. The downside is that these datasets will be a bit brittle: if the feature encoding API ever changes, they will suddenly become unloadable. The other option is to modify the feature-inference logic so that categorical features can be recovered from the pa dictionary type directly. Does anyone at HF have a preference between these two (or alternate) approaches? |
Maybe we don't need to store the string label -> int map in the categorical for the corresponding datasets feature ? |
I think that does need to be stored in the Feature object. Similar to how `ClassLabel` needs the class names for some of the provided default functionality (e.g., encoding or decoding values), a categorical feature needs the same. Without storing that information, would you suggest that categorical features just be stored internally as integer arrays?
|
Well IIRC you can concatenate two Arrow arrays with different dictionaries together. But for datasets this would mean updating the datasets features when concatenating two arrays of the same type, which is not supported right now. That's why it would be nice if there were a way to do it without storing the mapping in the feature object. For decoding we do have the string<->integer mapping from the array dictionary attribute, so we're fine. For encoding, I think it can work if we only encode when converting python objects to pyarrow in TypedSequence.__arrow_array__ in arrow_writer.py: convert the python objects to a pyarrow array and then use the dictionary_encode method. Another concern about concatenation: I noticed pyarrow creates the new dictionary and indices in memory when concatenating two dictionary-encoded arrays. This can be a problem for big datasets, and we should probably use ChunkedArray objects instead. This can surely be taken care of in array_concat in table.py. cc @mariosasko in case you have other ideas |
Hmm, that is a good point. What if we implemented this feature first in a manner that didn't allow concatenation of arrays with different index-to-category maps? Then concatenation would be very straightforward, and I think this is reasonable if the index-to-category map is stored in the schema as well. Obviously, this limits how folks could use the feature, but they can always fall back to raw strings if needed, and as usage increases we'll have more data to see what the right solution here is.
|
@lhoestq @mariosasko just re-pinging on this so I can push forward here. What are your thoughts on disallowing concatenation of categorical arrays for now, such that the index-to-category map can be stored in the schema metadata? And/or what other approaches should be taken here? |
I think the easiest for now would be to have datasets cast dictionary-encoded columns to strings when loading |
I mean, that would certainly be easiest, but I don't think it really solves this issue in a meaningful way. It just shifts the burden of string conversion from the user to HF Datasets, without actually enabling HF Datasets to take advantage of the (very significant) storage and associated speed/memory savings offered by categorical types. Given that those savings are what is of real interest here, I think keeping it explicit that this is not supported (and forcing the user to do the conversion) might actually be better, so that the problem stays top of mind. Is there an objection to supporting categorical types explicitly in the manner I outlined above, where an error is raised if you try to concat two differently typed categorical columns? |
There's already a ClassLabel type that does pretty much the same thing (store as integer instead of string) if it can help
Yea we do concatenation quite often (e.g. in map) so I don't think this is a viable option |
But how often in the cases where concatenation is done now would the categorical label vocabulary actually change? I think it would be in basically none of them. And in such cases, concatenation remains very easy, no?
|
Arrow IPC seems to require unified dictionaries anyway, so actually we could surely focus only on this use case indeed @mmcdermott. So defining a new Feature type could be the way to go. Right now we have little bandwidth to work on this kind of thing though |
Feature request
Hugging Face datasets does not seem to support categorical / dictionary data types for Parquet as of now. There is a
TODO
in the code for this feature but no implementation yet. Below you can find sample code to reproduce the error that is currently thrown when attempting to read a Parquet file with categorical columns:
Motivation
Categorical data types, as offered by Pandas and implemented with the
DictionaryType
dtype in pyarrow, can significantly reduce dataset size and are a handy way to turn textual features into numerical representations and back. Lack of support in Hugging Face datasets greatly reduces compatibility with a common Pandas / Parquet feature.
Your contribution
I could provide a PR. However, it would be nice to have an initial complexity estimate from one of the core developers first.