Support categorical data types for Parquet #5706
Comments
Hi ! We could definitely add a type that holds the categories and uses DictionaryType storage. There's a ClassLabel type that is similar, with a 'names' parameter (similar to id2label in deep learning frameworks), that uses an integer array as storage. It can be added in
|
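For reference, here is a minimal pure-Python sketch of the ClassLabel-style mapping described above. The class name `TinyClassLabel` is illustrative only, not part of `datasets`; it just mirrors the `str2int`/`int2str` behavior that `ClassLabel` provides.

```python
# A hypothetical, minimal sketch of what a ClassLabel-style mapping does:
# store an integer per value plus a shared list of category names.
class TinyClassLabel:
    def __init__(self, names):
        self.names = list(names)
        self._str2int = {n: i for i, n in enumerate(self.names)}

    def str2int(self, value: str) -> int:
        # Encode a string label to its integer index for storage.
        return self._str2int[value]

    def int2str(self, index: int) -> str:
        # Decode an integer index back to its string label.
        return self.names[index]

labels = TinyClassLabel(["cat", "dog", "bird"])
assert labels.str2int("dog") == 1
assert labels.int2str(2) == "bird"
```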
@kklemon did you implement this? Otherwise I would like to give it a try |
@mhattingpete no, I haven't had time for this so far. Feel free to work on this. |
#self-assign |
Hi, this is a really useful feature, has this been implemented yet? |
Hey folks -- I'm thinking about trying a PR for this. As far as I can tell, the only sticking point is that auto-generation of features from a pyarrow schema will fail under the current implementation. I see two ways to solve this. Option 1 is to require datasets with categorical types to use pyarrow schema metadata to encode the entire HF feature dictionary; that way categorical types never need to be inferred from the pa type alone. The downside is that these datasets will be a bit brittle: if the feature encoding API ever changes, they will suddenly become unloadable. The other option is to modify the feature-inference logic so that categorical features can be recovered from the pa dictionary type directly. Does anyone at HF have a preference between these two (or alternate) approaches? |
Maybe we don't need to store the string label -> int map in the categorical for the corresponding datasets feature ? |
I think that does need to be stored in the Feature object. Similar to how `ClassLabel` needs the class names for some of the provided default functionality (e.g., encoding or decoding values), a categorical feature needs the same. Without storing that information, would you suggest that categorical features just be stored internally as integer arrays?
|
Well IIRC you can concatenate two Arrow arrays with different dictionaries together. But for datasets this would mean updating the datasets features when concatenating two arrays of the same type, which is not supported right now. That's why it would be nice if there were a way to do it without storing the mapping in the feature object. For decoding we do have the string<->integer mapping from the array dictionary attribute, so we're fine. For encoding, I think it can work if we only encode when converting python objects to pyarrow in TypedSequence.__arrow_array__ in arrow_writer.py: convert the python objects to a pyarrow array and then use the dictionary_encode method. Another concern about concatenation: I noticed pyarrow creates the new dictionary and indices in memory when concatenating two dictionary-encoded arrays. This can be a problem for big datasets, and we should probably use ChunkedArray objects instead. This can surely be taken care of in array_concat in table.py. cc @mariosasko in case you have other ideas |
Hmm, that is a good point. What if we implemented this feature first in a manner that didn't allow concatenation of arrays with different index-to-category maps? Then concatenation would be very straightforward, and I think this is reasonable if the index-to-category map is stored in the schema as well. Obviously, this limits how folks could use the feature, but they can always fall back to raw strings if needed, and as usage increases we'll have more data to see what the right solution here is.
|
@lhoestq @mariosasko just re-pinging on this so I can push forward here. What are your thoughts on disallowing concatenation of categorical arrays for now, such that the index-to-category map can be stored in the schema metadata? And/or what other approaches should be taken here? |
I think the easiest for now would be to have datasets cast dictionary-encoded columns to strings when loading |
I mean, that would certainly be easiest, but I don't think it really solves this issue in a meaningful way. It just shifts the burden of string conversion from the user to HF Datasets, without actually enabling HF Datasets to take advantage of the (very significant) storage and associated speed/memory savings offered by categorical types. Given that those savings are what is of real interest here, I think keeping it explicit that this is not supported (and forcing the user to do the conversion) might actually be better, so that the problem stays top of mind. Is there an objection to supporting categorical types explicitly in the manner I outlined above, where an error is raised if you try to concat two differently typed categorical columns? |
There's already a ClassLabel type that does pretty much the same thing (store as integer instead of string) if it can help
Yea we do concatenation quite often (e.g. in map) so I don't think this is a viable option |
But how often in the cases where concatenation is done now would the categorical label vocabulary actually change? I think it would be in basically none of them. And in such cases, concatenation remains very easy, no?
|
Arrow IPC seems to require unified dictionaries anyway, so actually we could surely focus only on this use case indeed @mmcdermott. So defining a new Feature type could be the way to go. Right now we have little bandwidth to work on this kind of thing though |
Feature request
Hugging Face datasets does not seem to support categorical / dictionary data types for Parquet as of now. There is a
TODO
in the code for this feature but no implementation yet. Below you can find sample code to reproduce the error that is currently thrown when attempting to read a Parquet file with categorical columns:
Motivation
Categorical data types, as offered by Pandas and implemented with the
DictionaryType
dtype in pyarrow, can significantly reduce dataset size and are a handy way to turn textual features into numerical representations and back. Lack of support in Hugging Face datasets greatly reduces compatibility with a common Pandas / Parquet feature.
Your contribution
I could provide a PR. However, it would be nice to have an initial complexity estimate from one of the core developers first.