Trying to create Datalab object with label set to a dtype of 'category' but getting 'NotImplementedError' #1070

Open
mturk24 opened this issue Mar 28, 2024 · 0 comments
Labels
bug (Something isn't working), help-wanted (We need your help to add this, but it may be more challenging than a "good first issue")

mturk24 commented Mar 28, 2024

When trying to run the following code:


from cleanlab import Datalab

lab = Datalab(data=full_df, label_name="noisy_letter_grade", task="classification")
lab.find_issues(features=features_df.to_numpy(), issue_types={"near_duplicate": {}, "non_iid": {}})
lab.report(show_summary_score=True, show_all_issues=True)

Here the noisy_letter_grade column has dtype category rather than object, and the Datalab constructor raises the NotImplementedError shown in the stack trace section below.

Should we support category dtypes for label columns passed into Datalab? Alternatively, could we raise a clearer error message that tells the user to change their label dtype?
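To make the two options concrete, here is a minimal sketch using only pandas. All three helper names (`categorical_columns`, `check_no_categorical`, `cast_categorical_to_object`) are hypothetical and are not part of cleanlab, and the toy `full_df` with its `exam_1` column only stands in for the real data. Judging from the stack trace, the failure happens while the whole DataFrame is converted, so the sketch assumes any category column may be affected, not only the label.

```python
import pandas as pd


def categorical_columns(df: pd.DataFrame) -> list:
    """Return the names of all columns whose dtype is pandas 'category'."""
    return list(df.select_dtypes(include="category").columns)


def check_no_categorical(df: pd.DataFrame) -> None:
    """Hypothetical validation: fail early with an actionable message."""
    cols = categorical_columns(df)
    if cols:
        raise TypeError(
            f"Columns {cols} have dtype 'category', which is not supported. "
            "Convert them first, e.g. df[col] = df[col].astype(object)."
        )


def cast_categorical_to_object(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical workaround: return a copy with category columns cast to object."""
    out = df.copy()
    for col in categorical_columns(out):
        out[col] = out[col].astype(object)
    return out


# Toy stand-in for the issue's full_df (assumed data, not from the report)
full_df = pd.DataFrame(
    {
        "exam_1": [90, 72, 55],
        "noisy_letter_grade": pd.Categorical(["A", "C", "F"]),
    }
)
clean_df = cast_categorical_to_object(full_df)
```

With a frame like `clean_df`, the original call `Datalab(data=clean_df, label_name="noisy_letter_grade", task="classification")` no longer hands a category column to the conversion step. Whether that fully resolves the error has not been verified here.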

Stack trace

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[99], line 1
----> 1 lab = Datalab(data=full_df, label_name="noisy_letter_grade", task="classification")
      2 lab.find_issues(features=full_df.drop("noisy_letter_grade", axis=1).to_numpy(), issue_types={"near_duplicate": {}, "non_iid": {}})
      3 lab.report(show_summary_score=True, show_all_issues=True)

File ~/mturk-work/mturk-env/lib/python3.11/site-packages/cleanlab/datalab/datalab.py:123, in Datalab.__init__(self, data, task, label_name, image_key, verbosity)
    112 def __init__(
    113     self,
    114     data: "DatasetLike",
   (...)
    120     # Assume continuous values of labels for regression task
    121     # Map labels to integers for classification task
    122     self.task = Task.from_str(task)
--> 123     self._data = Data(data, self.task, label_name)
    124     self.data = self._data._data
    125     self._labels = self._data.labels

File ~/mturk-work/mturk-env/lib/python3.11/site-packages/cleanlab/datalab/internal/data.py:155, in Data.__init__(self, data, task, label_name)
    148 def __init__(
    149     self,
    150     data: "DatasetLike",
    151     task: Task,
    152     label_name: Optional[str] = None,
    153 ) -> None:
    154     self._validate_data(data)
--> 155     self._data = self._load_data(data)
    156     self._data_hash = hash(self._data)
    157     self.labels: Label

File ~/mturk-work/mturk-env/lib/python3.11/site-packages/cleanlab/datalab/internal/data.py:174, in Data._load_data(self, data)
    172 if not isinstance(data, tuple(dataset_factory_map.keys())):
    173     raise DataFormatError(data)
--> 174 return dataset_factory_map[type(data)](data)

File ~/mturk-work/mturk-env/lib/python3.11/site-packages/datasets/arrow_dataset.py:869, in Dataset.from_pandas(cls, df, features, info, split, preserve_index)
    865 if features is not None:
    866     # more expensive cast than InMemoryTable.from_pandas(..., schema=features.arrow_schema)
    867     # needed to support the str to Audio conversion for instance
    868     table = table.cast(features.arrow_schema)
--> 869 return cls(table, info=info, split=split)

File ~/mturk-work/mturk-env/lib/python3.11/site-packages/datasets/arrow_dataset.py:694, in Dataset.__init__(self, arrow_table, info, split, indices_table, fingerprint)
    691         self._fingerprint = metadata["fingerprint"]
    693 # Infer features if None
--> 694 inferred_features = Features.from_arrow_schema(arrow_table.schema)
    695 if self.info.features is None:
    696     self.info.features = inferred_features

File ~/mturk-work/mturk-env/lib/python3.11/site-packages/datasets/features/features.py:1679, in Features.from_arrow_schema(cls, pa_schema)
   1677         metadata_features = Features.from_dict(metadata["info"]["features"])
   1678 metadata_features_schema = metadata_features.arrow_schema
-> 1679 obj = {
   1680     field.name: (
   1681         metadata_features[field.name]
   1682         if field.name in metadata_features and metadata_features_schema.field(field.name) == field
   1683         else generate_from_arrow_type(field.type)
   1684     )
   1685     for field in pa_schema
   1686 }
   1687 return cls(**obj)

File ~/mturk-work/mturk-env/lib/python3.11/site-packages/datasets/features/features.py:1683, in <dictcomp>(.0)
   1677         metadata_features = Features.from_dict(metadata["info"]["features"])
   1678 metadata_features_schema = metadata_features.arrow_schema
   1679 obj = {
   1680     field.name: (
   1681         metadata_features[field.name]
   1682         if field.name in metadata_features and metadata_features_schema.field(field.name) == field
-> 1683         else generate_from_arrow_type(field.type)
   1684     )
   1685     for field in pa_schema
   1686 }
   1687 return cls(**obj)

File ~/mturk-work/mturk-env/lib/python3.11/site-packages/datasets/features/features.py:1395, in generate_from_arrow_type(pa_type)
   1393     return array_feature(shape=pa_type.shape, dtype=pa_type.value_type)
   1394 elif isinstance(pa_type, pa.DictionaryType):
-> 1395     raise NotImplementedError  # TODO(thom) this will need access to the dictionary as well (for labels). I.e. to the py_table
   1396 elif isinstance(pa_type, pa.DataType):
   1397     return Value(dtype=_arrow_to_datasets_dtype(pa_type))

NotImplementedError:

Additional information

Cleanlab version: 2.6.1

mturk24 added the needs triage, bug, and low priority labels on Mar 28, 2024
jwmueller added the help-wanted label and removed the needs triage label on Apr 11, 2024