lab.find_issues(features=features) outputs error for underperforming issue #1065

Closed
sanjanag opened this issue Mar 26, 2024 · 1 comment · Fixed by #1099
Labels
bug Something isn't working help-wanted We need your help to add this, but it may be more challenging than a "good first issue"

Comments

@sanjanag
Member

lab.find_issues(features=features) output

/Users/sanjana/cleanlab_home/fork_cleanlab/cleanlab/datalab/internal/issue_finder.py:457: UserWarning: No labels were provided. The 'label' issue type will not be run.
  warnings.warn("No labels were provided. " "The 'label' issue type will not be run.")
Finding null issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided features ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding underperforming_group issues ...
Error in underperforming_group: UnderperformingGroupIssueManager.find_issues() missing 1 required positional argument: 'pred_probs'
Failed to check for these issue types: [UnderperformingGroupIssueManager]

Audit complete. 984 issues found in the dataset.

Dataset: https://www.kaggle.com/datasets/laotse/credit-risk-dataset/data

Code

import pandas as pd
from cleanlab import Datalab
from sklearn.preprocessing import StandardScaler
import numpy as np

df = pd.read_csv("./credit_risk_dataset.csv")
df = df[~df.isnull().any(axis=1)].copy()
feature_columns = df.columns.to_list()
feature_columns.remove("loan_status")

X_raw = df[feature_columns]
labels = df["loan_status"]

cat_features = [
    "person_home_ownership",
    "loan_intent",
    "loan_grade",
    "cb_person_default_on_file",
]
numeric_features = [
    "person_age",
    "person_income",
    "person_emp_length",
    "loan_amnt",
    "loan_int_rate",
    "loan_percent_income",
    "cb_person_cred_hist_length",
]

X_encoded = pd.get_dummies(X_raw, columns=cat_features, drop_first=True, dtype='float')

scaler = StandardScaler()
X_processed = X_encoded.copy()
X_processed[numeric_features] = scaler.fit_transform(X_encoded[numeric_features])

lab = Datalab({"X": X_processed.to_numpy(), "y": labels})

lab.find_issues(features=X_processed.to_numpy())

@jwmueller added the bug label and removed the needs triage label on Apr 11, 2024
@jwmueller
Member

@elisno It seems the mapping that decides which issue types to run, based on the supplied args, is off. The underperforming_group check should only run if pred_probs is included in the supplied args.
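
To make the intended behavior concrete, here is a minimal sketch of such a mapping. It is not cleanlab's actual implementation: the `REQUIRED_ARGS` table, the `select_issue_types` helper, and the per-check requirements listed in it are all hypothetical and chosen only to illustrate the rule that a check runs only when every argument it needs was supplied.

```python
from typing import Dict, FrozenSet, List, Optional

# Hypothetical table (an assumption, not cleanlab's real mapping):
# each issue type lists the arguments it needs before it can run.
REQUIRED_ARGS: Dict[str, FrozenSet[str]] = {
    "null": frozenset({"features"}),
    "outlier": frozenset({"features"}),
    "near_duplicate": frozenset({"features"}),
    "non_iid": frozenset({"features"}),
    # The check from this issue: it must not run without pred_probs.
    "underperforming_group": frozenset({"features", "pred_probs"}),
}


def select_issue_types(**supplied: Optional[object]) -> List[str]:
    """Return the issue types whose required arguments were all supplied."""
    # Arguments passed as None count as not supplied.
    provided = {name for name, value in supplied.items() if value is not None}
    return [
        issue_type
        for issue_type, needed in REQUIRED_ARGS.items()
        if needed <= provided  # subset test: every needed arg is present
    ]


features_only = select_issue_types(features=[[0.0, 1.0]])
with_pred_probs = select_issue_types(
    features=[[0.0, 1.0]], pred_probs=[[0.9, 0.1]]
)
```

Under these assumptions, `features_only` leaves out `underperforming_group`, while `with_pred_probs` includes it, which is the behavior described above for a call like `lab.find_issues(features=features)`.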

@jwmueller added the help-wanted label on Apr 11, 2024