A problem that occurred in Preprocessing dataset #383

Open
yuhkalhic opened this issue Feb 28, 2024 · 0 comments

yuhkalhic commented Feb 28, 2024

I'm a computer novice. When fine-tuning on the google/civil_comments dataset, I implemented another preprocessing function modeled after get_preprocessed_samsum, but I keep running into problems.

Can you help me figure out what's wrong? Here is my code:

import datasets

def calculate_sentiment(row):
    scores = {
        'toxicity': row['toxicity'],
        'severe_toxicity': row['severe_toxicity'],
        'obscene': row['obscene'],
        'threat': row['threat'],
        'insult': row['insult'],
        'identity_attack': row['identity_attack'],
        'sexual_explicit': row['sexual_explicit']
    }
    max_category, max_score = max(scores.items(), key=lambda item: item[1])
    return max_category if max_score > 0.5 else "none"


def add_sentiment(example):
    example['sentiment'] = calculate_sentiment(example)
    return example


def remove_none_sentiment(example):
    return example['sentiment'] != "none"


def apply_prompt_template(example, tokenizer):
    prompt_template = "Analyze the sentiment of this sentence:\n{text}\n---\nSentiment:\n"
    prompt = tokenizer.encode(tokenizer.bos_token + prompt_template, add_special_tokens=False)
    sentiment = tokenizer.encode(example["sentiment"] + tokenizer.eos_token, add_special_tokens=False)

    return {
        "input_ids": prompt + sentiment,
        "attention_mask": [1] * (len(prompt) + len(sentiment)),
        "labels": [-100] * len(prompt) + sentiment,
    }


def get_preprocessed_civil_comments(dataset_config, tokenizer, split):
    dataset = datasets.load_dataset("civil_comments", split=split)

    dataset = dataset.map(add_sentiment)
    dataset = dataset.filter(remove_none_sentiment)

    dataset = dataset.map(lambda example: apply_prompt_template(example, tokenizer), batched=True)

    return dataset
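
For reference, the samsum preprocessing function I modeled this on looks roughly like the following (I am writing it out from memory and slightly simplified, so it may not match the llama-recipes source exactly):

import datasets

def get_preprocessed_samsum(dataset_config, tokenizer, split):
    dataset = datasets.load_dataset("samsum", split=split)

    prompt = "Summarize this dialog:\n{dialog}\n---\nSummary:\n"

    def apply_prompt_template(sample):
        # Build the text prompt/target pair for one dialog.
        return {
            "prompt": prompt.format(dialog=sample["dialogue"]),
            "summary": sample["summary"],
        }

    dataset = dataset.map(apply_prompt_template, remove_columns=list(dataset.features))

    def tokenize_add_label(sample):
        # Tokenize prompt and target, mask the prompt tokens in the labels.
        prompt_ids = tokenizer.encode(tokenizer.bos_token + sample["prompt"], add_special_tokens=False)
        summary_ids = tokenizer.encode(sample["summary"] + tokenizer.eos_token, add_special_tokens=False)

        return {
            "input_ids": prompt_ids + summary_ids,
            "attention_mask": [1] * (len(prompt_ids) + len(summary_ids)),
            "labels": [-100] * len(prompt_ids) + summary_ids,
        }

    dataset = dataset.map(tokenize_add_label, remove_columns=list(dataset.features))

    return dataset

And here is the error I get when I run finetuning: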

Traceback (most recent call last):
  File "examples/finetuning.py", line 8, in <module>
    fire.Fire(main)
  ...
  File "/data/llama-recipes/src/llama_recipes/data/concatenator.py", line 24, in <dictcomp>
    buffer = {k: v + sample[k] for k,v in buffer.items()}
KeyError: 'input_ids'

[2024-02-28 23:34:18,646] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 60594) of binary: /data/miniconda3/envs/dachuang/bin/python
Traceback (most recent call last):
  File "/data/miniconda3/envs/dachuang/bin/torchrun", line 8, in <module>
    sys.exit(main())
  ...
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

examples/finetuning.py FAILED
...
Root Cause (first observed failure):
[0]:
time : 2024-02-28_23:34:18
host : amax
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 60594)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
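
My own guess (I am really not sure about this) is that my last .map call uses batched=True with a function written for single examples, and that I never pass remove_columns, so the samples that reach concatenator.py may not contain an input_ids key at all. I also never fill the {text} placeholder in the prompt. A sketch of what I think the corrected function would look like, assuming that guess is right:

import datasets

def get_preprocessed_civil_comments(dataset_config, tokenizer, split):
    dataset = datasets.load_dataset("civil_comments", split=split)

    # Label every comment with its dominant toxicity category and drop the "none" rows.
    dataset = dataset.map(add_sentiment)
    dataset = dataset.filter(remove_none_sentiment)

    prompt_template = "Analyze the sentiment of this sentence:\n{text}\n---\nSentiment:\n"

    def tokenize_add_label(example):
        # Fill the {text} placeholder with the actual comment before tokenizing.
        prompt = tokenizer.encode(
            tokenizer.bos_token + prompt_template.format(text=example["text"]),
            add_special_tokens=False,
        )
        sentiment = tokenizer.encode(
            example["sentiment"] + tokenizer.eos_token,
            add_special_tokens=False,
        )
        return {
            "input_ids": prompt + sentiment,
            "attention_mask": [1] * (len(prompt) + len(sentiment)),
            "labels": [-100] * len(prompt) + sentiment,
        }

    # Per-example (not batched), and drop the original columns so only
    # input_ids / attention_mask / labels are left for the concatenator.
    dataset = dataset.map(tokenize_add_label, remove_columns=list(dataset.features))

    return dataset

Is that the right direction, or is the problem somewhere else?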
