A problem that occurred in Preprocessing dataset #383

Open
yuhkalhic opened this issue Feb 28, 2024 · 0 comments

yuhkalhic commented Feb 28, 2024

I'm a computer novice. When fine-tuning on the google/civil_comments dataset, I implemented another preprocessing function modeled after get_preprocessed_samsum, but I keep running into problems.

Can you help me figure out what's wrong? Here is my code:

import datasets

def calculate_sentiment(row):
    scores = {
        'toxicity': row['toxicity'],
        'severe_toxicity': row['severe_toxicity'],
        'obscene': row['obscene'],
        'threat': row['threat'],
        'insult': row['insult'],
        'identity_attack': row['identity_attack'],
        'sexual_explicit': row['sexual_explicit']
    }
    max_category, max_score = max(scores.items(), key=lambda item: item[1])
    return max_category if max_score > 0.5 else "none"


def add_sentiment(example):
    example['sentiment'] = calculate_sentiment(example)
    return example


def remove_none_sentiment(example):
    return example['sentiment'] != "none"


def apply_prompt_template(example, tokenizer):
    prompt_template = "Analyze the sentiment of this sentence:\n{text}\n---\nSentiment:\n"
    prompt = tokenizer.encode(tokenizer.bos_token + prompt_template, add_special_tokens=False)
    sentiment = tokenizer.encode(example["sentiment"] + tokenizer.eos_token, add_special_tokens=False)

    return {
        "input_ids": prompt + sentiment,
        "attention_mask": [1] * (len(prompt) + len(sentiment)),
        "labels": [-100] * len(prompt) + sentiment,
    }


def get_preprocessed_civil_comments(dataset_config, tokenizer, split):
    dataset = datasets.load_dataset("civil_comments", split=split)

    dataset = dataset.map(add_sentiment)
    dataset = dataset.filter(remove_none_sentiment)

    dataset = dataset.map(lambda example: apply_prompt_template(example, tokenizer), batched=True)

    return dataset
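
For reference, the samsum preprocessing function I modeled this on looks roughly like the following (I am writing it out from memory and slightly simplified, so it may not match the llama-recipes source exactly):

import datasets

def get_preprocessed_samsum(dataset_config, tokenizer, split):
    dataset = datasets.load_dataset("samsum", split=split)

    prompt = "Summarize this dialog:\n{dialog}\n---\nSummary:\n"

    def apply_prompt_template(sample):
        # Build the text prompt/target pair for one dialog.
        return {
            "prompt": prompt.format(dialog=sample["dialogue"]),
            "summary": sample["summary"],
        }

    dataset = dataset.map(apply_prompt_template, remove_columns=list(dataset.features))

    def tokenize_add_label(sample):
        # Tokenize prompt and target, mask the prompt tokens in the labels.
        prompt_ids = tokenizer.encode(tokenizer.bos_token + sample["prompt"], add_special_tokens=False)
        summary_ids = tokenizer.encode(sample["summary"] + tokenizer.eos_token, add_special_tokens=False)

        return {
            "input_ids": prompt_ids + summary_ids,
            "attention_mask": [1] * (len(prompt_ids) + len(summary_ids)),
            "labels": [-100] * len(prompt_ids) + summary_ids,
        }

    dataset = dataset.map(tokenize_add_label, remove_columns=list(dataset.features))

    return dataset

And here is the error I get when I run finetuning: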

Traceback (most recent call last):
  File "examples/finetuning.py", line 8, in <module>
    fire.Fire(main)
  ...
  File "/data/llama-recipes/src/llama_recipes/data/concatenator.py", line 24, in <dictcomp>
    buffer = {k: v + sample[k] for k,v in buffer.items()}
KeyError: 'input_ids'

[2024-02-28 23:34:18,646] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 60594) of binary: /data/miniconda3/envs/dachuang/bin/python
Traceback (most recent call last):
  File "/data/miniconda3/envs/dachuang/bin/torchrun", line 8, in <module>
    sys.exit(main())
  ...
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

examples/finetuning.py FAILED
...
Root Cause (first observed failure):
[0]:
time : 2024-02-28_23:34:18
host : amax
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 60594)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
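
My own guess (I am really not sure about this) is that my last .map call uses batched=True with a function written for single examples, and that I never pass remove_columns, so the samples that reach concatenator.py may not contain an input_ids key at all. I also never fill the {text} placeholder in the prompt. A sketch of what I think the corrected function would look like, assuming that guess is right:

import datasets

def get_preprocessed_civil_comments(dataset_config, tokenizer, split):
    dataset = datasets.load_dataset("civil_comments", split=split)

    # Label every comment with its dominant toxicity category and drop the "none" rows.
    dataset = dataset.map(add_sentiment)
    dataset = dataset.filter(remove_none_sentiment)

    prompt_template = "Analyze the sentiment of this sentence:\n{text}\n---\nSentiment:\n"

    def tokenize_add_label(example):
        # Fill the {text} placeholder with the actual comment before tokenizing.
        prompt = tokenizer.encode(
            tokenizer.bos_token + prompt_template.format(text=example["text"]),
            add_special_tokens=False,
        )
        sentiment = tokenizer.encode(
            example["sentiment"] + tokenizer.eos_token,
            add_special_tokens=False,
        )
        return {
            "input_ids": prompt + sentiment,
            "attention_mask": [1] * (len(prompt) + len(sentiment)),
            "labels": [-100] * len(prompt) + sentiment,
        }

    # Per-example (not batched), and drop the original columns so only
    # input_ids / attention_mask / labels are left for the concatenator.
    dataset = dataset.map(tokenize_add_label, remove_columns=list(dataset.features))

    return dataset

Is that the right direction, or is the problem somewhere else?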
