How to use Huggingface Datasets? #37

ms337 · 2024-05-15T18:35:20Z

How can we train on custom Huggingface datasets with this?

I could not find the code that takes the HF dataset output by pretokenize.py and converts it to the sharded dataset.

But maybe I'm missing something!

Thanks for this project btw!

zanussbaum · 2024-05-16T22:05:25Z

Hey, sorry for the delay @ms337! The pretokenize.py script is only used for MLM pretraining. If you have a dataset you want to train with an MLM objective, you can sub it here

If you are trying to do contrastive training, the dataset needs to be in a slightly different format: roughly a gzipped JSONL file where each line looks something like

{"query": "what is the capital of france?", "document": "the capital of france is paris", "negatives": ["list of hard negatives"], "metadata": {"objective": {"self": [], "paired": [["query", "document"]], "triplet": [["query", "document", "negatives"]]}}}

The metadata field defines which columns to look for under each objective. The objective itself is eventually set in the data config (e.g. https://github.com/nomic-ai/contrastors/blob/main/src/contrastors/configs/data/finetune_triplets.yaml#L15)
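
For anyone starting from their own data, here is a minimal sketch of writing rows into that gzipped JSONL shape using only the Python standard library. It is not code from this repo: the helper names `write_contrastive_jsonl` and `read_contrastive_jsonl` are hypothetical, and it assumes your rows are dicts (for example, what you get when iterating over a Hugging Face dataset) that already have `query`, `document`, and optionally `negatives` columns. It writes a single file and does not cover sharding or the data config.

```python
import gzip
import json

# Column layout copied from the example line above; adjust it if your
# columns have different names. This constant is an illustrative assumption.
METADATA = {
    "objective": {
        "self": [],
        "paired": [["query", "document"]],
        "triplet": [["query", "document", "negatives"]],
    }
}


def write_contrastive_jsonl(rows, path):
    """Write dict-like rows to a gzipped JSONL file, one JSON object per line.

    Hypothetical helper, not part of the repo. Returns the number of rows written.
    """
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            record = {
                "query": row["query"],
                "document": row["document"],
                # Rows without hard negatives get an empty list.
                "negatives": list(row.get("negatives", [])),
                "metadata": METADATA,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_contrastive_jsonl(path):
    """Read the file back into a list of dicts, useful as a sanity check."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reading the file back with `read_contrastive_jsonl` before pointing a data config at it is a cheap way to confirm that every line parses as valid JSON and carries the `metadata` field.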
