hey sorry for the delay @ms337! The pretokenize.py script is only used for MLM pretraining. If you have a dataset you want to train with an MLM objective, you can sub it here
If you are trying to do contrastive training, the dataset needs to be in a slightly different format which is roughly a gzipped jsonl file where the lines look something like
{"query": "what is the capital of france?", "document": "the capital of france is paris", "negatives": ["list of hard negatives"], "metadata": {"objective": {"self": [], "paired": [["query", "document"]], "triplet": [["query", "document", "negatives"]]}}}
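For anyone who wants to produce a file in that shape, here is a minimal sketch using only the Python standard library. The field names are copied from the example line above; the helper names (`to_contrastive_record`, `write_jsonl_gz`, `read_jsonl_gz`) and the output path are hypothetical and not part of this repo, and the sketch does not cover any sharding or tokenization the training code may expect.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator, List


def to_contrastive_record(query: str, document: str, negatives: List[str]) -> Dict:
    """Build one record mirroring the example line above (hypothetical helper)."""
    return {
        "query": query,
        "document": document,
        "negatives": negatives,
        "metadata": {
            "objective": {
                "self": [],
                "paired": [["query", "document"]],
                "triplet": [["query", "document", "negatives"]],
            }
        },
    }


def write_jsonl_gz(records: Iterable[Dict], path: str) -> int:
    """Write one JSON object per line to a gzipped file; return the line count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl_gz(path: str) -> Iterator[Dict]:
    """Read the records back, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    rows = [
        to_contrastive_record(
            "what is the capital of france?",
            "the capital of france is paris",
            ["list of hard negatives"],
        )
    ]
    # "train.jsonl.gz" is a placeholder output path.
    print(write_jsonl_gz(rows, "train.jsonl.gz"), "record(s) written")
```

Any iterable of rows works as input, so a Hugging Face dataset iterated row by row could be mapped through `to_contrastive_record` first, assuming it has columns that correspond to the query, document, and negatives.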
How can we train on custom Huggingface datasets with this?
I could not find the code that takes the HF dataset output by pretokenize.py and converts it to the sharded dataset.
But maybe I'm missing something!
Thanks for this project btw!