Efficient data loading solution for large datasets Milestone · GitHub

No due date · 66% complete

Enable an efficient data loading solution for LLM training.

Short term:

Enable the Alpaca dataset to produce the correct data on each rank (correct sharding under data parallelism).
Make sure it is performant when training on a cluster.
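
As a rough illustration of what "correct data on each rank" could mean, the sketch below partitions sample indices across ranks in pure Python. It is not the project's implementation; the function name `shard_indices` and its parameters are hypothetical. The approach (identical seeded shuffle on every rank, wrap-around padding, strided slice) is similar in spirit to what PyTorch's `torch.utils.data.DistributedSampler` does.

```python
import random


def shard_indices(num_samples, world_size, rank, seed=0, epoch=0):
    """Return the sample indices that `rank` should read in this epoch.

    Hypothetical helper for illustration only.
    """
    indices = list(range(num_samples))
    # Every rank seeds the shuffle identically, so all ranks agree on the order.
    random.Random(seed + epoch).shuffle(indices)
    # Pad by wrapping around so that every rank gets the same number of samples.
    per_rank = -(-num_samples // world_size)  # ceiling division
    total = per_rank * world_size
    padding = total - num_samples
    indices += (indices * world_size)[:padding]
    # Strided slice: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total:world_size]


if __name__ == "__main__":
    for r in range(4):
        print(r, shard_indices(num_samples=10, world_size=4, rank=r))
```

Each rank receives the same number of indices, and together the ranks cover the whole dataset. A few samples are repeated when the dataset size is not divisible by the number of ranks.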

Long term:

For datasets too large to load fully into CPU memory, use an iterable/streaming dataset.
For iterable/streaming datasets, build indices and a sampler so that each rank loads the correct data.
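
One possible shape for the long-term items, sketched under assumptions: the data is stored as a JSONL file (one JSON record per line), the index is a list of byte offsets, and sharding is a strided slice over that index. The names `build_offset_index` and `stream_shard` are hypothetical and are not part of any existing library.

```python
import json


def build_offset_index(path):
    """Scan a JSONL file once and record the byte offset of every non-empty line."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:  # end of file
                break
            if line.strip():
                offsets.append(pos)
    return offsets


def stream_shard(path, offsets, world_size, rank):
    """Lazily yield only the records assigned to `rank`.

    Only the offset index is held in memory; records are read on demand.
    """
    with open(path, "rb") as f:
        for pos in offsets[rank::world_size]:
            f.seek(pos)
            yield json.loads(f.readline())
```

With this layout, shards can differ in length by one record when the record count is not divisible by the number of ranks. Data-parallel training usually needs every rank to run the same number of steps, so a real implementation would likely pad or drop the remainder.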