Efficient data loading solution for large datasets

No due date, 66% complete

Enable an efficient data loading solution for LLM training.

Short term:

  • Enable the Alpaca dataset to produce the correct data on each rank (working data-parallel sharding)
  • Make sure it is performant when training on a cluster
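
To make "correct data on each rank" concrete, here is a minimal sketch of rank-based index sharding in plain Python. It is an illustration only, not this project's implementation: the function name `shard_indices` and its parameters are hypothetical, and the wrap-around padding is an assumption, similar in spirit to what PyTorch's `DistributedSampler` does by default so that every rank gets the same number of samples.

```python
import math
import random


def shard_indices(dataset_len, rank, world_size, seed=0, epoch=0, shuffle=True):
    """Return the dataset indices that one data-parallel rank should read.

    Hypothetical helper for illustration. Every rank must call it with the
    same seed and epoch so that all ranks agree on the shuffled order.
    """
    if dataset_len <= 0:
        raise ValueError("dataset_len must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")

    indices = list(range(dataset_len))
    if shuffle:
        # Same seed on every rank gives the same permutation everywhere.
        random.Random(seed + epoch).shuffle(indices)

    # Pad by wrapping around so the total divides evenly across ranks
    # (assumption: repeating a few samples is acceptable).
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    repeats = math.ceil(total / dataset_len)
    indices = (indices * repeats)[:total]

    # Strided slice: rank r takes positions r, r + world_size, ...
    return indices[rank:total:world_size]


# Usage: 10 samples over 4 ranks gives 3 indices per rank.
shards = [shard_indices(10, r, 4, shuffle=False) for r in range(4)]
```

Passing a different `epoch` each epoch reshuffles the data while keeping the shards consistent across ranks, since the permutation depends only on `seed + epoch`.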

Long term:

  • For datasets too large to load entirely into CPU memory, use an iterable/streaming dataset
  • For iterable/streaming datasets, build indices and a sampler to load data correctly