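LoadDataFromDisk

Loads a dataset that was previously saved to disk.

If you previously saved your dataset using the save_to_disk method, or Distiset.save_to_disk, you can load it again with this class to build a new pipeline.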

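Attributes

dataset_path: The path to the dataset or distiset.
split: The split of the dataset to load (typically train, test or validation).
config: The configuration of the dataset to load. Defaults to default. If the dataset has multiple configurations, this must be supplied or an error is raised.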
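
The rule for config described above can be sketched in plain Python. This is only an illustration of the stated behaviour, not distilabel's actual implementation: the helper name resolve_config and the configuration name leaf_step_2 are hypothetical, and treating a single available configuration as the implicit choice is an assumption.

```python
from typing import Optional, Sequence


def resolve_config(available: Sequence[str], requested: Optional[str] = None) -> str:
    """Pick the configuration to load (hypothetical helper, not part of distilabel)."""
    if not available:
        raise ValueError("the dataset has no configurations")
    if requested is None:
        # Assumption: a single configuration is unambiguous and can be used implicitly.
        if len(available) == 1:
            return available[0]
        raise ValueError(
            f"multiple configurations found {list(available)}; pass `config` explicitly"
        )
    if requested not in available:
        raise ValueError(
            f"unknown configuration {requested!r}; choose from {list(available)}"
        )
    return requested


# One configuration: nothing needs to be specified.
single = resolve_config(["default"])

# Several configurations: the caller has to name one.
chosen = resolve_config(["leaf_step_1", "leaf_step_2"], requested="leaf_step_1")
```

Calling resolve_config(["leaf_step_1", "leaf_step_2"]) without a requested name raises ValueError in this sketch, which mirrors the error the description mentions for datasets with multiple configurations.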

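Runtime parameters

batch_size: The batch size to use when processing the data.
dataset_path: The path to the dataset or distiset.
is_distiset: Whether the dataset to load is a Distiset. Defaults to False.
split: The split of the dataset to load. Defaults to 'train'.
config: The configuration of the dataset to load. Defaults to default. If the dataset has multiple configurations, this must be supplied or an error is raised.
num_examples: The number of examples to load from the dataset. By default, all examples are loaded.
storage_options: Key/value pairs to pass to the file-system backend, if any. Defaults to None.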
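
To make the interaction between batch_size and num_examples concrete, here is a minimal pure-Python sketch that yields (batch, is_last_batch) tuples of the same shape as the results shown in the examples on this page. It is an assumption-laden illustration rather than the step's real code: iter_batches is a hypothetical helper, and truncating to the first num_examples rows before batching is assumed.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

Row = Dict[str, Any]


def iter_batches(
    rows: List[Row],
    batch_size: int,
    num_examples: Optional[int] = None,
) -> Iterator[Tuple[List[Row], bool]]:
    """Yield (batch, is_last_batch) tuples (hypothetical helper, not distilabel code)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    # Assumption: `num_examples` keeps only the first N rows; None keeps everything.
    if num_examples is not None:
        rows = rows[:num_examples]
    for start in range(0, len(rows), batch_size):
        batch = rows[start : start + batch_size]
        # The flag is True only for the batch that reaches the end of the data.
        is_last_batch = start + batch_size >= len(rows)
        yield batch, is_last_batch


rows = [{"a": i} for i in range(1, 8)]  # seven rows: a=1 .. a=7
batches = list(iter_batches(rows, batch_size=3, num_examples=5))
# batches == [
#     ([{'a': 1}, {'a': 2}, {'a': 3}], False),
#     ([{'a': 4}, {'a': 5}], True),
# ]
```

With seven rows, num_examples=5 and batch_size=3, the sketch produces one full batch followed by a shorter final batch whose flag is True.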

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph New columns
            OCOL0[dynamic]
        end
    end

    subgraph LoadDataFromDisk
        StepOutput[Output Columns: dynamic]
    end

    StepOutput --> OCOL0

Outputs

dynamic (all): The columns generated by this step, which depend on the dataset loaded from disk.

Examples

Load data from a Hugging Face Dataset

from distilabel.steps import LoadDataFromDisk

loader = LoadDataFromDisk(dataset_path="path/to/dataset")
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)

Load data from a distilabel Distiset

from distilabel.steps import LoadDataFromDisk

# Specify the configuration to load.
loader = LoadDataFromDisk(
    dataset_path="path/to/dataset",
    is_distiset=True,
    config="leaf_step_1"
)
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'a': 1}, {'a': 2}, {'a': 3}], True)

Load data from a Hugging Face Dataset or Distiset in your cloud provider

from distilabel.steps import LoadDataFromDisk

loader = LoadDataFromDisk(
    dataset_path="gcs://path/to/dataset",
    storage_options={"project": "experiments-0001"}
)
loader.load()

# Just like we saw with LoadDataFromDicts, the `process` method will yield batches.
result = next(loader.process())
# >>> result
# ([{'type': 'function', 'function':...', False)