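Using the Distiset dataset object

A Pipeline in `distilabel` returns a special type of Hugging Face datasets.DatasetDict called a Distiset.

The Distiset is a dictionary-like object that holds the configurations generated by the Pipeline: one configuration for each leaf step in the DAG built by the Pipeline. Each configuration is a separate subset of the dataset. The concept comes from 🤗 datasets, which lets you upload several configurations of the same dataset to a single repository, each with its own columns, so the whole Distiset can be pushed seamlessly to the Hugging Face Hub.

Below is an example of how to create a Distiset object, which resembles a datasets.DatasetDict: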

from datasets import Dataset
from distilabel.distiset import Distiset

distiset = Distiset(
    {
        "leaf_step_1": Dataset.from_dict({"instruction": [1, 2, 3]}),
        "leaf_step_2": Dataset.from_dict(
            {"instruction": [1, 2, 3, 4], "generation": [5, 6, 7, 8]}
        ),
    }
)

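Note

If there is only one leaf node, i.e. only one step at the end of the Pipeline, the configuration name is not the name of that last step. It is set to "default" instead, which is more in line with standard datasets on the Hugging Face Hub.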
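
The naming rule in the note can be sketched in plain Python. This is an illustration only, not distilabel's implementation: `config_names` is a hypothetical helper that takes the names of the leaf steps and returns the configuration names you would expect to find in the resulting Distiset.

```python
def config_names(leaf_step_names):
    """Return the expected configuration names for the given leaf steps.

    Hypothetical helper: it only mirrors the rule described in the note.
    """
    names = list(leaf_step_names)
    # A single leaf step is exposed as "default" instead of its step name.
    if len(names) == 1:
        return ["default"]
    # With several leaf steps, each one keeps its own name.
    return names


print(config_names(["text_generation"]))
print(config_names(["leaf_step_1", "leaf_step_2"]))
```

The step name `text_generation` is only a placeholder for whatever the last step of your pipeline is called.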

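Distiset methods

We can interact with the different pieces generated by the Pipeline and treat them as different configurations. This section covers two Distiset methods: train_test_split and push_to_hub.

Train/Test split

Create a train/test split of the dataset for each configuration or subset.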

>>> distiset.train_test_split(train_size=0.9)
Distiset({
    leaf_step_1: DatasetDict({
        train: Dataset({
            features: ['instruction'],
            num_rows: 2
        })
        test: Dataset({
            features: ['instruction'],
            num_rows: 1
        })
    })
    leaf_step_2: DatasetDict({
        train: Dataset({
            features: ['instruction', 'generation'],
            num_rows: 3
        })
        test: Dataset({
            features: ['instruction', 'generation'],
            num_rows: 1
        })
    })
})
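
The row counts above follow from how a fractional train_size is turned into whole rows. A minimal sketch, assuming the train split is rounded down and the test split takes the remaining rows (an assumption that is consistent with the output shown here). `split_sizes` is a hypothetical helper, not part of distilabel or datasets.

```python
import math


def split_sizes(num_rows, train_size):
    """Return (train_rows, test_rows) for a fractional train_size."""
    # Assumption: the train split is rounded down to a whole number of rows.
    n_train = math.floor(train_size * num_rows)
    # The test split receives whatever is left.
    n_test = num_rows - n_train
    return n_train, n_test


print(split_sizes(3, 0.9))  # leaf_step_1 has 3 rows -> (2, 1)
print(split_sizes(4, 0.9))  # leaf_step_2 has 4 rows -> (3, 1)
```

With very small subsets the effective ratio can therefore differ noticeably from the requested one: 3 rows at train_size=0.9 end up split 2 to 1.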

Push to Hugging Face Hub

Push the Distiset to a Hugging Face repository, where each subset corresponds to a different configuration:

import os

distiset.push_to_hub(
    "my-org/my-dataset",
    commit_message="Initial commit",
    private=False,
    token=os.getenv("HF_TOKEN"),
    generate_card=True,
    include_script=False
)

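New in version 1.3.0

Since version 1.3.0 you can automatically push the script that created your pipeline to the same repository. For example, assume you have a file like the following: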

sample_pipe.py

from distilabel.pipeline import Pipeline

with Pipeline() as pipe:
    ...
distiset = pipe.run()
distiset.push_to_hub(
    "my-org/my-dataset",
    include_script=True
)

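After running the script, the file sample_pipe.py is stored in the repository, which makes it easier to share your pipeline with the community.

Custom Docstrings

distilabel contains a custom plugin that automatically generates a gallery of its components. The information is extracted by parsing each Step's docstring. To see how docstrings are rendered, compare the docstring in the source code of UltraFeedback with its entry in the components gallery.

If you create your own components and want their Citations rendered automatically in the README card (in case you share the final distiset on the Hugging Face Hub), add a Citations section to the docstring. This is an example from the MagpieGenerator task: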

class MagpieGenerator(GeneratorTask, MagpieBase):
    r"""Generator task that generates instructions or conversations using Magpie.
    ...

    Citations:

        ```
        @misc{xu2024magpiealignmentdatasynthesis,
            title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
            author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
            year={2024},
            eprint={2406.08464},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2406.08464},
        }
        ```
    """

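The Citations section can include any number of bibtex references. Add as many entries as you need, following the example: each citation is a block of the form ```@misc{...}```. If you call distiset.push_to_hub, this information is used automatically in the README of your Distiset. Alternatively, if no Citations section is found but the References section contains URLs pointing to https://arxiv.org/, distilabel will try to obtain the equivalent Bibtex automatically. This way, Hugging Face can track the paper automatically, and it becomes easier to find other datasets citing the same paper or to go directly to the paper page.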
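
To make the two lookups concrete, here is a minimal sketch of how fenced bibtex blocks and arXiv URLs could be pulled out of a docstring with regular expressions. It is not distilabel's actual parser: `extract_citations` and `extract_arxiv_urls` are hypothetical helpers, and the sample docstring is a shortened stand-in for the MagpieGenerator one, with an assumed layout for its References section.

```python
import re

FENCE = "`" * 3  # three backticks, built here so the snippet is easy to embed

# Shortened stand-in for a component docstring (layout is an assumption).
DOCSTRING = "\n".join([
    "Example task.",
    "",
    "References:",
    "    - [Magpie](https://arxiv.org/abs/2406.08464)",
    "",
    "Citations:",
    "",
    "    " + FENCE,
    "    @misc{xu2024magpiealignmentdatasynthesis,",
    "        year={2024},",
    "        eprint={2406.08464},",
    "    }",
    "    " + FENCE,
])


def extract_citations(docstring):
    """Return the text of every fenced block that follows 'Citations:'."""
    _, found, tail = docstring.partition("Citations:")
    if not found:
        return []
    pattern = re.escape(FENCE) + r"(.*?)" + re.escape(FENCE)
    blocks = re.findall(pattern, tail, flags=re.DOTALL)
    return [block.strip() for block in blocks]


def extract_arxiv_urls(docstring):
    """Return every URL in the docstring that points to arxiv.org."""
    return re.findall(r"https://arxiv\.org/[^\s)\]}]+", docstring)


citations = extract_citations(DOCSTRING)
print(len(citations))
print(extract_arxiv_urls(DOCSTRING))
```

A fallback in the spirit of the paragraph above would call `extract_arxiv_urls` only when `extract_citations` returns an empty list.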

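Image Datasets

Keep reading if you are interested in image datasets.

The Distiset object has a new method, transform_columns_to_image, specifically to convert images to PIL.Image.Image before pushing the dataset to the Hugging Face Hub.

Since version 1.5.0 there is an ImageGeneration task that can generate images from text. By default, the whole process works internally with a string representation of the images, which keeps processing simple. However, if the generated dataset is going to be stored on the Hugging Face Hub, a proper Image object may be preferable, so that the images can be seen in the dataset viewer. Let's look at the following pipeline, extracted from "examples/image_generation.py" at the root of the repository, to see how to do it: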

# Imports are omitted for brevity; only the relevant parts of the pipeline are shown.
with Pipeline(name="image_generation_pipeline") as pipeline:
    img_generation = ImageGeneration(
        name="flux_schnell",
        llm=InferenceEndpointsImageGeneration(model_id="black-forest-labs/FLUX.1-schnell"),
    )
    ...

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False, dataset=ds)
    # Convert the "image" column to `PIL.Image.Image` before pushing
    distiset = distiset.transform_columns_to_image("image")
    distiset.push_to_hub(...)

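Here transform_columns_to_image is called on the image column we generated. In this case only the image column needs converting, but a list of columns can be passed as well. The conversion is applied to every leaf node in the pipeline, so if there are several subsets, the "image" column is looked up in each of them.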
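
The idea of converting one or more columns across every subset can be sketched without any image library. The snippet below makes two assumptions that the text does not state: that the string representation is base64-encoded bytes, and that a plain dictionary of row lists is an adequate stand-in for a Distiset. `decode_columns` is a hypothetical helper; it returns raw bytes rather than PIL.Image.Image objects.

```python
import base64


def decode_columns(subsets, columns):
    """Decode base64 text in the given columns of every subset.

    `subsets` maps a subset name to a list of row dictionaries; it is a
    plain-Python stand-in for a Distiset, not the real object.
    """
    # Accept a single column name or a list of names.
    if isinstance(columns, str):
        columns = [columns]
    decoded = {}
    for subset_name, rows in subsets.items():
        decoded[subset_name] = [
            {
                key: base64.b64decode(value) if key in columns else value
                for key, value in row.items()
            }
            for row in rows
        ]
    return decoded


# Stand-in for image bytes: the PNG signature, stored as base64 text (assumed).
png_signature = b"\x89PNG\r\n\x1a\n"
image_str = base64.b64encode(png_signature).decode("utf-8")

subsets = {
    "leaf_step_1": [{"prompt": "a cat", "image": image_str}],
    "leaf_step_2": [{"prompt": "a dog", "image": image_str}],
}
result = decode_columns(subsets, "image")
print(type(result["leaf_step_1"][0]["image"]).__name__)
```

Passing `["image"]` instead of `"image"` gives the same result, which mirrors the choice between a single column name and a list of columns described above.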

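Save and load from disk

Note that these methods work like datasets.load_from_disk and datasets.Dataset.save_to_disk, so the arguments are passed directly to those methods. This means you can also use the storage_options argument to save your Distiset to your cloud provider, including the distilabel artifacts (pipeline.yaml, pipeline.log and the README.md with the dataset card). You can read more in the datasets documentation.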

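Save the Distiset to disk. By default the dataset card, the pipeline config file and the logs are saved as well; each of these is controlled by its own argument: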

distiset.save_to_disk(
    "my-dataset",
    save_card=True,
    save_pipeline_config=True,
    save_pipeline_log=True
)

Load a Distiset that was saved with Distiset.save_to_disk in the same way:

distiset = Distiset.load_from_disk("my-dataset")

Load a Distiset from a remote location, such as S3 or GCS. You can pass the storage_options argument to authenticate with the cloud provider:

import os

distiset = Distiset.load_from_disk(
    "s3://path/to/my_dataset",  # gcs:// or any filesystem supported by fsspec
    storage_options={
        "key": os.environ["S3_ACCESS_KEY"],
        "secret": os.environ["S3_SECRET_KEY"],
        # ... any other options your filesystem needs
    }
)

See Distiset.save_to_disk and Distiset.load_from_disk for the remaining arguments.
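
Because `os.environ[...]` raises a bare KeyError when a variable is unset, it can help to assemble the credentials in one place and fail with a clearer message. The following is a sketch under that assumption: `s3_storage_options` is a hypothetical helper, and the environment variable names are the ones used in the example above.

```python
import os


def s3_storage_options(env=None):
    """Build a storage_options dictionary from environment variables."""
    # Fall back to the real environment when no mapping is supplied.
    env = os.environ if env is None else env
    required = ("S3_ACCESS_KEY", "S3_SECRET_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"key": env["S3_ACCESS_KEY"], "secret": env["S3_SECRET_KEY"]}


# An explicit mapping is used here so the example runs without real credentials.
options = s3_storage_options(
    {"S3_ACCESS_KEY": "example-key", "S3_SECRET_KEY": "example-secret"}
)
print(sorted(options))
```

The resulting dictionary can then be passed as the storage_options argument of Distiset.load_from_disk or Distiset.save_to_disk.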

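Dataset card

This special type of dataset comes with an added advantage when calling Distiset.push_to_hub: the dataset card on the Hugging Face Hub is generated automatically. This is enabled by default, but can be disabled by setting generate_card=False: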

distiset.push_to_hub("my-org/my-dataset", generate_card=True)

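The automatically generated dataset card contains some handy information, such as how to reproduce the Pipeline with the CLI, and examples of records from the different subsets.

create_distiset helper

Lastly, the caching section presents the create_distiset function. See that section to learn how to use this helper to create a Distiset from the cache folder, automatically including all the relevant data.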