
Clean an existing preference dataset


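Getting started

Installing dependencies

To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will use the free but rate-limited Hugging Face serverless Inference API in this tutorial, so we need to install it as an extra distilabel dependency. You can install them by running the following commands: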

!pip install "distilabel[hf-inference-endpoints]"
!pip install "transformers~=4.0" "torch~=2.0"

Let's make the required imports:

import random

from datasets import load_dataset

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    KeepColumns,
    LoadDataFromDicts,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import UltraFeedback

You need an HF_TOKEN to use the HF Inference Endpoints. Log in to use it directly within this notebook.

import os
from huggingface_hub import login

login(token=os.getenv("HF_TOKEN"), add_to_git_credential=True)
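If HF_TOKEN is not set, os.getenv returns None and the login call above receives no token. A minimal sketch of a fail-fast check follows; the helper name require_hf_token is hypothetical and not part of huggingface_hub.

```python
import os


def require_hf_token(env=None):
    """Return the HF_TOKEN value, or raise if it is missing or blank."""
    # `env` defaults to the real environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running the notebook.")
    return token
```

The result could then be passed to the call above as login(token=require_hf_token(), add_to_git_credential=True).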

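(Optional) Deploy Argilla

You can skip this step or replace it with any other data evaluation tool, but the quality of your model will suffer from a lack of data quality, so we do recommend looking at your data. If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla by following this guide.

Along with that, you will need to install Argilla as a distilabel extra.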

!pip install "distilabel[argilla, hf-inference-endpoints]"

The dataset

In this case, we will clean a preference dataset, so we will use the Intel/orca_dpo_pairs dataset from the Hugging Face Hub.

dataset = load_dataset("Intel/orca_dpo_pairs", split="train[:20]")

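Next, we will shuffle the chosen and rejected columns to avoid any bias in the dataset (for example, the chosen response always appearing first).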

def shuffle_and_track(chosen, rejected):
    pair = [chosen, rejected]
    random.shuffle(pair)
    order = ["chosen" if x == chosen else "rejected" for x in pair]
    return {"generations": pair, "order": order}

dataset = dataset.map(lambda x: shuffle_and_track(x["chosen"], x["rejected"]))
dataset = dataset.to_list()
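The function above recovers the labels by comparing each shuffled text with chosen, so a row whose chosen and rejected texts are identical would be labelled "chosen" twice. A minimal sketch of a variant that carries the label with each text is shown here; the name shuffle_with_labels and the rng parameter are illustrative, not part of distilabel.

```python
import random


def shuffle_with_labels(chosen, rejected, rng=random):
    """Shuffle a (chosen, rejected) pair while carrying each label with its text."""
    labelled = [("chosen", chosen), ("rejected", rejected)]
    rng.shuffle(labelled)
    return {
        "generations": [text for _, text in labelled],
        "order": [label for label, _ in labelled],
    }


# A seeded generator makes the shuffle reproducible across runs.
row = shuffle_with_labels("same text", "same text", rng=random.Random(0))
print(row["order"])  # always one "chosen" and one "rejected", in either order
```

It can be dropped into the same dataset.map call in place of shuffle_and_track.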
As a custom step

You can also create a custom step in a separate module, import it, and add it to the pipeline after loading the orca_dpo_pairs dataset with the LoadDataFromHub step.

shuffle_step.py
from typing import TYPE_CHECKING, List
from distilabel.steps import GlobalStep, StepInput

if TYPE_CHECKING:
    from distilabel.typing import StepOutput

import random

class ShuffleStep(GlobalStep):
    @property
    def inputs(self):
        """Returns List[str]: The inputs of the step."""
        return ["instruction", "chosen", "rejected"]

    @property
    def outputs(self):
        """Returns List[str]: The outputs of the step."""
        return ["instruction", "generations", "order"]

    def process(self, inputs: StepInput) -> "StepOutput":
        """Returns StepOutput: The outputs of the step."""
        outputs = []

        for input in inputs:
            chosen = input["chosen"]
            rejected = input["rejected"]
            pair = [chosen, rejected]
            random.shuffle(pair)
            order = ["chosen" if x == chosen else "rejected" for x in pair]

            outputs.append({"instruction": input["instruction"], "generations": pair, "order": order})

        yield outputs
from shuffle_step import ShuffleStep
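Whichever route is used, it can be worth checking that the shuffle actually spread the labels across both positions. The sketch below counts which label lands in the first slot of order; the rows are made-up examples with the same shape as the shuffled dataset, and first_position_counts is a hypothetical helper.

```python
from collections import Counter


def first_position_counts(rows):
    """Count which label ends up in the first slot of `order` across rows."""
    return Counter(row["order"][0] for row in rows)


# Toy rows with the same shape as the shuffled dataset (values are made up).
rows = [
    {"order": ["chosen", "rejected"]},
    {"order": ["rejected", "chosen"]},
    {"order": ["rejected", "chosen"]},
]
counts = first_position_counts(rows)
print(counts["chosen"], counts["rejected"])  # 1 2
```

With only 20 rows an uneven split is not alarming; on larger samples the two counts should be roughly equal.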

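Define the pipeline

To clean an existing preference dataset, we will need to define a Pipeline with all the necessary steps. A similar workflow can also be used to clean an SFT dataset. Below, we go over each step in detail.

Load the dataset

We will use the dataset we just shuffled as the source data.

  • Component: LoadDataFromDicts
  • Input columns: system, question, chosen, rejected, generations and order, the same keys as in the loaded list of dictionaries.
  • Output columns: system, instruction, chosen, rejected, generations and order. We will use output_mappings to rename the columns.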
load_dataset = LoadDataFromDicts(
    data=dataset[:1],
    output_mappings={"question": "instruction"},
    pipeline=Pipeline(name="showcase-pipeline"),
)
load_dataset.load()
next(load_dataset.process())
([{'system': '',
   'question': "You will be given a definition of a task first, then some input of the task.\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\n\nAFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\nOutput:",
   'chosen': '[\n  ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"],\n  ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]\n]',
   'rejected': " Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\n\n[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst]\n[Ajax Youth Academy, playsAt, Sportpark De Toekomst]\n\nExplanation:\n\n* AFC Ajax (amateurs) is the subject of the first triplet, and hasGround is the predicate that describes the relationship between AFC Ajax (amateurs) and Sportpark De Toekomst.\n* Ajax Youth Academy is the subject of the second triplet, and playsAt is the predicate that describes the relationship between Ajax Youth Academy and Sportpark De Toekomst.\n\nNote that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.",
   'generations': [" Sure, I'd be happy to help! Here are the RDF triplets for the input sentence:\n\n[AFC Ajax (amateurs), hasGround, Sportpark De Toekomst]\n[Ajax Youth Academy, playsAt, Sportpark De Toekomst]\n\nExplanation:\n\n* AFC Ajax (amateurs) is the subject of the first triplet, and hasGround is the predicate that describes the relationship between AFC Ajax (amateurs) and Sportpark De Toekomst.\n* Ajax Youth Academy is the subject of the second triplet, and playsAt is the predicate that describes the relationship between Ajax Youth Academy and Sportpark De Toekomst.\n\nNote that there may be other possible RDF triplets that could be derived from the input sentence, but the above triplets capture the main relationships present in the sentence.",
    '[\n  ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"],\n  ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]\n]'],
   'order': ['rejected', 'chosen']}],
 True)
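The printed row above still contains a question key, which suggests that the output_mappings rename takes effect when the step runs inside a pipeline rather than on a direct process() call; this is an inference from the output shown here, not a statement about distilabel internals. The intended effect is a plain key rename, sketched below with an illustrative helper (rename_columns is not a distilabel function).

```python
def rename_columns(row, mappings):
    """Return a copy of `row` with keys renamed according to `mappings`."""
    return {mappings.get(key, key): value for key, value in row.items()}


row = {"system": "", "question": "What's the capital of Spain?"}
renamed = rename_columns(row, {"question": "instruction"})
print(list(renamed))  # ['system', 'instruction']
```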

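Evaluate the responses

To evaluate the quality of the responses, we will use meta-llama/Meta-Llama-3.1-70B-Instruct, applying the UltraFeedback task, which judges the responses according to different dimensions (helpfulness, honesty, instruction-following, truthfulness). For an SFT dataset, you can use PrometheusEval instead.

  • Component: UltraFeedback task with LLMs using InferenceEndpointsLLM
  • Input columns: instruction, generations
  • Output columns: ratings, rationales, distilabel_metadata, model_name

For your use case and to improve the results, you can use any other LLM of your choice.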

evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)
[{'instruction': "What's the capital of Spain?",
  'generations': ['Madrid', 'Barcelona'],
  'ratings': [5, 1],
  'rationales': ["The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.",
   "The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent."],
  'distilabel_metadata': {'raw_output_ultra_feedback_0': "#### Output for Text 1\nRating: 5 (Excellent)\nRationale: The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\n\n#### Output for Text 2\nRating: 1 (Low Quality)\nRationale: The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent."},
  'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
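The ratings and rationales columns are derived from the text stored under raw_output_ultra_feedback_0. To illustrate how the two relate, here is a rough sketch that extracts both from text in the format shown above. The regular expression is an assumption based on that one sample and is not distilabel's own parser; parse_ratings is a hypothetical name.

```python
import re

# Pattern assumed from the raw output shown above:
# "Rating: <n> (...)" on one line, then "Rationale: <text>" up to the next "####" block.
BLOCK_RE = re.compile(
    r"Rating:\s*(\d+).*?\nRationale:\s*(.+?)(?=\n\n####|\Z)", re.DOTALL
)


def parse_ratings(raw_output):
    """Return (ratings, rationales) extracted from an UltraFeedback-style raw output."""
    ratings, rationales = [], []
    for rating, rationale in BLOCK_RE.findall(raw_output):
        ratings.append(int(rating))
        rationales.append(rationale.strip())
    return ratings, rationales


raw = (
    "#### Output for Text 1\nRating: 5 (Excellent)\nRationale: Correct answer.\n\n"
    "#### Output for Text 2\nRating: 1 (Low Quality)\nRationale: Wrong city."
)
print(parse_ratings(raw))  # ([5, 1], ['Correct answer.', 'Wrong city.'])
```

Keeping the raw output around is useful when a judge model deviates from the expected format and a rating cannot be extracted.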

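Keep only the required columns

We will get rid of the unneeded columns.

  • Component: KeepColumns
  • Input columns: system, instruction, chosen, rejected, generations, order, ratings, rationales, distilabel_metadata and model_name
  • Output columns: instruction, generations, order, ratings, rationales and model_name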
keep_columns = KeepColumns(
    columns=[
        "instruction",
        "generations",
        "order",
        "ratings",
        "rationales",
        "model_name",
    ],
    pipeline=Pipeline(name="showcase-pipeline"),
)
keep_columns.load()
next(
    keep_columns.process(
        [
            {
                "system": "",
                "instruction": "What's the capital of Spain?",
                "chosen": "Madrid",
                "rejected": "Barcelona",
                "generations": ["Madrid", "Barcelona"],
                "order": ["chosen", "rejected"],
                "ratings": [5, 1],
                "rationales": ["", ""],
                "model_name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
            }
        ]
    )
)
[{'instruction': "What's the capital of Spain?",
  'generations': ['Madrid', 'Barcelona'],
  'order': ['chosen', 'rejected'],
  'ratings': [5, 1],
  'rationales': ['', ''],
  'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
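With order and ratings side by side, each row can be checked for whether the judge agrees with the original label. The following is a minimal sketch under the assumption that ratings are aligned with generations (as in the output above) and that a rating may be missing; label_agrees is a hypothetical helper, not part of distilabel.

```python
def label_agrees(row):
    """Return True/False if the ratings agree with the original label, None if undecidable."""
    # Pair each label ("chosen"/"rejected") with the rating of the generation at that position.
    ratings = dict(zip(row["order"], row["ratings"]))
    chosen, rejected = ratings.get("chosen"), ratings.get("rejected")
    if chosen is None or rejected is None or chosen == rejected:
        return None  # missing rating or a tie: leave for manual review
    return chosen > rejected


row = {"order": ["rejected", "chosen"], "ratings": [2, 5]}
print(label_agrees(row))  # True
```

Rows that return False or None are natural candidates for a closer look during manual curation.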

(Optional) Further data curation

You can use Argilla to further curate your data.

  • Component: PreferenceToArgilla step
  • Input columns: instruction, generations, generation_models, ratings
  • Output columns: instruction, generations, generation_models, ratings
to_argilla = PreferenceToArgilla(
    dataset_name="cleaned-dataset",
    dataset_workspace="argilla",
    api_url="https://[your-owner-name]-[your-space-name].hf.space",
    api_key="[your-api-key]",
    num_generations=2
)

Run the pipeline

Below, you can see the full pipeline definition:

with Pipeline(name="clean-dataset") as pipeline:

    load_dataset = LoadDataFromDicts(
        data=dataset, output_mappings={"question": "instruction"}
    )

    evaluate_responses = UltraFeedback(
        aspect="overall-rating",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
    )

    keep_columns = KeepColumns(
        columns=[
            "instruction",
            "generations",
            "order",
            "ratings",
            "rationales",
            "model_name",
        ]
    )

    to_argilla = PreferenceToArgilla(
        dataset_name="cleaned-dataset",
        dataset_workspace="argilla",
        api_url="https://[your-owner-name]-[your-space-name].hf.space",
        api_key="[your-api-key]",
        num_generations=2,
    )

    load_dataset.connect(evaluate_responses)
    evaluate_responses.connect(keep_columns)
    keep_columns.connect(to_argilla)

Let's now run the pipeline and clean our preference dataset.

distiset = pipeline.run()

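Let's check it! If you have loaded the data to Argilla, you can start annotating in the Argilla UI.

You can push the dataset to the Hub to share it with the community and embed it to explore the data.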

distiset.push_to_hub("[your-owner-name]/example-cleaned-preference-dataset")
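The pipeline produces ratings but does not itself rewrite the chosen and rejected labels. One possible way to use the ratings, sketched below on plain dictionaries, is to re-derive each pair from the higher- and lower-rated generation and to skip ties and unrated rows. How the rows are read out of the distiset is left out on purpose, since that depends on the distilabel version; rebuild_pairs and the sample rows are illustrative.

```python
def rebuild_pairs(rows):
    """Re-derive chosen/rejected from ratings, skipping ties and unrated rows."""
    cleaned = []
    for row in rows:
        first, second = row["ratings"]
        if first is None or second is None or first == second:
            continue  # nothing reliable to learn from this row
        best = 0 if first > second else 1
        cleaned.append(
            {
                "instruction": row["instruction"],
                "chosen": row["generations"][best],
                "rejected": row["generations"][1 - best],
            }
        )
    return cleaned


# Made-up rows in the shape produced by the KeepColumns step.
rows = [
    {"instruction": "What's the capital of Spain?", "generations": ["Barcelona", "Madrid"], "ratings": [1, 5]},
    {"instruction": "2 + 2?", "generations": ["4", "four"], "ratings": [5, 5]},
]
print(rebuild_pairs(rows))
```

Relabelling purely from a single judge model is a strong choice; reviewing the disagreements in Argilla first is the more conservative option.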

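Conclusions

In this tutorial, we showcased the detailed steps to build a pipeline for cleaning a preference dataset using distilabel. You can customize this pipeline for your own use cases, such as cleaning an SFT dataset or adding custom steps.

We used a preference dataset as our starting point and shuffled the data to avoid any bias. Next, we evaluated the responses with a model served through the serverless Hugging Face Inference API, following the UltraFeedback standards. Finally, we kept the needed columns and used Argilla for further curation.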