生成偏好数据集¶

目标：为 DPO/ORPO 生成合成偏好数据集。
库：argilla, hf-inference-endpoints
组件：LoadDataFromHub, TextGeneration, UltraFeedback, GroupColumns, FormatTextGenerationDPO, PreferenceToArgilla, InferenceEndpointsLLM

开始使用¶

安装依赖项¶

要完成本教程，您需要通过 pip 安装 distilabel SDK 和一些第三方库。在本教程中，我们将使用免费但速率受限的 Hugging Face 无服务器推理 API，因此我们需要将其作为额外的 distilabel 依赖项安装。您可以通过运行以下命令来安装它们

!pip install "distilabel[hf-inference-endpoints]"

!pip install "transformers~=4.0" "torch~=2.0"

让我们进行所需的导入

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    LoadDataFromHub,
    GroupColumns,
    FormatTextGenerationDPO,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback

您需要一个 HF_TOKEN 才能使用 HF Inference Endpoints。登录以在本笔记本中直接使用它。

import os
from huggingface_hub import login

login(token=os.getenv("HF_TOKEN"), add_to_git_credential=True)

（可选）部署 Argilla¶

您可以跳过此步骤或将其替换为任何其他数据评估工具，但是您的模型质量会因缺乏数据质量而受到影响，因此我们建议您查看您的数据。如果您已经部署了 Argilla，则可以跳过此步骤。否则，您可以按照本指南快速部署 Argilla。

同时，您还需要将 Argilla 作为 distilabel 的额外组件安装。

!pip install "distilabel[argilla, hf-inference-endpoints]"

定义 Pipeline¶

为了生成我们的偏好数据集，我们将需要定义一个包含所有必要步骤的 Pipeline。下面，我们将详细介绍每个步骤。

加载数据集¶

我们将使用来自 Hugging Face Hub 的 argilla/10Kprompts-mini 数据集作为源数据。

组件：LoadDataFromHub
输入列：instruction 和 topic，与加载的数据集中的相同
输出列：instruction 和 topic

load_dataset = LoadDataFromHub(
        repo_id= "argilla/10Kprompts-mini",
        num_examples=1,
        pipeline=Pipeline(name="showcase-pipeline"),
    )
load_dataset.load()
next(load_dataset.process())

([{'instruction': 'How can I create an efficient and robust workflow that utilizes advanced automation techniques to extract targeted data, including customer information, from diverse PDF documents and effortlessly integrate it into a designated Google Sheet? Furthermore, I am interested in establishing a comprehensive and seamless system that promptly activates an SMS notification on my mobile device whenever a new PDF document is uploaded to the Google Sheet, ensuring real-time updates and enhanced accessibility.',
   'topic': 'Software Development'}],
 True)

生成响应¶

我们需要为给定的指令生成响应。我们将使用 Hugging Face Hub 上通过 Serverless Inference API 提供的两个不同的模型：meta-llama/Meta-Llama-3-8B-Instruct 和 mistralai/Mixtral-8x7B-Instruct-v0.1。我们还将指示每个模型的生成参数。

组件：使用 InferenceEndpointsLLM 的 LLM 的 TextGeneration 任务
输入列：instruction
输出列：每个模型的 generation、distilabel_metadata、model_name

对于您的用例并为了改进结果，您可以使用您选择的任何其他 LLM。

generate_responses = [
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
        pipeline=Pipeline(name="showcase-pipeline"),
    ),
    TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
            tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        ),
        pipeline=Pipeline(name="showcase-pipeline"),
    ),
]
for task in generate_responses:
    task.load()
    print(next(task.process([{"instruction": "Which are the top cities in Spain?"}])))

[{'instruction': 'Which are the top cities in Spain?', 'generation': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. **San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.', 'distilabel_metadata': {'raw_output_text_generation_0': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. **San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.'}, 'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct'}]
[{'instruction': 'Which are the top cities in Spain?', 'generation': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.', 'distilabel_metadata': {'raw_output_text_generation_0': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.'}, 'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1'}]

分组响应¶

评估响应的任务需要将生成列表作为输入。但是，每个模型响应都保存在子集 text_generation_0 和 text_generation_1 的 generation 列中。我们将把这两列合并为单列和 default 子集。

组件：GroupColumns
输入列：来自 text_generation_0 和 text_generation_1 的 generation 和 model_name
输出列：generations 和 model_names

group_responses = GroupColumns(
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
    pipeline=Pipeline(name="showcase-pipeline"),
)
next(
    group_responses.process(
        [
            {
                "generation": "Madrid",
                "model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
            },
        ],
        [
            {
                "generation": "Barcelona",
                "model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            }
        ],
    )
)

[{'generations': ['Madrid', 'Barcelona'],
  'model_names': ['meta-llama/Meta-Llama-3-8B-Instruct',
   'mistralai/Mixtral-8x7B-Instruct-v0.1']}]

评估响应¶

为了构建我们的偏好数据集，我们需要评估模型生成的响应。我们将使用 meta-llama/Meta-Llama-3-70B-Instruct 来执行此操作，应用 UltraFeedback 任务，该任务根据不同的维度（帮助性、诚实性、指令遵循、真实性）判断响应。

组件：使用 InferenceEndpointsLLM 的 LLM 的 UltraFeedback 任务
输入列：instruction、generations
输出列：ratings、rationales、distilabel_metadata、model_name

对于您的用例并为了改进结果，您可以使用您选择的任何其他 LLM。

evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)

[{'instruction': "What's the capital of Spain?",
  'generations': ['Madrid', 'Barcelona'],
  'ratings': [5, 1],
  'rationales': ["The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.",
   "The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent."],
  'distilabel_metadata': {'raw_output_ultra_feedback_0': "#### Output for Text 1\nRating: 5 (Excellent)\nRationale: The answer is correct, directly addressing the question, and is free of hallucinations or unnecessary details. It confidently provides the accurate information, aligning perfectly with the user's intent.\n\n#### Output for Text 2\nRating: 1 (Low Quality)\nRationale: The answer is incorrect as Barcelona is not the capital of Spain. This introduces a significant inaccuracy, failing to provide helpful information and deviating entirely from the user's intent."},
  'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'}]

转换为偏好数据集¶

您可以使用 chosen 和 rejected 列自动将其转换为偏好数据集。
- 组件：FormatTextGenerationDPO 步骤
- 输入列：instruction、generations、generation_models、ratings
- 输出列：prompt、prompt_id、chosen、chosen_model、chosen_rating、rejected、rejected_model、rejected_rating

format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name="showcase-pipeline"))
format_dpo.load()
next(
    format_dpo.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
                "generation_models": [
                    "Meta-Llama-3-8B-Instruct",
                    "Mixtral-8x7B-Instruct-v0.1",
                ],
                "ratings": [5, 1],
            }
        ]
    )
)

[{'instruction': "What's the capital of Spain?",
  'generations': ['Madrid', 'Barcelona'],
  'generation_models': ['Meta-Llama-3-8B-Instruct',
   'Mixtral-8x7B-Instruct-v0.1'],
  'ratings': [5, 1],
  'prompt': "What's the capital of Spain?",
  'prompt_id': '26174c953df26b3049484e4721102dca6b25d2de9e3aa22aa84f25ed1c798512',
  'chosen': [{'role': 'user', 'content': "What's the capital of Spain?"},
   {'role': 'assistant', 'content': 'Madrid'}],
  'chosen_model': 'Meta-Llama-3-8B-Instruct',
  'chosen_rating': 5,
  'rejected': [{'role': 'user', 'content': "What's the capital of Spain?"},
   {'role': 'assistant', 'content': 'Barcelona'}],
  'rejected_model': 'Mixtral-8x7B-Instruct-v0.1',
  'rejected_rating': 1}]

或者，您可以使用 Argilla 手动标记数据并将其转换为偏好数据集。
- 组件：PreferenceToArgilla 步骤
- 输入列：instruction、generations、generation_models、ratings
- 输出列：instruction、generations、generation_models、ratings

to_argilla = PreferenceToArgilla(
    dataset_name="preference-dataset",
    dataset_workspace="argilla",
    api_url="https://[your-owner-name]-[your-space-name].hf.space",
    api_key="[your-api-key]",
    num_generations=2
)

运行 Pipeline¶

下面，您可以看到完整的 Pipeline 定义

with Pipeline(name="generate-dataset") as pipeline:

    load_dataset = LoadDataFromHub(repo_id="argilla/10Kprompts-mini")

    generate_responses = [
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
        TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
                generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
            )
        ),
    ]

    group_responses = GroupColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    evaluate_responses = UltraFeedback(
        aspect="overall-rating",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
        )
    )

    format_dpo = FormatTextGenerationDPO()

    to_argilla = PreferenceToArgilla(
        dataset_name="preference-dataset",
        dataset_workspace="argilla",
        api_url="https://[your-owner-name]-[your-space-name].hf.space",
        api_key="[your-api-key]",
        num_generations=2
    )

    for task in generate_responses:
        load_dataset.connect(task)
        task.connect(group_responses)
    group_responses.connect(evaluate_responses)
    evaluate_responses.connect(format_dpo, to_argilla)

现在让我们运行 Pipeline 并生成偏好数据集。

distiset = pipeline.run()

让我们检查偏好数据集！如果您已将数据加载到 Argilla，则可以在 Argilla UI 中开始注释。

您可以将数据集推送到 Hub 以与社区共享，并嵌入它以探索数据。

distiset.push_to_hub("[your-owner-name]/example-preference-dataset")

结论¶

在本教程中，我们展示了使用 distilabel 构建用于生成偏好数据集的 Pipeline 的详细步骤。您可以为自己的用例自定义此 Pipeline，并通过 Hugging Face Hub 与社区共享您的数据集，或使用它们来训练 DPO 或 ORPO 模型。

我们使用包含提示的数据集，通过无服务器 Hugging Face Inference API 使用两个不同的模型生成响应。接下来，我们使用第三个模型评估响应，遵循 UltraFeedback 标准。最后，我们将数据转换为偏好数据集，并使用 Argilla 进行进一步的策展。