UltraFeedback

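UltraFeedback: Boosting Language Models with High-quality Feedback is a paper published by OpenBMB that proposes UltraFeedback, a large-scale, fine-grained, diverse preference dataset for training powerful reward models and critic models.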

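UltraFeedback collects about 64k prompts from diverse resources (including UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN). These prompts are then used to query multiple LLMs (commercial models, Llama models ranging from 7B to 70B, and non-Llama models), generating four different responses for each prompt, for a total of 256k samples. In other words, each OpenAI request in UltraFeedback rates the four responses to a single prompt.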

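To collect high-quality preference and textual feedback, the authors designed a fine-grained annotation instruction that covers four aspects: instruction-following, truthfulness, honesty, and helpfulness (the paper also mentions a fifth aspect, verbalized calibration). Finally, GPT-4 is used to rate the responses generated for each prompt along these aspects.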

Replication

To replicate the paper we will use distilabel and, for testing purposes, a smaller dataset created by the Hugging Face H4 team: HuggingFaceH4/instruction-dataset.

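Also for testing purposes, we will only show how to evaluate the responses generated for a given prompt using a new global aspect named overall-rating, defined by Argilla, which computes the average of the four aspects in order to reduce the number of requests sent to OpenAI. Note, however, that all the aspects are implemented in distilabel and can be used instead for a more faithful reproduction. In addition, we will generate three responses per instruction using three LLMs sampled from a pool of six: HuggingFaceH4/zephyr-7b-beta, argilla/notus-7b-v1, google/gemma-1.1-7b-it, meta-llama/Meta-Llama-3-8B-Instruct, HuggingFaceH4/zephyr-7b-gemma-v0.1, and mlabonne/UltraMerge-7B.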

Installation

To replicate UltraFeedback, install distilabel as follows:

pip install "distilabel[argilla,openai,vllm]>=1.0.0"

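Since we will be using vllm, we will need a VM with at least 6 NVIDIA GPUs, each with at least 16GB of memory, to run the text generation. We also need to set the OPENAI_API_KEY environment variable.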
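Before launching a long run, it can help to confirm that the OpenAI key is actually visible to the Python process. The following is a minimal sketch using only the standard library; `check_openai_key` is a hypothetical helper introduced here for illustration and is not part of distilabel.

```python
import os


def check_openai_key(env=None):
    """Return True if OPENAI_API_KEY is present and non-empty.

    `env` defaults to the real process environment; passing a plain dict
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


# Report the status for the current process without raising.
if check_openai_key():
    print("OPENAI_API_KEY is set.")
else:
    print("OPENAI_API_KEY is missing; export it before running the pipeline.")
```

Failing fast here avoids loading six models onto the GPUs only to have the rating step fail afterwards.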

Building blocks

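LoadDataFromHub: Generator step that loads a dataset from the Hugging Face Hub.
sample_n_steps: Function that creates a routing_batch_function which samples n downstream steps for each batch produced by the upstream step. This is the key to replicating the LLM pooling mechanism described in the paper.
TextGeneration: Task that generates responses for a given instruction using an LLM.
- vLLM: LLM that loads a model from the Hugging Face Hub using vllm.
GroupColumns: Task that combines multiple columns into a single one, i.e. from strings to a list of strings. Useful when several parallel steps are connected to the same node.
UltraFeedback: Task that generates ratings for the responses to a given instruction using the UltraFeedback prompt.
- OpenAILLM: LLM that loads a model from OpenAI.
KeepColumns: Task that keeps the desired columns and drops the unneeded ones, while also defining the order of those columns.
(optional) PreferenceToArgilla: Task that optionally pushes the generated dataset to Argilla for further analysis and human annotation.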
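To make the pooling and grouping behavior above more concrete, here is a plain-Python sketch that mimics what the routing and column-grouping blocks are described as doing. It does not use distilabel: `sample_n_steps_sketch` and `group_columns_sketch` are hypothetical stand-ins, the short pool names are illustrative, and the responses are placeholder strings rather than real model output.

```python
import random


def sample_n_steps_sketch(n, rng):
    """Return a router that picks n distinct downstream steps for each batch."""

    def route(step_names):
        return rng.sample(step_names, n)

    return route


def group_columns_sketch(rows, columns, output_columns):
    """Merge the parallel rows produced for one instruction into a single row.

    Each name in `columns` is collected across rows into a list stored under
    the matching name in `output_columns`; other columns come from the first row.
    """
    merged = {key: value for key, value in rows[0].items() if key not in columns}
    for column, output_column in zip(columns, output_columns):
        merged[output_column] = [row[column] for row in rows]
    return merged


llm_pool = ["notus", "zephyr", "gemma", "llama", "zephyr_gemma", "ultramerge"]
route = sample_n_steps_sketch(n=3, rng=random.Random(0))
chosen = route(llm_pool)

# One row per selected LLM for the same instruction (placeholder responses).
rows = [
    {
        "instruction": "Say hi",
        "generation": f"<response from {name}>",
        "generation_model": name,
    }
    for name in chosen
]

grouped = group_columns_sketch(
    rows,
    columns=["generation", "generation_model"],
    output_columns=["generations", "generation_models"],
)
print(grouped)
```

The grouped row has the shape the rating step expects: one instruction with a list of candidate responses and the list of models that produced them.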

Code

As mentioned before, we will put the building blocks above together to replicate UltraFeedback.

from distilabel.models import OpenAILLM, vLLM
from distilabel.pipeline import Pipeline, sample_n_steps
from distilabel.steps import (
    GroupColumns,
    KeepColumns,
    LoadDataFromHub,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback

sample_three_llms = sample_n_steps(n=3)


with Pipeline(name="ultrafeedback-pipeline") as pipeline:
    load_hub_dataset = LoadDataFromHub(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
        batch_size=2,
    )

    text_generation_with_notus = TextGeneration(
        name="text_generation_with_notus",
        llm=vLLM(model="argilla/notus-7b-v1"),
        input_batch_size=2,
        output_mappings={"model_name": "generation_model"},
    )
    text_generation_with_zephyr = TextGeneration(
        name="text_generation_with_zephyr",
        llm=vLLM(model="HuggingFaceH4/zephyr-7b-beta"),
        input_batch_size=2,
        output_mappings={"model_name": "generation_model"},
    )
    text_generation_with_gemma = TextGeneration(
        name="text_generation_with_gemma",
        llm=vLLM(model="google/gemma-1.1-7b-it"),
        input_batch_size=2,
        output_mappings={"model_name": "generation_model"},
    )
    text_generation_with_zephyr_gemma = TextGeneration(
        name="text_generation_with_zephyr_gemma",
        llm=vLLM(model="HuggingFaceH4/zephyr-7b-gemma-v0.1"),
        input_batch_size=2,
        output_mappings={"model_name": "generation_model"},
    )
    text_generation_with_llama = TextGeneration(
        name="text_generation_with_llama",
        llm=vLLM(model="meta-llama/Meta-Llama-3-8B-Instruct"),
        input_batch_size=2,
        output_mappings={"model_name": "generation_model"},
    )
    text_generation_with_ultramerge = TextGeneration(
        name="text_generation_with_ultramerge",
        llm=vLLM(model="mlabonne/UltraMerge-7B"),
        input_batch_size=2,
        output_mappings={"model_name": "generation_model"},
    )

    combine_columns = GroupColumns(
        name="combine_columns",
        columns=["generation", "generation_model"],
        output_columns=["generations", "generation_models"],
        input_batch_size=2
    )

    ultrafeedback = UltraFeedback(
        name="ultrafeedback_openai",
        llm=OpenAILLM(model="gpt-4-turbo-2024-04-09"),
        aspect="overall-rating",
        output_mappings={"model_name": "ultrafeedback_model"},
    )

    keep_columns = KeepColumns(
        name="keep_columns",
        columns=[
            "instruction",
            "generations",
            "generation_models",
            "ratings",
            "rationales",
            "ultrafeedback_model",
        ],
    )

    (
        load_hub_dataset
        >> sample_three_llms
        >> [
            text_generation_with_notus,
            text_generation_with_zephyr,
            text_generation_with_gemma,
            text_generation_with_llama,
            text_generation_with_zephyr_gemma,
            text_generation_with_ultramerge
        ]
        >> combine_columns
        >> ultrafeedback
        >> keep_columns
    )

    # Optional: Push the generated dataset to Argilla, but will need to `pip install argilla` first
    # push_to_argilla = PreferenceToArgilla(
    #     name="push_to_argilla",
    #     api_url="<ARGILLA_API_URL>",
    #     api_key="<ARGILLA_API_KEY>",  # type: ignore
    #     dataset_name="ultrafeedback",
    #     dataset_workspace="admin",
    #     num_generations=3,
    # )
    # keep_columns >> push_to_argilla

Note

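Because we are using a relatively small dataset, we set low values for batch_size and input_batch_size so that our routing_batch_function sees more batches, i.e. there is more variety in the LLMs used to generate the responses. With a large dataset, larger values for batch_size and input_batch_size are recommended in order to benefit from vLLM's optimizations for bigger batch sizes, which make the Pipeline run faster.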
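The trade-off in this note can be illustrated with a little arithmetic. The sketch below assumes that routing happens once per batch, as described for sample_n_steps; the dataset size of 100 rows is a made-up figure for illustration, not the size of the actual dataset.

```python
import math


def num_batches(num_rows, batch_size):
    """Number of batches the loader emits, which is also the number of routing draws."""
    return math.ceil(num_rows / batch_size)


num_rows = 100  # hypothetical dataset size, for illustration only

# There are C(6, 3) distinct ways to pick 3 LLMs out of a pool of 6.
possible_combinations = math.comb(6, 3)

for batch_size in (2, 50):
    draws = num_batches(num_rows, batch_size)
    print(
        f"batch_size={batch_size}: {draws} routing draws "
        f"over {possible_combinations} possible LLM combinations"
    )
```

With a batch size of 2 there are 50 draws, so many of the 20 combinations are likely to appear; with a batch size of 50 there are only 2 draws, so at most 2 combinations are ever used.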

We then need to call pipeline.run with the runtime parameters so that the Pipeline can be launched.

distiset = pipeline.run(
    parameters={
        load_hub_dataset.name: {
            "repo_id": "HuggingFaceH4/instruction-dataset",
            "split": "test",
        },
        text_generation_with_notus.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 512,
                    "temperature": 0.7,
                }
            },
        },
        text_generation_with_zephyr.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 512,
                    "temperature": 0.7,
                }
            },
        },
        text_generation_with_gemma.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 512,
                    "temperature": 0.7,
                }
            },
        },
        text_generation_with_llama.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 512,
                    "temperature": 0.7,
                }
            },
        },
        text_generation_with_zephyr_gemma.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 512,
                    "temperature": 0.7,
                }
            },
        },
        text_generation_with_ultramerge.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 512,
                    "temperature": 0.7,
                }
            },
        },
        ultrafeedback.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 2048,
                    "temperature": 0.7,
                }
            },
        },
    }
)
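The runtime parameters above repeat the same generation settings for every text generation step. As a sketch, the same dictionary could be assembled programmatically from the step names defined in the pipeline; `build_runtime_parameters` is a hypothetical helper written for this example, not a distilabel function.

```python
def build_runtime_parameters(loader_name, generation_names, judge_name):
    """Build the nested runtime-parameter dict keyed by step name."""

    def llm_kwargs(max_new_tokens):
        # A fresh dict per call, so steps never share mutable state.
        return {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": max_new_tokens,
                    "temperature": 0.7,
                }
            }
        }

    parameters = {
        loader_name: {
            "repo_id": "HuggingFaceH4/instruction-dataset",
            "split": "test",
        }
    }
    for name in generation_names:
        parameters[name] = llm_kwargs(512)
    parameters[judge_name] = llm_kwargs(2048)
    return parameters


generation_names = [
    "text_generation_with_notus",
    "text_generation_with_zephyr",
    "text_generation_with_gemma",
    "text_generation_with_llama",
    "text_generation_with_zephyr_gemma",
    "text_generation_with_ultramerge",
]
parameters = build_runtime_parameters(
    loader_name="load_dataset",
    generation_names=generation_names,
    judge_name="ultrafeedback_openai",
)
print(sorted(parameters))
```

The resulting dictionary has the same keys and values as the literal one, so it could be passed as the `parameters` argument of pipeline.run.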

Finally, we can optionally push the generated dataset, a Distiset, to the Hugging Face Hub via the push_to_hub method, so that every subset generated in the leaf steps is pushed to the Hub.

distiset.push_to_hub(
    "ultrafeedback-instruction-dataset",
    private=True,
)
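Since the resulting rows keep the instruction, generations, and ratings columns, one possible next step is to turn each row into a chosen/rejected pair for preference training. The sketch below is an illustrative assumption rather than part of the pipeline: `to_preference_pair` is a hypothetical helper, the example row is made up, and it assumes a rating may be None when the judge output could not be parsed.

```python
def to_preference_pair(row):
    """Pick the highest- and lowest-rated generations of a row.

    Returns None when fewer than two generations have a rating or when
    the best and worst ratings are tied, since no preference can be derived.
    """
    scored = [
        (rating, generation)
        for rating, generation in zip(row["ratings"], row["generations"])
        if rating is not None
    ]
    if len(scored) < 2:
        return None
    scored.sort(key=lambda pair: pair[0])
    worst, best = scored[0], scored[-1]
    if best[0] == worst[0]:
        return None
    return {
        "instruction": row["instruction"],
        "chosen": best[1],
        "rejected": worst[1],
    }


# Made-up row with the same columns kept by KeepColumns.
example_row = {
    "instruction": "Say hi",
    "generations": ["hello", "Hi there! How can I help?", "..."],
    "ratings": [3, 5, None],
}
print(to_preference_pair(example_row))
```

Dropping tied or unrated rows keeps the pairs unambiguous, at the cost of discarding some data.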