DEITA¶
DEITA (Data-Efficient Instruction Tuning for Alignment) studies an automatic data selection process: first, it quantifies data quality based on complexity, quality and diversity; second, it selects the best potential combination from an open-source dataset to fit the budget you allocate to tune your own LLM.
In most settings we cannot allocate unlimited resources to instruction-tune an LLM. The DEITA authors therefore investigated how to select qualitative data for instruction tuning based on the principle of fewer, high-quality samples. Liu et al. tackle the issue of first defining good data and second identifying it, in order to respect an initial budget for instruction-tuning your LLM.
The strategy uses LLMs to replace humans in the time-consuming task of assessing data quality on instruction-tuning datasets. DEITA introduces a way of measuring data quality along three critical dimensions: complexity, quality and diversity.
As you can see, we are once again working with an instruction/response dataset, and by learning how to optimize the response to a given instruction through the comparison of several candidates, we are in a sense reproducing the second step.
Datasets and budgets¶
We will dive deeper into the whole process and look at each stage so that we can efficiently select the final dataset for supervised fine-tuning under a budget constraint. We will tackle the technical challenges by explaining exactly how you would assess good data, as presented in the paper.
As a reminder, we are looking for a strategy to automatically select good data for the instruction-tuning step when you want to fine-tune an LLM for your own use case while taking your resource constraints into account. This means that you cannot blindly train a model on whatever data you come across on the internet.
The DEITA authors assume that you have access to open-source datasets that fit your use case. This may not be entirely true, but with the open-source community tackling many use cases, and projects such as BLOOM or AYA, it is likely that your use case will be addressed at some point. Furthermore, you can generate your own instruction/response pairs with methods such as self-instruct using distilabel. This tutorial assumes we have a data pool with an excessive number of samples relative to the project's cost constraints. In short, we aim to achieve adequate performance from fewer samples.
The authors claim that the subsample size "correlates proportionally with the computation consumed in instruction tuning". As a first approximation, reducing the sample size therefore means reducing computation consumption, and thus the total development cost. To reproduce the paper's notation, we associate the budget m with the number of instruction/response pairs that you can set depending on your real budget.
To match the experimental setup, the dataset X_sota is a meta-dataset combining the major open-source datasets available to instruct-tune an LLM. It is composed of ShareGPT (58k instruction/response pairs), UltraChat (105k instruction/response pairs) and WizardLM (143k instruction/response pairs), which sums up to more than 300k instruction/response pairs. We aim to reduce the final subsample to 6k instruction/response pairs.
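As a purely illustrative sanity check (not part of the original pipeline), the budget m corresponds to roughly 2% of the pool, using the dataset sizes quoted above:
# Illustrative only: pool sizes taken from the experimental setup described above.
pool_sizes = {"ShareGPT": 58_000, "UltraChat": 105_000, "WizardLM": 143_000}
pool_total = sum(pool_sizes.values())  # 306,000 pairs in X_sota
budget_m = 6_000                       # target number of selected pairs
print(f"Selecting {budget_m:,} of {pool_total:,} pairs ({budget_m / pool_total:.1%} of the pool)")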
Setting up the notebook and packages¶
Let's prepare our dependencies:
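For instance, an install along these lines should work (the exact extras may differ depending on your distilabel version; the openai and hf-transformers extras are assumed here to cover OpenAILLM and TransformersLLM):
pip install "distilabel[openai,hf-transformers]" --upgrade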
Import distilabel:
from distilabel.models import TransformersLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import ConversationTemplate, DeitaFiltering, ExpandColumns, LoadDataFromHub
from distilabel.steps.tasks import ComplexityScorer, EvolInstruct, EvolQuality, GenerateEmbeddings, QualityScorer
Define the distilabel Pipeline and load the dataset from the Hugging Face Hub.
pipeline = Pipeline(name="DEITA")
load_data = LoadDataFromHub(
name="load_data", batch_size=100, output_mappings={"prompt": "instruction"}, pipeline=pipeline
)
EVOL-INSTRUCT: Generating instructions with an LLM¶
Evol-Instruct automatically creates complex instruction data for training large language models (LLMs) by iteratively rewriting an initial set of instructions into more complex forms. The generated data is then used to fine-tune a model named WizardLM.
Evaluations show that instructions from Evol-Instruct are superior to human-created ones, and that WizardLM achieves performance close to or exceeding GPT-3.5-turbo in many skills. In distilabel, we initialize each step of the data generation pipeline; later, we will connect them all together.
evol_instruction_complexity = EvolInstruct(
name="evol_instruction_complexity",
llm=OpenAILLM(model="gpt-3.5-turbo"),
num_evolutions=5,
store_evolutions=True,
generate_answers=True,
include_original_instruction=True,
pipeline=pipeline,
)
evol_instruction_complexity.load()
_evolved_instructions = next(evol_instruction_complexity.process(
([{"instruction": "How many fish are there in a dozen fish?"}]))
)
print(*_evolved_instructions, sep="\n")
Output
( 1, 'How many fish are there in a dozen fish?')
( 2, 'How many rainbow trout are there in a dozen rainbow trout?')
( 3, 'What is the average weight in pounds of a dozen rainbow trout caught in a specific river in Alaska during the month of May?')
EVOL COMPLEXITY: Evaluating the complexity of generated instructions¶
The second step is to evaluate the complexity of the instruction within a given instruction-response pair. Like EVOL-INSTRUCT, this method uses LLMs instead of humans to automatically improve instructions, specifically through their complexity. Starting from any instruction-response pair \((I, R)\), we first generate new instructions following the In-Depth Evolving response. As the authors explain, we generate more complex instructions through prompting, by adding constraints or reasoning steps. Let's take an example from GPT-4-LLM, a project which aims to generate observations with GPT-4 to instruct-tune LLMs via supervised fine-tuning, and whose seed instruction \(instruction_0\) we will reuse later in the EVOL-QUALITY section:
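instruction_0 = "Give three tips for staying healthy."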
To make it more complex, you can, as the authors did, use prompt templates to add constraints or to deepen the instruction. They provide several prompts in the appendix of the paper. For example, this one is used to add constraints:
PROMPT = """I want you act as a Prompt Rewriter.
Your objective is to rewrite a given prompt into a more complex version to
make those famous AI systems (e.g., ChatGPT and GPT4) a bit harder to handle.
But the rewritten prompt must be reasonable and must be understood and
responded by humans.
Your rewriting cannot omit the non-text parts such as the table and code in
#Given Prompt#:. Also, please do not omit the input in #Given Prompt#.
You SHOULD complicate the given prompt using the following method:
Please add one more constraints/requirements into #Given Prompt#
You should try your best not to make the #Rewritten Prompt# become verbose,
#Rewritten Prompt# can only add 10 to 20 words into #Given Prompt#.
‘#Given Prompt#’, ‘#Rewritten Prompt#’, ‘given prompt’ and ‘rewritten prompt’
are not allowed to appear in #Rewritten Prompt#
#Given Prompt#:
<Here is instruction>
#Rewritten Prompt#:
"""
Prompting an LLM with this template, you automatically obtain a more complex instruction, called \(instruction_1\), from the initial instruction \(instruction_0\):
instruction_1 = "Provide three recommendations for maintaining well-being, ensuring one focuses on mental health."
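For illustration only, here is a minimal sketch of how such a rewrite could be obtained by calling an LLM directly, outside of distilabel. It assumes the openai Python client, an OPENAI_API_KEY in the environment, and gpt-3.5-turbo as the rewriting model; the evolve_instruction helper is a hypothetical name, not part of the tutorial's pipeline:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def evolve_instruction(instruction: str) -> str:
    # Fill the "add constraints" template from the paper with the seed instruction
    prompt = PROMPT.replace("<Here is instruction>", instruction)
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return completion.choices[0].message.content

# evolve_instruction(instruction_0) should return something along the lines of instruction_1 above.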
With the sequence of evolved instructions, we then use a further LLM to automatically rank and score them. We provide the 6 instructions at the same time: by submitting them all together, we force the scoring model to look at the subtle complexity differences between evolved instructions and encourage it to discriminate between them. In our example, \(instruction_0\) and \(instruction_1\) could obtain the same score independently, but when compared together we notice the slight difference that makes \(instruction_1\) more complex.
In distilabel, we implement it like this:
instruction_complexity_scorer = ComplexityScorer(
name="instruction_complexity_scorer",
llm=OpenAILLM(model="gpt-3.5-turbo"),
input_mappings={"instructions": "evolved_instructions"},
pipeline=pipeline,
)
expand_evolved_instructions = ExpandColumns(
name="expand_evolved_instructions",
columns=["evolved_instructions", "answers", "scores"],
output_mappings={
"evolved_instructions": "evolved_instruction",
"answers": "answer",
"scores": "evol_instruction_score",
},
pipeline=pipeline,
)
instruction_complexity_scorer.load()
_evolved_instructions = next(instruction_complexity_scorer.process(([{"evolved_instructions": [PROMPT + instruction_1]}])))
print("Original Instruction:")
print(instruction_1)
print("\nEvolved Instruction:")
print(_evolved_instructions[0]["evolved_instructions"][0].split("#Rewritten Prompt#:\n")[1])
Output
Original Instruction:
Provide three recommendations for maintaining well-being, ensuring one focuses on mental health.
Evolved Instruction:
Suggest three strategies for nurturing overall well-being, with the stipulation that at least one explicitly addresses the enhancement of mental health, incorporating evidence-based practices.
EVOL-QUALITY: Quality evaluation¶
Now that we have scored the complexity of the instructions, we focus on the quality of the responses. Similarly to EVOL COMPLEXITY, the authors introduce EVOL QUALITY, a method based on LLMs instead of humans, to automatically score the quality of a response.
Starting from an instruction-response pair \((I, R)\), the goal is to make the response evolve into a more helpful and relevant one. The key difference is that we also need to provide the initial instruction to guide the evolution. Let's return to our example from GPT-4-LLM.
Here we have the response \(response_0\) and its initial instruction \(instruction_0\):
instruction_0 = "Give three tips for staying healthy."
response_0 = "1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases. 2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week. 3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night."
Again, the authors provide several prompts you can use to make your responses evolve according to given guidelines. For example, this one is used to enrich the answer:
PROMPT = """I want you to act as a Response Rewriter
Your goal is to enhance the quality of the response given by an AI assistant
to the #Given Prompt# through rewriting.
But the rewritten response must be reasonable and must be understood by humans.
Your rewriting cannot omit the non-text parts such as the table and code in
#Given Prompt# and #Given Response#. Also, please do not omit the input
in #Given Prompt#.
You Should enhance the quality of the response using the following method:
Please make the Response more in-depth
You should try your best not to make the #Rewritten Response# become verbose,
#Rewritten Response# can only add 10 to 20 words into #Given Response#.
‘#Given Response#’, ‘#Rewritten Response#’, ‘given response’ and ‘rewritten response’
are not allowed to appear in #Rewritten Response#
#Given Prompt#:
<instruction_0>
#Given Response#:
<response_0>
#Rewritten Response#:
"""
Prompting an LLM with this template, you automatically obtain a more enriched response, called \(response_1\), from the initial response \(response_0\) and the initial instruction \(instruction_0\):
evol_response_quality = EvolQuality(
name="evol_response_quality",
llm=OpenAILLM(model="gpt-3.5-turbo"),
num_evolutions=5,
store_evolutions=True,
include_original_response=True,
input_mappings={
"instruction": "evolved_instruction",
"response": "answer",
},
pipeline=pipeline,
)
evol_response_quality.load()
_evolved_responses = next(evol_response_quality.process([{"instruction": PROMPT + instruction_0, "response": response_0}]))
print("Original Response:")
print(response_0)
print("\nEvolved Response:")
print(*_evolved_responses[0]['evolved_responses'], sep="\n")
Now, as in EVOL COMPLEXITY, you iterate over this path and use different prompts to make your responses more relevant, helpful or creative. In the paper, they performed 4 additional iterations to obtain 5 evolved responses \((R_0, R_1, R_2, R_3, R_4)\), which gives 5 different responses for one initial instruction at the end of this step.
response_quality_scorer = QualityScorer(
name="response_quality_scorer",
llm=OpenAILLM(model="gpt-3.5-turbo"),
input_mappings={
"instruction": "evolved_instruction",
"responses": "evolved_responses",
},
pipeline=pipeline,
)
expand_evolved_responses = ExpandColumns(
name="expand_evolved_responses",
columns=["evolved_responses", "scores"],
output_mappings={
"evolved_responses": "evolved_response",
"scores": "evol_response_score",
},
pipeline=pipeline,
)
response_quality_scorer.load()
_scored_responses = next(response_quality_scorer.process([{"instruction": PROMPT + instruction_0, "responses": _evolved_responses[0]['evolved_responses']}]))
print("Original Response:")
print(response_0)
print("\nScore, Evolved Response:")
print(*zip(_scored_responses[0]["scores"], _evolved_responses[0]['evolved_responses']), sep="\n")
Output
Original Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases. 2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week. 3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.
Score, Evolved Response:
(4.0, 'Here are three essential tips for maintaining good health: \n1. Prioritize regular exercise \n2. Eat a balanced diet with plenty of fruits and vegetables \n3. Get an adequate amount of sleep each night.')
(2.0, 'Here are three effective strategies to maintain a healthy lifestyle.')
(5.0, 'Here are three practical tips to maintain good health: Ensure a balanced diet, engage in regular exercise, and prioritize sufficient sleep. These practices support overall well-being.')
Improving data diversity¶
One main component of good data for instruction-tuning LLMs is diversity. Real-world data can often contain redundancy, because it is repetitive and homogeneous.
The authors of the DEITA paper address the challenge of ensuring data diversity in instruction tuning for LLMs, in order to avoid the pitfalls of data redundancy that can lead to overfitting or poor generalization. They propose an embedding-based method to filter data for diversity. This method, called the Repr Filter, uses embeddings generated by a Llama 1 13B model to represent instruction-response pairs in a vector space. The diversity of a new data sample is assessed based on the cosine distance between its embedding and the embedding of its nearest neighbor in the already-selected dataset. If this distance is larger than a specified threshold, the sample is considered diverse and is added to the selection. This process prioritizes diversity by evaluating each sample's contribution to the diversity of the dataset, until the data selection budget is met. It effectively maintains the diversity of the data used for instruction tuning, as evidenced by DEITA models outperforming or matching state-of-the-art models with significantly less training data. In this implementation of DEITA, we generate the embeddings with the hidden state of the last layer of a Llama 2 model, rather than with a sentence-transformer model, because we found it improved the diversity of the data selection.
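To make the selection logic concrete, here is a minimal, illustrative sketch of a score-first, diversity-aware filter in the spirit of the Repr Filter described above. It is not the DeitaFiltering implementation: the function name, score input and threshold value are assumptions for the example; in DEITA the ranking score combines the evol complexity and evol quality scores.
import numpy as np

def repr_filter_sketch(embeddings, evol_scores, data_budget, diversity_threshold):
    """Greedy selection: walk the pool in descending score order and keep a sample
    only if its cosine distance to every already-selected sample exceeds the threshold."""
    embeddings = np.asarray(embeddings, dtype=float)
    embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit vectors
    order = np.argsort(evol_scores)[::-1]  # best-scored samples first
    selected = []
    for idx in order:
        if len(selected) >= data_budget:
            break
        if selected:
            sims = embeddings[selected] @ embeddings[idx]
            nearest_distance = 1.0 - sims.max()  # cosine distance to nearest selected neighbor
            if nearest_distance <= diversity_threshold:
                continue  # too close to something already kept, not diverse enough
        selected.append(int(idx))
    return selected

# Usage with random data, illustrative only:
rng = np.random.default_rng(0)
kept = repr_filter_sketch(rng.normal(size=(1_000, 64)), rng.random(1_000),
                          data_budget=50, diversity_threshold=0.4)
print(len(kept), "samples selected")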
generate_conversation = ConversationTemplate(
name="generate_conversation",
input_mappings={
"instruction": "evolved_instruction",
"response": "evolved_response",
},
pipeline=pipeline,
)
generate_embeddings = GenerateEmbeddings(
name="generate_embeddings",
llm=TransformersLLM(
model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
device="cuda",
torch_dtype="float16",
),
input_mappings={"text": "conversation"},
input_batch_size=5,
pipeline=pipeline,
)
deita_filtering = DeitaFiltering(name="deita_filtering", pipeline=pipeline)
Building the ⚗ distilabel Pipeline¶
Now we are ready to build a distilabel Pipeline using the DEITA method:
load_data.connect(evol_instruction_complexity)
evol_instruction_complexity.connect(instruction_complexity_scorer)
instruction_complexity_scorer.connect(expand_evolved_instructions)
expand_evolved_instructions.connect(evol_response_quality)
evol_response_quality.connect(response_quality_scorer)
response_quality_scorer.connect(expand_evolved_responses)
expand_evolved_responses.connect(generate_conversation)
generate_conversation.connect(generate_embeddings)
generate_embeddings.connect(deita_filtering)
Now we can run the pipeline. We use the step names to reference them in the pipeline configuration:
distiset = pipeline.run(
parameters={
"load_data": {
"repo_id": "distilabel-internal-testing/instruction-dataset-50",
"split": "train",
},
"evol_instruction_complexity": {
"llm": {"generation_kwargs": {"max_new_tokens": 512, "temperature": 0.7}}
},
"instruction_complexity_scorer": {
"llm": {"generation_kwargs": {"temperature": 0.0}}
},
"evol_response_quality": {
"llm": {"generation_kwargs": {"max_new_tokens": 512, "temperature": 0.7}}
},
"response_quality_scorer": {"llm": {"generation_kwargs": {"temperature": 0.0}}},
"deita_filtering": {"data_budget": 500, "diversity_threshold": 0.04},
},
use_cache=False,
)
We can push the results to the Hugging Face Hub:
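For example, something along these lines; the repository id is a placeholder, and it assumes the resulting Distiset exposes distilabel's push_to_hub method:
distiset.push_to_hub(
    "my-org/deita-selected-instructions",  # placeholder repository id
    private=True,
)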
Results¶
Again, to show the relevance of the EVOL QUALITY method, the authors evaluated on MT-Bench models fine-tuned with different data selections, according to how one defines a good response for a given instruction. Each time, they selected 6k samples according to the quality score:
Credit: Liu et al. (2023)
The score is much better when the data is selected with the EVOL QUALITY method than when it is selected randomly or according to length (treating longer responses as higher quality). Nevertheless, the gap is tighter than the one we saw with the complexity score. We will discuss this strategy in a later section. Still, this strategy seems to improve fine-tuning compared to the baselines, and we are now interested in combining the quality and complexity evaluations with a diversity evaluation to find the right trade-off in our selection process.
Conclusion¶
In conclusion, if you are looking for an efficient method to align an open-source LLM to your business use case under a budget constraint, the solution provided by DEITA is definitely worth trying. This data-centric approach lets you focus on the content of your dataset to obtain the best results, instead of "just" scaling up instruction tuning with more, and presumably lower-quality, data. In short, by automatically scoring instruction-response pairs, the strategy developed here aims to replace the human preference step on which proprietary models such as GPT-4 have been trained. There are improvements we could think of regarding how to select good data, but this opens up a very promising path for instruction-tuning LLMs with lower compute requirements, keeping the whole process intellectually relevant and more sustainable than most other methods. We would be happy to help you align an LLM to your business use case following such a methodology.