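Tasks for generating and judging with LLMs

Working with Tasks

A Task is a special kind of Step that takes an LLM as a mandatory argument. Like a Step, it is normally used within a Pipeline, but it can also be used standalone.

For example, the most basic task is the TextGeneration task, which generates text based on a given instruction.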

from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
    name="text-generation",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
)
task.load()

next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
#   {
#     "instruction": "What's the capital of Spain?",
#     "generation": "The capital of Spain is Madrid.",
#     "distilabel_metadata": {
#       "raw_output_text-generation": "The capital of Spain is Madrid.",
#       "raw_input_text-generation": [
#         {
#           "role": "user",
#           "content": "What's the capital of Spain?"
#         }
#       ],
#       "statistics_text-generation": {  # (1)
#         "input_tokens": 18,
#         "output_tokens": 8
#       }
#     },
#     "model_name": "meta-llama/Meta-Llama-3-70B-Instruct"
#   }
# ]

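(1) The LLM returns not only the text but also a statistics_{STEP_NAME} field containing statistics related to the generation. When available, it includes at least the input and output token counts.

Note

Step.load() always needs to be called when a step is used standalone. Within a pipeline, this is done automatically during pipeline execution.

As shown above, the TextGeneration task adds a generation based on the instruction.

New in version 1.2.0

Since version 1.2.0, we provide some metadata about the LLM call through distilabel_metadata. This can be disabled by setting the add_raw_output attribute to False when creating the task.

Additionally, since version 1.4.0, the formatted input can also be included, which is helpful when testing custom templates (testing the pipeline with the dry_run method).

Disabling raw input and output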

task = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    add_raw_output=False,
    add_raw_input=False
)

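New in version 1.5.0

Since version 1.5.0, distilabel_metadata includes a new statistics field out of the box. The generation from the LLM contains not only the text but also, when available, statistics associated with it, such as the input and output token counts. This field is generated as statistics_{STEP_NAME} to avoid collisions between different steps in the pipeline, similar to how raw_output_{STEP_NAME} works.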
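The per-step naming makes it straightforward to aggregate token usage after a run. The sketch below is plain Python operating on rows shaped like the output shown earlier. `total_tokens` is a hypothetical helper, not part of distilabel, and it assumes the statistics entry holds integer `input_tokens` and `output_tokens` counts.

```python
from typing import Any, Dict, List


def total_tokens(rows: List[Dict[str, Any]], step_name: str) -> Dict[str, int]:
    """Sum the token counts stored under `statistics_{step_name}` across rows."""
    key = f"statistics_{step_name}"
    totals = {"input_tokens": 0, "output_tokens": 0}
    for row in rows:
        # Rows without metadata or statistics contribute nothing to the totals.
        stats = row.get("distilabel_metadata", {}).get(key, {})
        for name in totals:
            totals[name] += stats.get(name, 0)
    return totals


rows = [
    {
        "generation": "The capital of Spain is Madrid.",
        "distilabel_metadata": {
            "statistics_text-generation": {"input_tokens": 18, "output_tokens": 8}
        },
    },
    {"generation": "No statistics here.", "distilabel_metadata": {}},
]

print(total_tokens(rows, "text-generation"))
# {'input_tokens': 18, 'output_tokens': 8}
```

Because the key embeds the step name, the same helper can be pointed at any step in a pipeline without the counts from different steps being mixed.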

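Task.print

New in version 1.4.0

The Task.print method was added in version 1.4.0.

Tasks include a convenient method to show what the prompt formatted for the LLM looks like. Let's see an example with UltraFeedback, but it applies to any other Task.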

from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM

uf = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
)
uf.load()
uf.print()

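The result is a rendered prompt containing the system prompt (if the task has one) and the user prompt, displayed as rich text (it looks exactly the same in a Jupyter notebook).

If you want to test with a custom input, you can pass an example to the task's format_input method (or build it yourself, depending on the task) and pass the result to the print method so that it shows your example: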

uf.print(
    uf.format_input({"instruction": "test", "generations": ["1", "2"]})
)

Using a DummyLLM to avoid loading an LLM

If you don't want to load an LLM just to render the template, you can create a dummy one like those we use for testing.

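from typing import TYPE_CHECKING, Any

from distilabel.models import LLM
from distilabel.models.mixins import MagpieChatTemplateMixin

if TYPE_CHECKING:
    # Assumed location of these type aliases, alongside StepColumns and ChatType.
    from distilabel.typing import FormattedInput, GenerateOutput


# Subclass the synchronous LLM base class, which is what the import provides
# and what the synchronous generate method below matches.
class DummyLLM(LLM, MagpieChatTemplateMixin):
    structured_output: Any = None
    magpie_pre_query_template: str = "llama3"

    def load(self) -> None:
        pass

    @property
    def model_name(self) -> str:
        return "test"

    def generate(
        self, input: "FormattedInput", num_generations: int = 1
    ) -> "GenerateOutput":
        # Placeholder output; a real LLM would return model generations here.
        return ["output" for _ in range(num_generations)]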

You can use this LLM just like any other one to load your task and call print:

uf = UltraFeedback(llm=DummyLLM())
uf.load()
uf.print()

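Note

The print method is available by default when creating a custom task, but it only covers the most common input scenarios. If you test your new task and find that it does not work as expected (for example, if your task takes an input made up of a list of texts instead of a single text), you should override the _sample_input method. You can check the UltraFeedback source code for more information.

Specifying the number of generations and grouping generations

All Tasks have a num_generations attribute that defines the number of generations we want per input. We can update the example above to generate 3 completions per input: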

from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
    name="text-generation",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    num_generations=3,
)
task.load()

next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
#     {
#         'instruction': "What's the capital of Spain?",
#         'generation': 'The capital of Spain is Madrid.',
#         'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
#         'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
#     },
#     {
#         'instruction': "What's the capital of Spain?",
#         'generation': 'The capital of Spain is Madrid.',
#         'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
#         'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
#     },
#     {
#         'instruction': "What's the capital of Spain?",
#         'generation': 'The capital of Spain is Madrid.',
#         'distilabel_metadata': {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
#         'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
#     }
# ]

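In addition, we might want to group the generations into a single output row, since a downstream step may expect one row containing multiple generations. We can achieve this by setting the group_generations attribute to True: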

from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import TextGeneration

task = TextGeneration(
    name="text-generation",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    num_generations=3,
    group_generations=True
)
task.load()

next(task.process([{"instruction": "What's the capital of Spain?"}]))
# [
#     {
#         'instruction': "What's the capital of Spain?",
#         'generation': ['The capital of Spain is Madrid.', 'The capital of Spain is Madrid.', 'The capital of Spain is Madrid.'],
#         'distilabel_metadata': [
#             {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
#             {'raw_output_text-generation': 'The capital of Spain is Madrid.'},
#             {'raw_output_text-generation': 'The capital of Spain is Madrid.'}
#         ],
#         'model_name': 'meta-llama/Meta-Llama-3-70B-Instruct'
#     }
# ]
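To make the difference between the two output shapes concrete, here is a minimal plain-Python sketch of the row transformation that grouping implies. `group_rows` and its `shared` parameter are hypothetical names used only for illustration. This is not how distilabel implements `group_generations`; it only mirrors the shapes shown in the two examples above.

```python
from typing import Any, Dict, List


def group_rows(rows: List[Dict[str, Any]], shared: List[str]) -> Dict[str, Any]:
    """Collapse rows that come from the same input into a single row.

    Columns listed in `shared` keep the value from the first row; every other
    column becomes a list with one entry per generation.
    """
    grouped: Dict[str, Any] = {col: rows[0][col] for col in shared}
    for col in rows[0]:
        if col not in shared:
            grouped[col] = [row[col] for row in rows]
    return grouped


# Three ungrouped rows, as produced with num_generations=3.
ungrouped = [
    {
        "instruction": "What's the capital of Spain?",
        "generation": f"Answer {i}",
        "distilabel_metadata": {"raw_output_text-generation": f"Answer {i}"},
        "model_name": "meta-llama/Meta-Llama-3-70B-Instruct",
    }
    for i in range(3)
]

row = group_rows(ungrouped, shared=["instruction", "model_name"])
print(row["generation"])
# ['Answer 0', 'Answer 1', 'Answer 2']
```

The input columns and `model_name` stay scalar, while `generation` and `distilabel_metadata` become lists of equal length, which is the shape a downstream step that compares several generations would consume.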

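Defining custom Tasks

We can define a custom task by creating a new subclass of Task and defining the following:

inputs: a property that returns either a list of strings with the names of the required input fields, or a dictionary whose keys are the column names and whose values are booleans indicating whether each column is required.
format_input: a method that receives a dictionary with the input data and returns a ChatType following the chat-completion OpenAI message format.
outputs: a property that returns either a list of strings with the names of the output fields, or a dictionary whose keys are the column names and whose values are booleans indicating whether each column is required. This property should always include model_name as one of the outputs, since it is automatically injected from the LLM.
format_output: a method that receives the output from the LLM and, optionally, the input data (which can be useful for building the output in some scenarios), and returns a dictionary with the output data formatted as needed, i.e. with the values for the columns in outputs. Note that there is no need to include model_name in the output.

Inheriting from Task / Using the @task decorator

When creating a custom task by inheriting from the Task class, we can also override the Task.process method to define more complex processing logic involving an LLM, because the default method only formats the input, calls LLM.generate once, and then formats the output. For example, the EvolInstruct task overrides this method to call LLM.generate multiple times (once per evolution).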

from typing import Any, Dict, List, Union, TYPE_CHECKING

from distilabel.steps.tasks import Task

if TYPE_CHECKING:
    from distilabel.typing import StepColumns, ChatType


class MyCustomTask(Task):
    @property
    def inputs(self) -> "StepColumns":
        return ["input_field"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        return [
            {
                "role": "user",
                "content": input["input_field"],
            },
        ]

    @property
    def outputs(self) -> "StepColumns":
        return ["output_field", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        return {"output_field": output}
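The default flow described above (format the input, call the model once, format the output) can be sketched in plain Python without any library code. Everything in this block, including `run_once` and `fake_generate`, is a hypothetical stand-in meant to show how the four pieces fit together. It is not the actual `Task.process` implementation.

```python
from typing import Any, Callable, Dict, List, Optional

ChatMessages = List[Dict[str, str]]


def run_once(
    rows: List[Dict[str, Any]],
    format_input: Callable[[Dict[str, Any]], ChatMessages],
    generate: Callable[[ChatMessages], Optional[str]],
    format_output: Callable[[Optional[str], Dict[str, Any]], Dict[str, Any]],
    model_name: str,
) -> List[Dict[str, Any]]:
    """Format each row, call the model once, and merge the formatted output."""
    results = []
    for row in rows:
        messages = format_input(row)
        raw = generate(messages)
        # model_name is added here, which is why format_output does not return it.
        results.append({**row, **format_output(raw, row), "model_name": model_name})
    return results


def fake_generate(messages: ChatMessages) -> str:
    # Stand-in for an LLM call: echoes the last user message in upper case.
    return messages[-1]["content"].upper()


out = run_once(
    rows=[{"input_field": "hello"}],
    format_input=lambda row: [{"role": "user", "content": row["input_field"]}],
    generate=fake_generate,
    format_output=lambda output, row: {"output_field": output},
    model_name="fake-model",
)
print(out)
# [{'input_field': 'hello', 'output_field': 'HELLO', 'model_name': 'fake-model'}]
```

A task such as EvolInstruct departs from this shape by calling the model several times per row, which is the situation where overriding the processing method becomes necessary.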

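If your task only needs a system prompt, a user message template, and a way to format the output returned by the LLM, you can use the @task decorator to avoid writing too much boilerplate code.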

from typing import Any, Dict, Union
from distilabel.steps.tasks import task


@task(inputs=["input_field"], outputs=["output_field"])
def MyCustomTask(output: Union[str, None], input: Union[Dict[str, Any], None] = None) -> Dict[str, Any]:
    """
    ---
    system_prompt: |
        My custom system prompt

    user_message_template: |
        My custom user message template: {input_field}
    ---
    """
    # Format the `LLM` output here
    return {"output_field": output}

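Warning

Most Tasks reuse the Task.process method to process the generations, but if a new Task defines a custom process method, as is the case for Magpie, it has to handle the statistics returned by the LLM itself.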
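What handling the statistics involves depends on the shape of the LLM output in your installed version. The sketch below assumes each result is a dictionary with a `generations` list and a `statistics` mapping of per-generation lists. That shape and the `attach_statistics` helper are assumptions made for illustration and should be checked against the library source before relying on them.

```python
from typing import Any, Dict


def attach_statistics(
    row: Dict[str, Any], llm_output: Dict[str, Any], step_name: str, index: int = 0
) -> Dict[str, Any]:
    """Return a copy of `row` with one generation and its statistics attached."""
    # Assumed shape: {"generations": [...], "statistics": {"input_tokens": [...], ...}}
    stats = {
        name: values[index]
        for name, values in llm_output.get("statistics", {}).items()
    }
    metadata = dict(row.get("distilabel_metadata", {}))
    metadata[f"statistics_{step_name}"] = stats
    return {
        **row,
        "generation": llm_output["generations"][index],
        "distilabel_metadata": metadata,
    }


llm_output = {
    "generations": ["The capital of Spain is Madrid."],
    "statistics": {"input_tokens": [18], "output_tokens": [8]},
}
row = attach_statistics(
    {"instruction": "What's the capital of Spain?"}, llm_output, "my-task"
)
print(row["distilabel_metadata"])
# {'statistics_my-task': {'input_tokens': 18, 'output_tokens': 8}}
```

Storing the values under a key that includes the step name keeps a custom process method consistent with the metadata layout produced by the default one.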