Structured data generation

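Distilabel integrates with libraries for structured text generation, i.e. guiding the LLM to produce output that follows a JSON schema, a regular expression, and so on.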

Outlines

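Distilabel integrates outlines within some LLM subclasses. At the moment, the following LLMs support the outlines integration: TransformersLLM, vLLM, and LlamaCppLLM, so that anyone can generate structured output as JSON or as text that matches a regular expression.

The LLM has an argument named structured_output¹ that determines how structured output is generated. Let's see an example using LlamaCppLLM.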

Note

For the outlines integration to work, you may need to install the corresponding dependencies:

pip install distilabel[outlines]

JSON

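We will start with a JSON example, where we first define a pydantic.BaseModel schema to guide the generation of the structured output.

Note

Take a look at StructuredOutputType to see the expected format of the structured_output dictionary.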

from pydantic import BaseModel

class User(BaseModel):
    name: str
    last_name: str
    id: int
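
To get a feel for what such a model describes, the following minimal sketch prints the JSON schema that pydantic derives from User. It assumes pydantic v2 (where model_json_schema is available) and only illustrates the schema itself; how distilabel or outlines consume it internally is not shown here.

```python
import json

from pydantic import BaseModel


class User(BaseModel):
    name: str
    last_name: str
    id: int


# Derive the JSON schema from the model (pydantic v2 API).
schema = User.model_json_schema()

# Pretty-print the schema to inspect field names, types and required keys.
print(json.dumps(schema, indent=2))
```

All three fields appear under "properties" and are listed as "required", so a conforming output has to provide each of them.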

We then pass that schema to the structured_output argument of the LLM.

from distilabel.models import LlamaCppLLM

llm = LlamaCppLLM(
    model_path="./openhermes-2.5-mistral-7b.Q4_K_M.gguf",  # (1)
    n_gpu_layers=-1,
    n_ctx=1024,
    structured_output={"format": "json", "schema": User},
)
llm.load()

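We previously downloaded a GGUF model, i.e. a llama.cpp-compatible model, from the Hugging Face Hub using curl², but any other model can be used instead by updating the model_path argument.

We are now ready to pass our instruction as usual: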

import json

result = llm.generate(
    [[{"role": "user", "content": "Create a user profile for the following marathon"}]],
    max_new_tokens=50
)

data = json.loads(result[0][0])
data
# {'name': 'Kathy', 'last_name': 'Smith', 'id': 4539210}
User(**data)
# User(name='Kathy', last_name='Smith', id=4539210)

We get back a JSON-formatted string, which we can parse into a Python dictionary with json.loads and then validate with User, which is a pydantic.BaseModel subclass.
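
As a hedged illustration of the validation step, the sketch below parses and validates in one call. It assumes pydantic v2 (for model_validate_json) and uses hard-coded strings as stand-ins for real LLM output. The truncated string is a hypothetical case, for instance a generation that stops at the token limit before the JSON object is closed.

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    name: str
    last_name: str
    id: int


# Stand-in for the string returned by the LLM (not real model output).
raw = '{"name": "Kathy", "last_name": "Smith", "id": 4539210}'

# Parse the JSON and validate it against the model in a single step.
user = User.model_validate_json(raw)
print(user)

# Hypothetical truncated output: the JSON object is never closed.
truncated = '{"name": "Kathy", "last_name": "Smi'
try:
    User.model_validate_json(truncated)
    truncated_ok = True
except ValidationError:
    # pydantic reports invalid JSON as a ValidationError as well.
    truncated_ok = False
print("truncated output accepted:", truncated_ok)
```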

Regex

The following example shows text generation whose output is constrained to a regular expression:

pattern = r"<name>(.*?)</name>.*?<grade>(.*?)</grade>"  # the same pattern for re.compile

llm = LlamaCppLLM(
    model_path="./openhermes-2.5-mistral-7b.Q4_K_M.gguf",  # same GGUF file as in the JSON example
    n_gpu_layers=-1,
    n_ctx=1024,
    structured_output={"format": "regex", "schema": pattern},
)
llm.load()

result = llm.generate(
    [
        [
            {"role": "system", "content": "You are a Simpsons fan who loves assigning grades from A to E, where A is the best and E is the worst."},
            {"role": "user", "content": "What's up with Homer Simpson?"}
        ]
    ],
    max_new_tokens=200
)

We can check the output by parsing it with the same pattern we required from the LLM.

import re
match = re.search(pattern, result[0][0])

if match:
    name = match.group(1)
    grade = match.group(2)
    print(f"Name: {name}, Grade: {grade}")
# Name: Homer Simpson, Grade: C+
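
The parsing step can also be tried without loading a model. The sketch below is only an illustration: parse_grade is a hypothetical helper, the sample strings are hand-written rather than real LLM output, and it uses fullmatch instead of search on the assumption that the whole generation is meant to match the pattern.

```python
import re
from typing import Optional, Tuple

# Same pattern as the one passed to structured_output above.
pattern = r"<name>(.*?)</name>.*?<grade>(.*?)</grade>"
compiled = re.compile(pattern)


def parse_grade(text: str) -> Optional[Tuple[str, str]]:
    """Return (name, grade) if the whole text matches the pattern, else None."""
    match = compiled.fullmatch(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hand-written sample in the expected shape (not real model output).
sample = "<name>Homer Simpson</name> loves donuts <grade>C+</grade>"
print(parse_grade(sample))

# Text that does not follow the pattern yields None instead of raising.
print(parse_grade("Homer gets a C+"))
```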

These are some simple examples, but they show the kinds of options this opens up.

Tip

A full pipeline example can be seen in the following script: examples/structured_generation_with_outlines.py

Instructor

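For other LLM providers behind an API, there is no direct way to access the internal logit processor as outlines does, but thanks to instructor we can still generate structured output from LLM providers based on pydantic.BaseModel objects. We have integrated instructor to work with AsyncLLM.

Note

For the instructor integration to work, you may need to install the corresponding dependencies: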

pip install distilabel[instructor]

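Note

Take a look at InstructorStructuredOutputType to see the expected format of the structured_output dictionary.

The following is the same example shown in the JSON part of the outlines section, for comparison purposes.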

from pydantic import BaseModel

class User(BaseModel):
    name: str
    last_name: str
    id: int

We then pass that schema to the structured_output argument of the LLM:

Note

In this example we use Meta Llama 3.1 8B Instruct; keep in mind that not all models support structured output.

from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    structured_output={"schema": User}
)
llm.load()

We are now ready to pass our instruction as usual:

import json

result = llm.generate(
    [[{"role": "user", "content": "Create a user profile for the following marathon"}]],
    max_new_tokens=256
)

data = json.loads(result[0][0])
data
# {'name': 'John', 'last_name': 'Doe', 'id': 12345}
User(**data)
# User(name='John', last_name='Doe', id=12345)

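We get back a JSON-formatted string, which we can parse into a Python dictionary with json.loads and then validate with User, which is a pydantic.BaseModel subclass.

Tip

A full pipeline example can be seen in the following script: examples/structured_generation_with_instructor.py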

OpenAI JSON

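OpenAI offers a JSON mode to deal with structured output through its API, so let's see how to use it. JSON mode instructs the model to always return a JSON object that follows the requested instruction.

Warning

Keep in mind that for this to work you must instruct the model in some way to generate JSON, either in the system message or in the instruction, as noted in the API reference.

Contrary to what we get via outlines, JSON mode does not guarantee that the output matches any specific schema, only that it is valid JSON and parses without errors. More information can be found in the OpenAI documentation.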
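
Because the shape of the output is not guaranteed, it can be worth validating it after the fact. The following is a minimal sketch of one way to do that, assuming pydantic v2; parse_user is a hypothetical helper and both input strings are hand-written stand-ins for API responses, not real model output.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class User(BaseModel):
    name: str
    last_name: str
    id: int


def parse_user(raw: str) -> Optional[User]:
    """Validate a JSON string against User, returning None on a mismatch."""
    try:
        return User.model_validate_json(raw)
    except ValidationError as err:
        # Report which fields were missing or had the wrong type.
        for error in err.errors():
            print(error["loc"], error["msg"])
        return None


# Valid JSON that also matches the expected schema.
good = parse_user('{"name": "John", "last_name": "Doe", "id": 12345}')

# Valid JSON with unexpected keys: it parses, but fails validation.
bad = parse_user('{"first_name": "John", "surname": "Doe"}')
```

Returning None keeps the caller in control of what happens next, for example retrying the request or dropping the row.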

Other than mentioning JSON in the prompt, to ensure the model generates parseable JSON we can pass the argument response_format="json"³:

from distilabel.models import OpenAILLM
llm = OpenAILLM(model="gpt-4-turbo", api_key="api.key")
llm.generate(..., response_format="json")

¹ You can check the variable type by importing it from:

from distilabel.steps.tasks.structured_outputs.outlines import StructuredOutputType


² Download the model with curl:

curl -L -o ~/Downloads/openhermes-2.5-mistral-7b.Q4_K_M.gguf https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF/resolve/main/openhermes-2.5-mistral-7b.Q4_K_M.gguf


³ Keep in mind that to interact with this response_format argument in a pipeline, you have to pass it via generation_kwargs:

# Assuming a pipeline is already defined, and we have a task using OpenAILLM called `task_with_openai`:
pipeline.run(
    parameters={
        "task_with_openai": {
            "llm": {
                "generation_kwargs": {
                    "response_format": "json"
                }
            }
        }
    }
)
