StructuredGeneration¶

使用 LLM 为给定的 instruction 生成结构化内容。

StructuredGeneration 是一个预定义的任务，它将 instruction 和 structured_output 定义为输入，并将 generation 定义为输出。此任务用于根据输入指令并按照每个 instruction 的 structured_output 列中提供的 schema 生成结构化内容。model_name 也作为输出的一部分返回，以便增强它。

属性¶

use_system_prompt: 是否在生成中使用系统提示。默认为 True，这意味着如果在输入批次中定义了 system_prompt 列，则将使用 system_prompt，否则将被忽略。

输入和输出列¶

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[instruction]
            ICOL1[structured_output]
        end
        subgraph New columns
            OCOL0[generation]
            OCOL1[model_name]
        end
    end

    subgraph StructuredGeneration
        StepInput[Input Columns: instruction, structured_output]
        StepOutput[Output Columns: generation, model_name]
    end

    ICOL0 --> StepInput
    ICOL1 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepInput --> StepOutput

输入¶

instruction (str): 从中生成结构化内容的指令。
structured_output (Dict[str, Any]): 从中生成结构化内容的 structured_output。它应该是一个 Python 字典，其中键为 format 和 schema，其中 format 应该是 json 或 regex 之一，而 schema 应该是 JSON schema 或 regex 模式，分别。

输出¶

generation (str): 生成的文本，如果可能，与提供的 schema 匹配。
model_name (str): 用于生成文本的模型的名称。

示例¶

从 JSON schema 生成结构化输出¶

from distilabel.steps.tasks import StructuredGeneration
from distilabel.models import InferenceEndpointsLLM

structured_gen = StructuredGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
)

structured_gen.load()

result = next(
    structured_gen.process(
        [
            {
                "instruction": "Create an RPG character",
                "structured_output": {
                    "format": "json",
                    "schema": {
                        "properties": {
                            "name": {
                                "title": "Name",
                                "type": "string"
                            },
                            "description": {
                                "title": "Description",
                                "type": "string"
                            },
                            "role": {
                                "title": "Role",
                                "type": "string"
                            },
                            "weapon": {
                                "title": "Weapon",
                                "type": "string"
                            }
                        },
                        "required": [
                            "name",
                            "description",
                            "role",
                            "weapon"
                        ],
                        "title": "Character",
                        "type": "object"
                    }
                },
            }
        ]
    )
)

从 regex 模式生成结构化输出（仅适用于支持 regex 的 LLM，使用 outlines 的提供程序）¶

from distilabel.steps.tasks import StructuredGeneration
from distilabel.models import InferenceEndpointsLLM

structured_gen = StructuredGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
)

structured_gen.load()

result = next(
    structured_gen.process(
        [
            {
                "instruction": "What's the weather like today in Seattle in Celsius degrees?",
                "structured_output": {
                    "format": "regex",
                    "schema": r"(\d{1,2})°C"
                },

            }
        ]
    )
)