SelfInstruct

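Generate instructions from a given input using an LLM.

SelfInstruct is a pre-defined task that takes a number of instructions, criteria for query generation, an application description, and an input. It generates that number of instructions related to the input, following the query generation criteria and the application description. It is based on the SelfInstruct framework from the paper "Self-Instruct: Aligning Language Models with Self-Generated Instructions".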

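Attributes

num_instructions: The number of instructions to generate. Defaults to 5.
criteria_for_query_generation: The criteria for query generation. Defaults to the criteria defined in the paper.
application_description: A description of the AI application to be built with these instructions. Defaults to AI assistant.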
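
As a minimal sketch of overriding the defaults, the three attributes above can be collected in a dictionary and passed as keyword arguments when the task is created. The attribute names come from the list above; the values and the `build_task` helper are hypothetical and only for illustration, and the sketch assumes both text attributes accept plain strings.

```python
# Hypothetical settings: the keys are the documented attributes,
# the values are invented for illustration only.
custom_settings = {
    "num_instructions": 3,
    "criteria_for_query_generation": (
        "Queries should be short, self-contained questions about personal finance."
    ),
    "application_description": (
        "An AI assistant that answers personal finance questions."
    ),
}


def build_task(llm):
    """Create a SelfInstruct task configured with the settings above.

    `build_task` is a hypothetical helper, not part of distilabel. The import
    is deferred so this snippet can be loaded without distilabel installed.
    """
    from distilabel.steps.tasks import SelfInstruct

    return SelfInstruct(llm=llm, **custom_settings)
```

Calling `build_task(llm)` with an LLM such as the one in the example further down should yield a task that asks for three instructions instead of five.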

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[input]
        end
        subgraph New columns
            OCOL0[instructions]
            OCOL1[model_name]
        end
    end

    subgraph SelfInstruct
        StepInput[Input Columns: input]
        StepOutput[Output Columns: instructions, model_name]
    end

    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepInput --> StepOutput

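Inputs

input (str): The input to generate the instructions from. It is also called the seed in the paper.

Outputs

instructions (List[str]): The generated instructions.
model_name (str): The name of the model used to generate the instructions.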
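
Because `instructions` is a list, downstream steps often want one row per instruction. The following is a rough sketch in plain Python, not a distilabel API: `explode_instructions` is a hypothetical helper, and the sample row only mimics the column layout described above. Treating a missing or `None` value in `instructions` as an empty list is an assumption made here for robustness.

```python
from typing import Any, Dict, List


def explode_instructions(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn each output row into one row per generated instruction.

    Hypothetical helper: rows whose `instructions` value is missing or None
    are skipped (an assumption, not documented behavior).
    """
    flat: List[Dict[str, Any]] = []
    for row in rows:
        for instruction in row.get("instructions") or []:
            flat.append(
                {
                    "input": row["input"],
                    "instruction": instruction,
                    "model_name": row["model_name"],
                }
            )
    return flat


# Sample row shaped like the task output; the values are placeholders.
rows = [
    {
        "input": "seed text",
        "model_name": "example-model",
        "instructions": ["instruction 1", "instruction 2"],
    }
]
flat = explode_instructions(rows)
```

Each resulting row keeps the original `input` and `model_name`, so the origin of every instruction stays traceable.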

Examples

Generate instructions based on a given input

from distilabel.steps.tasks import SelfInstruct
from distilabel.models import InferenceEndpointsLLM

self_instruct = SelfInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_instructions=5,  # This is the default value
)

self_instruct.load()

result = next(self_instruct.process([{"input": "instruction"}]))
# result
# [
#     {
#         'input': 'instruction',
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#         'instructions': ["instruction 1", "instruction 2", "instruction 3", "instruction 4", "instruction 5"],
#     }
# ]