Genstruct¶

使用 LLM 从文档中生成一对指令-响应。

Genstruct 是一个预定义的任务，旨在从给定的原始文档（包含标题和内容）生成有效的指令，从而能够从任何原始文本语料库创建新的、部分合成的指令微调数据集。该任务基于 Nous Research 的 Genstruct 7B 模型，该模型灵感来源于 Ada-Instruct 论文。

注意¶

Genstruct 提示，即任务，实际上可以与任何模型一起使用，但最安全/推荐的选择是使用 NousResearch/Genstruct-7B 作为提供给任务的 LLM，因为它专门为此任务进行了训练。

属性¶

_template：一个 Jinja2 模板，用于格式化 LLM 的输入。

输入 & 输出列¶

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[title]
            ICOL1[content]
        end
        subgraph New columns
            OCOL0[user]
            OCOL1[assistant]
            OCOL2[model_name]
        end
    end

    subgraph Genstruct
        StepInput[Input Columns: title, content]
        StepOutput[Output Columns: user, assistant, model_name]
    end

    ICOL0 --> StepInput
    ICOL1 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepInput --> StepOutput

输入¶

title (str)：文档的标题。
content (str)：文档的内容。

输出¶

user (str)：基于文档的用户指令。
assistant (str)：基于用户指令的助手回复。
model_name (str)：用于生成 feedback 和 result 的模型名称。

示例¶

使用标题和内容从原始文档生成指令¶

from distilabel.steps.tasks import Genstruct
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
genstruct = Genstruct(
    llm=InferenceEndpointsLLM(
        model_id="NousResearch/Genstruct-7B",
    ),
)

genstruct.load()

result = next(
    genstruct.process(
        [
            {"title": "common instruction", "content": "content of the document"},
        ]
    )
)
# result
# [
#     {
#         'title': 'An instruction',
#         'content': 'content of the document',
#         'model_name': 'test',
#         'user': 'An instruction',
#         'assistant': 'content of the document',
#     }
# ]

Genstruct¶

注意¶

属性¶

输入 & 输出列¶

输入¶

输出¶

示例¶

使用标题和内容从原始文档生成指令¶

参考¶