TextGeneration

Text generation with an LLM given a prompt.

TextGeneration is a pre-defined task that allows passing a custom prompt using the Jinja2 syntax. By default, an instruction is expected in the inputs, but using the template and columns attributes one can define a custom prompt and the columns expected from the text. This task should be good enough for tasks that don't need post-processing of the responses generated by the LLM.

Attributes

  • system_prompt: The system prompt to use in the generation. If not provided, then it will check if the input row has a column named system_prompt and use it. If not, then no system prompt will be used. Defaults to None.

  • template: The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, it will assume the text passed is an instruction and construct the appropriate template.

  • columns: A string with the column, or a list of columns, expected in the template. Take a look at the examples below for more information, and at the rendering sketch right after this list. Defaults to instruction.

  • use_system_prompt: DEPRECATED, will be removed in version 1.5.0. Whether to use the system prompt in the generation. Defaults to True, which means that if the column system_prompt is defined within the input batch, then the system_prompt will be used; otherwise, it will be ignored.
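
Since the prompt rendering is plain Jinja2, the interplay between template and columns can be checked outside of distilabel. The snippet below is a minimal sketch (the template and row are made up for illustration) of how a row's columns fill the template to build the user prompt:

from jinja2 import Template

# Hypothetical template and input row, for illustration only.
template = Template("Summarise the following text:\n{{ text }}")
row = {"text": "Distilabel is a framework for synthetic data generation."}

# The rendered string is what ends up as the user prompt sent to the LLM.
print(template.render(**row))
# Summarise the following text:
# Distilabel is a framework for synthetic data generation.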

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[dynamic]
        end
        subgraph New columns
            OCOL0[generation]
            OCOL1[model_name]
        end
    end

    subgraph TextGeneration
        StepInput[Input Columns: dynamic]
        StepOutput[Output Columns: generation, model_name]
    end

    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepInput --> StepOutput

Inputs

  • dynamic (determined by the columns attribute): By default it will be set to instruction. The columns can point either to a str or to a List[str] to be used in the template.

Outputs

  • generation (str): The generated text.

  • model_name (str): The name of the model used to generate the text.

Examples

Generate text from an instruction

from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
text_gen = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    )
)

text_gen.load()

result = next(
    text_gen.process(
        [{"instruction": "your instruction"}]
    )
)
# result
# [
#     {
#         'instruction': 'your instruction',
#         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
#         'generation': 'generation',
#     }
# ]
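
The task can also run inside a pipeline instead of being called standalone. The sketch below assumes the standard distilabel Pipeline API (Pipeline, LoadDataFromDicts and the >> connector); adapt the LLM and the data to your setup:

from distilabel.models import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="text-generation-example") as pipeline:
    # Feed rows containing the default expected column.
    load_data = LoadDataFromDicts(
        data=[{"instruction": "your instruction"}],
    )
    text_gen = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        ),
    )
    load_data >> text_gen

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)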

Use a custom template to generate text

from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM

CUSTOM_TEMPLATE = '''Document:
{{ document }}

Question: {{ question }}

Please provide a clear and concise answer to the question based on the information in the document and your general knowledge:
'''.rstrip()

text_gen = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    system_prompt="You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. If you cannot answer the question based on the given information, state that clearly.",
    template=CUSTOM_TEMPLATE,
    columns=["document", "question"],
)

text_gen.load()

result = next(
    text_gen.process(
        [
            {
                "document": "The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.",
                "question": "What is the main threat to the Great Barrier Reef mentioned in the document?"
            }
        ]
    )
)
# result
# [
#     {
#         'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',
#         'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',
#         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
#         'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',
#     }
# ]
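
Before wiring a custom template into the task, it can help to verify that the declared columns cover every variable the template references. This is a small sketch using Jinja2's own meta utilities (not part of distilabel), reusing the CUSTOM_TEMPLATE defined above:

from jinja2 import Environment, meta

env = Environment()
# The set of variables referenced in the template, e.g. {'document', 'question'}.
template_vars = meta.find_undeclared_variables(env.parse(CUSTOM_TEMPLATE))

columns = ["document", "question"]
missing = template_vars - set(columns)
if missing:
    raise ValueError(f"columns is missing template variables: {missing}")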

Few shot learning with different system prompts

from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM

CUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:

{% for example in examples %}
Example {{ loop.index }}:
Instruction: {{ example }}

{% endfor %}
Now, generate a new instruction in a similar style:
'''.rstrip()

text_gen = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    template=CUSTOM_TEMPLATE,
    columns="examples",
)

text_gen.load()

result = next(
    text_gen.process(
        [
            {
                "examples": ["This is an example", "Another relevant example"],
                "system_prompt": "You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations."
            }
        ]
    )
)
# result
# [
#     {
#         'examples': ['This is an example', 'Another relevant example'],
#         'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',
#         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
#         'generation': 'Disable the firewall on the router',
#     }
# ]
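
Since the system_prompt attribute was not set on the task, the per-row system_prompt column is picked up instead. To inspect the exact user prompt the loop produces, the template can again be rendered by hand; a sketch reusing the CUSTOM_TEMPLATE defined above:

from jinja2 import Template

prompt = Template(CUSTOM_TEMPLATE).render(
    examples=["This is an example", "Another relevant example"]
)
# Prints the numbered "Example 1 / Example 2" blocks followed by
# the final "Now, generate a new instruction" line.
print(prompt)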
