MagpieGenerator¶
使用 Magpie 生成指令或对话的生成器任务。
Magpie 是一种巧妙的方法,它允许在没有种子数据或特定系统提示的情况下生成用户指令,这归功于指令微调 LLM 的自回归能力。由于它们是使用由用户消息和期望的助手输出组成的聊天模板进行微调的,因此指令微调的 LLM 学习到在预查询或预指令令牌之后是指令。如果将这些预查询令牌发送到 LLM 而没有任何用户消息,则 LLM 将继续生成令牌,就像它是用户一样。此技巧允许从指令微调的 LLM 中“提取”指令。在此指令生成后,可以再次将其发送到 LLM 以生成本次助手响应。此过程可以重复 N 次,从而构建多轮对话。此方法在论文“Magpie:通过使用无提示对齐的 LLM 从头开始合成对齐数据”中进行了描述。
属性¶
-
n_turns: 生成的对话将具有的轮数。默认为
1
。 -
end_with_user: 对话是否应以用户消息结束。默认为
False
。 -
include_system_prompt: 是否包含在生成的对话中使用的系统提示。默认为
False
。 -
only_instruction: 是否仅生成指令。如果此参数为
True
,则将忽略n_turns
。默认为False
。 -
system_prompt: 可选的系统提示,或从中随机选择一个系统提示的系统提示列表,或从中随机选择一个系统提示的系统提示字典,或具有其被选择概率的系统提示字典。随机系统提示将按输入/输出批次选择。此系统提示可用于指导指令 LLM 的生成,并引导其生成特定主题的指令。默认为
None
。 -
num_rows: 要生成的行数。
运行时参数¶
-
n_turns: 生成的对话将具有的轮数。默认为
1
。 -
end_with_user: 对话是否应以用户消息结束。默认为
False
。 -
include_system_prompt: 是否包含在生成的对话中使用的系统提示。默认为
False
。 -
only_instruction: 是否仅生成指令。如果此参数为
True
,则将忽略n_turns
。默认为False
。 -
system_prompt: 可选的系统提示,或从中随机选择一个系统提示的系统提示列表,或从中随机选择一个系统提示的系统提示字典,或具有其被选择概率的系统提示字典。随机系统提示将按输入/输出批次选择。此系统提示可用于指导指令 LLM 的生成,并引导其生成特定主题的指令。
-
num_rows: 要生成的行数。
输入 & 输出列¶
graph TD
subgraph Dataset
subgraph New columns
OCOL0[conversation]
OCOL1[instruction]
OCOL2[response]
OCOL3[system_prompt_key]
OCOL4[model_name]
end
end
subgraph MagpieGenerator
StepOutput[Output Columns: conversation, instruction, response, system_prompt_key, model_name]
end
StepOutput --> OCOL0
StepOutput --> OCOL1
StepOutput --> OCOL2
StepOutput --> OCOL3
StepOutput --> OCOL4
输出¶
-
conversation (
ChatType
): 生成的对话,它是包含角色和消息的聊天项目列表。 -
instruction (
str
): 如果only_instruction=True
,则为生成的指令。 -
response (
str
): 如果n_turns==1
,则为生成的回复。 -
system_prompt_key (
str
, 可选): 用于生成对话或指令的系统提示的键。仅当system_prompt
是字典时。 -
model_name (
str
): 用于生成conversation
或instruction
的模型名称。
示例¶
使用 Llama 3 8B Instruct 和 TransformersLLM 生成指令¶
from distilabel.models import TransformersLLM
from distilabel.steps.tasks import MagpieGenerator
generator = MagpieGenerator(
llm=TransformersLLM(
model="meta-llama/Meta-Llama-3-8B-Instruct",
magpie_pre_query_template="llama3",
generation_kwargs={
"temperature": 1.0,
"max_new_tokens": 256,
},
device="mps",
),
only_instruction=True,
num_rows=5,
)
generator.load()
result = next(generator.process())
# (
# [
# {"instruction": "I've just bought a new phone and I're excited to start using it."},
# {"instruction": "What are the most common types of companies that use digital signage?"}
# ],
# True
# )
使用 Llama 3 8B Instruct 和 TransformersLLM 生成对话¶
from distilabel.models import TransformersLLM
from distilabel.steps.tasks import MagpieGenerator
generator = MagpieGenerator(
llm=TransformersLLM(
model="meta-llama/Meta-Llama-3-8B-Instruct",
magpie_pre_query_template="llama3",
generation_kwargs={
"temperature": 1.0,
"max_new_tokens": 64,
},
device="mps",
),
n_turns=3,
num_rows=5,
)
generator.load()
result = next(generator.process())
# (
# [
# {
# 'conversation': [
# {
# 'role': 'system',
# 'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
# insightful responses to help the user with their queries.'
# },
# {'role': 'user', 'content': "I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?"},
# {
# 'role': 'assistant',
# 'content': "Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,
# let's break down the basics. First, we need to identify your goals and target audience. What do"
# },
# {
# 'role': 'user',
# 'content': "Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main
# expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating"
# },
# {
# 'role': 'assistant',
# 'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or
# agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'
# }
# ]
# },
# {
# 'conversation': [
# {
# 'role': 'system',
# 'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
# insightful responses to help the user with their queries.'
# },
# {'role': 'user', 'content': "I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!"},
# {
# 'role': 'assistant',
# 'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.
# **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'
# },
# {
# 'role': 'user',
# 'content': 'Let me stop you there. Let's explore this "purpose" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? For example, if I're primarily using my
# laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'
# },
# {
# 'role': 'assistant',
# 'content': "Understanding your purpose can greatly impact the type of device you'll need. **General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly
# option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* "
# }
# ]
# }
# ],
# True
# )
使用带概率的系统提示生成¶
from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import MagpieGenerator
magpie = MagpieGenerator(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-8B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
magpie_pre_query_template="llama3",
generation_kwargs={
"temperature": 0.8,
"max_new_tokens": 256,
},
),
n_turns=2,
system_prompt={
"math": ("You're an expert AI assistant.", 0.8),
"writing": ("You're an expert writing assistant.", 0.2),
},
)
magpie.load()
result = next(magpie.process())