APIGenGenerator¶
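Generates queries and answers for the given functions in JSON format.
The APIGenGenerator is inspired by the APIGen pipeline, which was designed to generate verifiable and diverse function-calling datasets. The task generates a set of diverse queries and the corresponding answers, in JSON format, for the given functions.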
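Attributes¶
- system_prompt: The system prompt that guides the generation of queries and answers.
- use_tools: Whether to use the tools available in the prompt to generate the queries and answers. If tools are given in the input, they are added to the prompt.
- number: The number of queries to generate. It can be an integer, a list from which one number is chosen at random, or a dictionary that maps each number of queries to its probability. For example, number=1, number=[1, 2, 3], and number={1: 0.5, 2: 0.3, 3: 0.2} are all valid inputs. It corresponds to the number of queries to generate in parallel.
- use_default_structured_output: Whether to use the default structured output.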
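The three accepted shapes of number can be illustrated with a small sampler. This is a minimal sketch of the behaviour described in the list above, not distilabel's own code: the helper name pick_number and the NumberSpec alias are hypothetical, and the use of random.choices for the dictionary case is an assumption about how the probabilities would be applied.

```python
import random
from typing import Dict, List, Optional, Union

# Hypothetical alias for the three shapes the `number` attribute accepts.
NumberSpec = Union[int, List[int], Dict[int, float]]


def pick_number(number: NumberSpec, rng: Optional[random.Random] = None) -> int:
    """Resolve a `number` setting to one concrete query count (hypothetical helper)."""
    rng = rng or random.Random()
    if isinstance(number, int):
        # A plain integer is used as-is.
        return number
    if isinstance(number, dict):
        # Keys are candidate counts, values are their probabilities.
        counts = list(number.keys())
        weights = list(number.values())
        return rng.choices(counts, weights=weights, k=1)[0]
    # A list: every entry is equally likely.
    return rng.choice(list(number))


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(pick_number(1, rng))
print(pick_number([1, 2, 3], rng))
print(pick_number({1: 0.5, 2: 0.3, 3: 0.2}, rng))
```

With the dictionary form, counts with a higher probability are drawn more often across rows, which is one way to bias a dataset toward single-call queries while still including some multi-call ones.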
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph Columns
ICOL0[examples]
ICOL1[func_name]
ICOL2[func_desc]
ICOL3[tools]
end
subgraph New columns
OCOL0[query]
OCOL1[answers]
end
end
subgraph APIGenGenerator
StepInput[Input Columns: examples, func_name, func_desc, tools]
StepOutput[Output Columns: query, answers]
end
ICOL0 --> StepInput
ICOL1 --> StepInput
ICOL2 --> StepInput
ICOL3 --> StepInput
StepOutput --> OCOL0
StepOutput --> OCOL1
StepInput --> StepOutput
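Inputs¶
- examples (str): Examples used as few-shot demonstrations to guide the model.
- func_name (str): Name of the function to generate queries and answers for.
- func_desc (str): Description of what the function should do.
- tools (str): JSON-formatted string containing the tool representation of the function.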
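As a rough illustration of how one input row could be assembled, the sketch below renders a query/answer pair in the QUERY:/ANSWER: layout used by the examples on this page and serializes a tool description to a JSON string. The helper format_examples is hypothetical, joining several examples with a blank line is an assumption, and the tool schema (a "function" entry with name, description, and parameters) is only an illustrative layout, not one confirmed by this page.

```python
import json
from typing import Any, Dict, List


def format_examples(pairs: List[Dict[str, Any]]) -> str:
    """Render query/answer pairs as few-shot text (hypothetical helper)."""
    blocks = []
    for pair in pairs:
        answers = json.dumps(pair["answers"])
        blocks.append(f"QUERY:\n{pair['query']}\nANSWER:\n{answers}")
    # Assumption: separate examples are joined with a blank line.
    return "\n\n".join(blocks)


pairs = [
    {
        "query": "What is the binary sum of 10010 and 11101?",
        "answers": [
            {"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}
        ],
    }
]

# Illustrative tool layout only; the exact schema expected is an assumption.
tool = {
    "type": "function",
    "function": {
        "name": "getrandommovie",
        "description": "Returns a list of random movies from a database by calling an external API.",
        "parameters": {},
    },
}

row = {
    "examples": format_examples(pairs),
    "func_name": "getrandommovie",
    "func_desc": tool["function"]["description"],
    "tools": json.dumps([tool]),  # every input column is a plain string
}
print(row["examples"])
```

Building the examples string from structured pairs avoids hand-escaping the quotes and line breaks in the few-shot text.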
Outputs¶
- query (str): The list of queries.
- answers (str): JSON-formatted string with the list of answers, each holding the information to pass to the function as a dictionary.
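Downstream code usually needs the answers column as Python objects again. The following is a minimal sketch, assuming the value arrives as a JSON string holding a list of calls with "name" and "arguments" keys, as in the examples on this page. The helper parse_answers is hypothetical and is not part of distilabel.

```python
import json
from typing import Any, Dict, List


def parse_answers(answers: str, func_name: str) -> List[Dict[str, Any]]:
    """Decode an `answers` JSON string and keep well-formed calls to `func_name`."""
    calls = json.loads(answers)
    if not isinstance(calls, list):
        raise ValueError("expected a JSON list of function calls")
    valid = []
    for call in calls:
        # Keep only dictionaries that name the target function and carry
        # their arguments as a dictionary.
        if (
            isinstance(call, dict)
            and call.get("name") == func_name
            and isinstance(call.get("arguments"), dict)
        ):
            valid.append(call)
    return valid


raw = '[{"name": "getrandommovie", "arguments": {}}, {"name": "other", "arguments": {}}]'
print(parse_answers(raw, "getrandommovie"))
```

A check like this is a cheap first filter before any heavier verification of the generated calls.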
Examples¶
Generate without structured output (original implementation)¶
from distilabel.steps.tasks import APIGenGenerator
from distilabel.models import InferenceEndpointsLLM

# LLM served through Hugging Face Inference Endpoints.
llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 1024,
    },
)
# Turn off the default structured output (original implementation).
apigen = APIGenGenerator(
    use_default_structured_output=False,
    llm=llm,
)
apigen.load()

res = next(
    apigen.process(
        [
            {
                # Few-shot example; line breaks are written as \n escapes.
                "examples": 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
                "func_name": "getrandommovie",
                "func_desc": "Returns a list of random movies from a database by calling an external API.",
            }
        ]
    )
)
res
# [{'examples': 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
# 'number': 1,
# 'func_name': 'getrandommovie',
# 'func_desc': 'Returns a list of random movies from a database by calling an external API.',
# 'queries': ['I want to watch a movie tonight, can you recommend a random one from your database?',
# 'Give me 5 random movie suggestions from your database to plan my weekend.'],
# 'answers': [[{'name': 'getrandommovie', 'arguments': {}}],
# [{'name': 'getrandommovie', 'arguments': {}},
# {'name': 'getrandommovie', 'arguments': {}},
# {'name': 'getrandommovie', 'arguments': {}},
# {'name': 'getrandommovie', 'arguments': {}},
# {'name': 'getrandommovie', 'arguments': {}}]],
# 'raw_input_api_gen_generator_0': [{'role': 'system',
# 'content': "You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively"},
# {'role': 'user',
# 'content': 'Here are examples of queries and the corresponding answers for similar functions:\nQUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]\nNote that the query could be interpreted as a combination of several independent requests.\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\nThe detailed function description is the following:\nReturns a list of random movies from a database by calling an external API.\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:
Generate with structured output¶
from distilabel.steps.tasks import APIGenGenerator
from distilabel.models import InferenceEndpointsLLM

# LLM served through Hugging Face Inference Endpoints.
llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tokenizer="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 1024,
    },
)
# Keep the default structured output enabled.
apigen = APIGenGenerator(
    use_default_structured_output=True,
    llm=llm,
)
apigen.load()

res_struct = next(
    apigen.process(
        [
            {
                # Few-shot example; line breaks are written as \n escapes.
                "examples": 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
                "func_name": "getrandommovie",
                "func_desc": "Returns a list of random movies from a database by calling an external API.",
            }
        ]
    )
)
res_struct
# [{'examples': 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
# 'number': 1,
# 'func_name': 'getrandommovie',
# 'func_desc': 'Returns a list of random movies from a database by calling an external API.',
# 'queries': ["I'm bored and want to watch a movie. Can you suggest some movies?",
# "My family and I are planning a movie night. We can't decide on what to watch. Can you suggest some random movie titles?"],
# 'answers': [[{'arguments': {}, 'name': 'getrandommovie'}],
# [{'arguments': {}, 'name': 'getrandommovie'}]],
# 'raw_input_api_gen_generator_0': [{'role': 'system',
# 'content': "You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively"},
# {'role': 'user',
# 'content': 'Here are examples of queries and the corresponding answers for similar functions:\nQUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]\nNote that the query could be interpreted as a combination of several independent requests.\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\nThe detailed function description is the following:\nReturns a list of random movies from a database by calling an external API.\nNow please generate 2 diverse query and answer pairs following the above format.'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]