
APIGenGenerator

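Generate queries and answers for the given functions in JSON format.

APIGenGenerator is inspired by the APIGen pipeline, which was designed to generate verifiable and diverse function-calling datasets. The task generates a set of diverse queries and the corresponding answers, in JSON format, for the given functions.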

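Attributes

  • system_prompt: The system prompt that guides the model in generating queries and answers.

  • use_tools: Whether to use the tools available in the prompt to generate the queries and answers. If tools are given in the input, they are added to the prompt.

  • number: The number of queries to generate. It can be an integer, a list from which one value is chosen at random, or a dictionary that maps each number of queries to its probability. For example, number=1, number=[1, 2, 3], and number={1: 0.5, 2: 0.3, 3: 0.2} are all valid inputs. It corresponds to the number of queries to generate in parallel.

  • use_default_structured_output: Whether to use the default structured output.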
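The three accepted forms of `number` can be read as three sampling rules. The following is a minimal sketch of that reading using only the standard library. It is not distilabel's internal code, and `pick_number` is a hypothetical helper name introduced only for illustration.

```python
import random
from typing import Dict, List, Union

NumberSpec = Union[int, List[int], Dict[int, float]]


def pick_number(number: NumberSpec, rng: random.Random) -> int:
    """Hypothetical helper: resolve a `number` setting to a concrete count."""
    if isinstance(number, int):
        # A plain integer is used as-is.
        return number
    if isinstance(number, list):
        # A list means "choose one of these values uniformly at random".
        return rng.choice(number)
    # A dict maps each candidate count to its probability (used as a weight).
    counts = list(number.keys())
    weights = list(number.values())
    return rng.choices(counts, weights=weights, k=1)[0]


rng = random.Random(0)
fixed = pick_number(1, rng)
from_list = pick_number([1, 2, 3], rng)
from_dict = pick_number({1: 0.5, 2: 0.3, 3: 0.2}, rng)
```

Passing a seeded `random.Random` keeps the draw reproducible, which helps when comparing dataset runs.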

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[examples]
            ICOL1[func_name]
            ICOL2[func_desc]
            ICOL3[tools]
        end
        subgraph New columns
            OCOL0[query]
            OCOL1[answers]
        end
    end

    subgraph APIGenGenerator
        StepInput[Input Columns: examples, func_name, func_desc, tools]
        StepOutput[Output Columns: query, answers]
    end

    ICOL0 --> StepInput
    ICOL1 --> StepInput
    ICOL2 --> StepInput
    ICOL3 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepInput --> StepOutput

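Inputs

  • examples (str): Examples used as few-shot demonstrations to guide the model.

  • func_name (str): Name of the function to generate queries and answers for.

  • func_desc (str): Description of what the function should do.

  • tools (str): JSON-formatted string containing the tool representation of the function.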
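A single input row can be assembled with the standard library. The sketch below assumes the `QUERY:`/`ANSWER:` layout used by the examples on this page for `examples`, and an OpenAI-style function schema for `tools`. That schema, the `count` parameter, and the helper name `format_example` are illustrative assumptions, not something this page specifies.

```python
import json
from typing import Any, Dict, List


def format_example(query: str, answers: List[Dict[str, Any]]) -> str:
    """Hypothetical helper: render one few-shot example as QUERY/ANSWER text."""
    return f"QUERY:\n{query}\nANSWER:\n{json.dumps(answers)}"


# Assumed tool schema (OpenAI-style function description), for illustration only.
# The `count` parameter is hypothetical.
tool = {
    "type": "function",
    "function": {
        "name": "getrandommovie",
        "description": "Returns a list of random movies from a database by calling an external API.",
        "parameters": {
            "count": {"type": "int", "description": "Number of movies to return."}
        },
    },
}

row = {
    "examples": format_example(
        "What is the binary sum of 10010 and 11101?",
        [{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}],
    ),
    "func_name": tool["function"]["name"],
    "func_desc": tool["function"]["description"],
    # `tools` is a JSON string, not a Python object.
    "tools": json.dumps([tool]),
}
```

Serializing with `json.dumps` rather than writing the string by hand avoids quoting mistakes inside the nested answer list.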

Outputs

  • query (str): The list of queries.

  • answers (str): JSON-formatted string with the list of answers, each holding the information to pass to the function as a dictionary.
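Because `answers` arrives as a JSON string, downstream code usually decodes and checks it before use. A minimal sketch, assuming each answer is an object with `name` and `arguments` keys as in the examples on this page. `parse_answers` is a hypothetical helper, not part of distilabel.

```python
import json
from typing import Any, Dict, List


def parse_answers(answers: str, func_name: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: decode `answers` and check each function call."""
    calls = json.loads(answers)
    if not isinstance(calls, list):
        raise ValueError("expected a JSON list of function calls")
    for call in calls:
        # Each call is assumed to carry at least `name` and `arguments`.
        if not isinstance(call, dict) or not {"name", "arguments"} <= set(call):
            raise ValueError(f"malformed function call: {call!r}")
        if call["name"] != func_name:
            raise ValueError(f"unexpected function name: {call['name']!r}")
        if not isinstance(call["arguments"], dict):
            raise ValueError("`arguments` must be a JSON object")
    return calls


raw = '[{"name": "getrandommovie", "arguments": {}}]'
calls = parse_answers(raw, "getrandommovie")
```

Rows that raise here can be dropped or routed to a separate check instead of being passed on unverified.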

Examples

Generate without structured output (original implementation)

from distilabel.steps.tasks import APIGenGenerator
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 1024,
    },
)
apigen = APIGenGenerator(
    use_default_structured_output=False,
    llm=llm
)
apigen.load()

res = next(
    apigen.process(
        [
            {
                # Few-shot example; newlines are written as \n escapes inside the string.
                "examples": 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
                "func_name": "getrandommovie",
                "func_desc": "Returns a list of random movies from a database by calling an external API."
            }
        ]
    )
)
res
# [{'examples': 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
# 'number': 1,
# 'func_name': 'getrandommovie',
# 'func_desc': 'Returns a list of random movies from a database by calling an external API.',
# 'queries': ['I want to watch a movie tonight, can you recommend a random one from your database?',
# 'Give me 5 random movie suggestions from your database to plan my weekend.'],
# 'answers': [[{'name': 'getrandommovie', 'arguments': {}}],
# [{'name': 'getrandommovie', 'arguments': {}},
#     {'name': 'getrandommovie', 'arguments': {}},
#     {'name': 'getrandommovie', 'arguments': {}},
#     {'name': 'getrandommovie', 'arguments': {}},
#     {'name': 'getrandommovie', 'arguments': {}}]],
# 'raw_input_api_gen_generator_0': [{'role': 'system',
#     'content': "You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\n\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\n\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\n\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively"},
#     {'role': 'user',
#     'content': 'Here are examples of queries and the corresponding answers for similar functions:\nQUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]\n\nNote that the query could be interpreted as a combination of several independent requests.\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\nThe detailed function description is the following:\nReturns a list of random movies from a database by calling an external API.\n\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:

Generate with structured output

from distilabel.steps.tasks import APIGenGenerator
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tokenizer="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 1024,
    },
)
apigen = APIGenGenerator(
    use_default_structured_output=True,
    llm=llm
)
apigen.load()

res_struct = next(
    apigen.process(
        [
            {
                # Few-shot example; newlines are written as \n escapes inside the string.
                "examples": 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
                "func_name": "getrandommovie",
                "func_desc": "Returns a list of random movies from a database by calling an external API."
            }
        ]
    )
)
res_struct
# [{'examples': 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
# 'number': 1,
# 'func_name': 'getrandommovie',
# 'func_desc': 'Returns a list of random movies from a database by calling an external API.',
# 'queries': ["I'm bored and want to watch a movie. Can you suggest some movies?",
# "My family and I are planning a movie night. We can't decide on what to watch. Can you suggest some random movie titles?"],
# 'answers': [[{'arguments': {}, 'name': 'getrandommovie'}],
# [{'arguments': {}, 'name': 'getrandommovie'}]],
# 'raw_input_api_gen_generator_0': [{'role': 'system',
#     'content': "You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\n\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\n\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\n\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively"},
#     {'role': 'user',
#     'content': 'Here are examples of queries and the corresponding answers for similar functions:\nQUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]\n\nNote that the query could be interpreted as a combination of several independent requests.\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\nThe detailed function description is the following:\nReturns a list of random movies from a database by calling an external API.\n\nNow please generate 2 diverse query and answer pairs following the above format.'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]

References