
Task Gallery

This section contains the existing Task subclasses implemented in distilabel.

tasks

APIGenExecutionChecker

Bases: Step

Executes the generated function calls.

This step checks whether a given answer from a model (as generated by APIGenGenerator) can be executed against the given library (given by libpath, a string pointing to a Python .py file with functions).

Attributes

Name  Type  Description
libpath  str

The path to the library where we will retrieve the functions. It can also point to a folder with the functions. In this case, the folder layout should be a folder with .py files, each containing a single function, the name of the function being the same as the filename (see the sketch below).

check_is_dangerous  bool

Bool to exclude some potentially dangerous functions; it contains some heuristics found while testing. These functions can run subprocesses, deal with the OS, or have other potentially dangerous operations. Defaults to True.
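For illustration, here is a minimal sketch (not part of distilabel) of what a library file passed as libpath could contain; when libpath points to a folder instead, each .py file must define a single function named after the file (e.g. final_velocity.py defining final_velocity):

# Hypothetical library file, e.g. final_velocity.py or a module passed as `libpath`.
def final_velocity(initial_velocity: float, acceleration: float, time: float) -> float:
    """Final velocity after accelerating uniformly for a given time."""
    return initial_velocity + acceleration * time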

输入列
  • answers (str): 包含要传递给函数的参数的列表,从字典列表转储为字符串。应使用 json.loads 加载。
输出列
  • keep_row_after_execution_check (bool): 是否应保留该函数。
  • execution_result (str): 执行函数的结果。
类别
  • 过滤
  • 执行
References
  • APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets (https://arxiv.org/abs/2406.18518)
  • Salesforce/xlam-function-calling-60k (https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)

Examples

Execute a function from a given library with the answer from an LLM:

from distilabel.steps.tasks import APIGenExecutionChecker

# For the libpath you can use as an example the file at the tests folder:
# ../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py
task = APIGenExecutionChecker(
    libpath="../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py",
)
task.load()

res = next(
    task.process(
        [
            {
                "answers": [
                    {
                        "arguments": {
                            "initial_velocity": 0.2,
                            "acceleration": 0.1,
                            "time": 0.5,
                        },
                        "name": "final_velocity",
                    }
                ],
            }
        ]
    )
)
res
#[{'answers': [{'arguments': {'initial_velocity': 0.2, 'acceleration': 0.1, 'time': 0.5}, 'name': 'final_velocity'}], 'keep_row_after_execution_check': True, 'execution_result': ['0.25']}]
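The keep_row_after_execution_check column can then be used to filter the dataset downstream. A minimal sketch (plain Python, not a distilabel helper):

# `res` is the list of dictionaries yielded by `task.process(...)` above.
kept_rows = [row for row in res if row["keep_row_after_execution_check"]]
discarded_rows = [row for row in res if not row["keep_row_after_execution_check"]]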
Source code in src/distilabel/steps/tasks/apigen/execution_checker.py
class APIGenExecutionChecker(Step):
    """Executes the generated function calls.

    This step checks if a given answer from a model as generated by `APIGenGenerator`
    can be executed against the given library (given by `libpath`, which is a string
    pointing to a python .py file with functions).

    Attributes:
        libpath: The path to the library where we will retrieve the functions.
            It can also point to a folder with the functions. In this case, the folder
            layout should be a folder with .py files, each containing a single function,
            the name of the function being the same as the filename.
        check_is_dangerous: Bool to exclude some potentially dangerous functions, it contains
            some heuristics found while testing. This functions can run subprocesses, deal with
            the OS, or have other potentially dangerous operations. Defaults to True.

    Input columns:
        - answers (`str`): List with arguments to be passed to the function,
            dumped as a string from a list of dictionaries. Should be loaded using
            `json.loads`.

    Output columns:
        - keep_row_after_execution_check (`bool`): Whether the function should be kept or not.
        - execution_result (`str`): The result from executing the function.

    Categories:
        - filtering
        - execution

    References:
        - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)
        - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)

    Examples:
        Execute a function from a given library with the answer from an LLM:

        ```python
        from distilabel.steps.tasks import APIGenExecutionChecker

        # For the libpath you can use as an example the file at the tests folder:
        # ../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py
        task = APIGenExecutionChecker(
            libpath="../distilabel/tests/unit/steps/tasks/apigen/_sample_module.py",
        )
        task.load()

        res = next(
            task.process(
                [
                    {
                        "answers": [
                            {
                                "arguments": {
                                    "initial_velocity": 0.2,
                                    "acceleration": 0.1,
                                    "time": 0.5,
                                },
                                "name": "final_velocity",
                            }
                        ],
                    }
                ]
            )
        )
        res
        #[{'answers': [{'arguments': {'initial_velocity': 0.2, 'acceleration': 0.1, 'time': 0.5}, 'name': 'final_velocity'}], 'keep_row_after_execution_check': True, 'execution_result': ['0.25']}]
        ```
    """

    libpath: str = Field(
        default=...,
        description=(
            "The path to the library where we will retrieve the functions, "
            "or a folder with python files named the same as the functions they contain.",
        ),
    )
    check_is_dangerous: bool = Field(
        default=True,
        description=(
            "Bool to exclude some potentially dangerous functions, it contains "
            "some heuristics found while testing. This functions can run subprocesses, "
            "deal with the OS, or have other potentially dangerous operations.",
        ),
    )

    _toolbox: Union["ModuleType", None] = PrivateAttr(None)

    def load(self) -> None:
        """Loads the library where the functions will be extracted from."""
        super().load()
        if Path(self.libpath).suffix == ".py":
            self._toolbox = load_module_from_path(self.libpath)

    def unload(self) -> None:
        self._toolbox = None

    @property
    def inputs(self) -> "StepColumns":
        """The inputs for the task are those found in the original dataset."""
        return ["answers"]

    @property
    def outputs(self) -> "StepColumns":
        """The outputs are the columns required by `APIGenGenerator` task."""
        return ["keep_row_after_execution_check", "execution_result"]

    def _get_function(self, function_name: str) -> Callable:
        """Retrieves the function from the toolbox.

        Args:
            function_name: The name of the function to retrieve.

        Returns:
            Callable: The function to be executed.
        """
        if self._toolbox:
            return getattr(self._toolbox, function_name, None)
        try:
            toolbox = load_module_from_path(
                str(Path(self.libpath) / f"{function_name}.py")
            )
            return getattr(toolbox, function_name, None)
        except FileNotFoundError:
            return None
        except Exception as e:
            self._logger.warning(f"Error loading function '{function_name}': {e}")
            return None

    def _is_dangerous(self, function: Callable) -> bool:
        """Checks if a function is dangerous to remove it.
        Contains a list of heuristics to avoid executing possibly dangerous functions.
        """
        source_code = inspect.getsource(function)
        # We don't want to execute functions that use subprocess
        if (
            ("subprocess." in source_code)
            or ("os.system(" in source_code)
            or ("input(" in source_code)
            # Avoiding threading
            or ("threading.Thread(" in source_code)
            or ("exec(" in source_code)
            # Avoiding argparse (not sure why)
            or ("argparse.ArgumentParser(" in source_code)
            # Avoiding logging changing the levels to not mess with the logs
            or (".setLevel(" in source_code)
            # Don't run a test battery
            or ("unittest.main(" in source_code)
            # Avoid exiting the program
            or ("sys.exit(" in source_code)
            or ("exit(" in source_code)
            or ("raise SystemExit(" in source_code)
            or ("multiprocessing.Pool(" in source_code)
        ):
            return True
        return False

    @override
    def process(self, inputs: StepInput) -> "StepOutput":
        """Checks the answer to see if it can be executed.
        Captures the possible errors and returns them.

        If a single example is provided, it is copied to avoid raising an error.

        Args:
            inputs: A list of dictionaries with the input data.

        Yields:
            A list of dictionaries with the output data.
        """
        for input in inputs:
            output = []
            if input["answers"]:
                answers = json.loads(input["answers"])
            else:
                input.update(
                    **{
                        "keep_row_after_execution_check": False,
                        "execution_result": ["No answers were provided."],
                    }
                )
                continue
            for answer in answers:
                if answer is None:
                    output.append(
                        {
                            "keep": False,
                            "execution_result": "Nothing was generated for this answer.",
                        }
                    )
                    continue

                function_name = answer.get("name", None)
                arguments = answer.get("arguments", None)

                self._logger.debug(
                    f"Executing function '{function_name}' with arguments: {arguments}"
                )
                function = self._get_function(function_name)

                if self.check_is_dangerous:
                    if function and self._is_dangerous(function):
                        function = None

                if function is None:
                    output.append(
                        {
                            "keep": False,
                            "execution_result": f"Function '{function_name}' not found.",
                        }
                    )
                else:
                    execution = execute_from_response(function, arguments)
                    output.append(
                        {
                            "keep": execution["keep"],
                            "execution_result": execution["execution_result"],
                        }
                    )
            # We only consider a good response if all the answers were executed successfully,
            # but keep the reasons for further review if needed.
            input.update(
                **{
                    "keep_row_after_execution_check": all(
                        o["keep"] is True for o in output
                    ),
                    "execution_result": [o["execution_result"] for o in output],
                }
            )

        yield inputs
inputs property

The inputs for the task are those found in the original dataset.

outputs property

The outputs are the columns required by the APIGenGenerator task.

load()

Loads the library where the functions will be extracted from.

Source code in src/distilabel/steps/tasks/apigen/execution_checker.py
def load(self) -> None:
    """Loads the library where the functions will be extracted from."""
    super().load()
    if Path(self.libpath).suffix == ".py":
        self._toolbox = load_module_from_path(self.libpath)
_get_function(function_name)

Retrieves the function from the toolbox.

Parameters

Name  Type  Description  Default
function_name  str

The name of the function to retrieve.

required

Returns

Name  Type  Description
Callable  Callable

The function to be executed.

Source code in src/distilabel/steps/tasks/apigen/execution_checker.py
def _get_function(self, function_name: str) -> Callable:
    """Retrieves the function from the toolbox.

    Args:
        function_name: The name of the function to retrieve.

    Returns:
        Callable: The function to be executed.
    """
    if self._toolbox:
        return getattr(self._toolbox, function_name, None)
    try:
        toolbox = load_module_from_path(
            str(Path(self.libpath) / f"{function_name}.py")
        )
        return getattr(toolbox, function_name, None)
    except FileNotFoundError:
        return None
    except Exception as e:
        self._logger.warning(f"Error loading function '{function_name}': {e}")
        return None
_is_dangerous(function)

Checks whether a function is dangerous, in order to remove it. Contains a list of heuristics to avoid executing possibly dangerous functions.
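As an illustration, a hypothetical function like the following (not part of the library) would be flagged, since its source contains "subprocess." and "os.system(":

import os
import subprocess

def run_shell(cmd: str) -> int:
    """Hypothetical example of a function the heuristics below would reject."""
    subprocess.run(cmd, shell=True, check=False)
    return os.system(cmd)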

Source code in src/distilabel/steps/tasks/apigen/execution_checker.py
def _is_dangerous(self, function: Callable) -> bool:
    """Checks if a function is dangerous to remove it.
    Contains a list of heuristics to avoid executing possibly dangerous functions.
    """
    source_code = inspect.getsource(function)
    # We don't want to execute functions that use subprocess
    if (
        ("subprocess." in source_code)
        or ("os.system(" in source_code)
        or ("input(" in source_code)
        # Avoiding threading
        or ("threading.Thread(" in source_code)
        or ("exec(" in source_code)
        # Avoiding argparse (not sure why)
        or ("argparse.ArgumentParser(" in source_code)
        # Avoiding logging changing the levels to not mess with the logs
        or (".setLevel(" in source_code)
        # Don't run a test battery
        or ("unittest.main(" in source_code)
        # Avoid exiting the program
        or ("sys.exit(" in source_code)
        or ("exit(" in source_code)
        or ("raise SystemExit(" in source_code)
        or ("multiprocessing.Pool(" in source_code)
    ):
        return True
    return False
process(inputs)

Checks the answer to see if it can be executed. Captures the possible errors and returns them.

If a single example is provided, it is copied to avoid raising an error.

Parameters

Name  Type  Description  Default
inputs  StepInput

A list of dictionaries with the input data.

required

Yields

Type  Description
StepOutput

A list of dictionaries with the output data.

Source code in src/distilabel/steps/tasks/apigen/execution_checker.py
@override
def process(self, inputs: StepInput) -> "StepOutput":
    """Checks the answer to see if it can be executed.
    Captures the possible errors and returns them.

    If a single example is provided, it is copied to avoid raising an error.

    Args:
        inputs: A list of dictionaries with the input data.

    Yields:
        A list of dictionaries with the output data.
    """
    for input in inputs:
        output = []
        if input["answers"]:
            answers = json.loads(input["answers"])
        else:
            input.update(
                **{
                    "keep_row_after_execution_check": False,
                    "execution_result": ["No answers were provided."],
                }
            )
            continue
        for answer in answers:
            if answer is None:
                output.append(
                    {
                        "keep": False,
                        "execution_result": "Nothing was generated for this answer.",
                    }
                )
                continue

            function_name = answer.get("name", None)
            arguments = answer.get("arguments", None)

            self._logger.debug(
                f"Executing function '{function_name}' with arguments: {arguments}"
            )
            function = self._get_function(function_name)

            if self.check_is_dangerous:
                if function and self._is_dangerous(function):
                    function = None

            if function is None:
                output.append(
                    {
                        "keep": False,
                        "execution_result": f"Function '{function_name}' not found.",
                    }
                )
            else:
                execution = execute_from_response(function, arguments)
                output.append(
                    {
                        "keep": execution["keep"],
                        "execution_result": execution["execution_result"],
                    }
                )
        # We only consider a good response if all the answers were executed successfully,
        # but keep the reasons for further review if needed.
        input.update(
            **{
                "keep_row_after_execution_check": all(
                    o["keep"] is True for o in output
                ),
                "execution_result": [o["execution_result"] for o in output],
            }
        )

    yield inputs

APIGenGenerator

Bases: Task

Generate queries and answers for the given functions in JSON format.

The `APIGenGenerator` is inspired by the APIGen pipeline, which was designed to generate
verifiable and diverse function-calling datasets. The task generates a set of diverse queries
and corresponding answers for the given functions in JSON format.

Attributes:
    system_prompt: The system prompt to guide the user in the generation of queries and answers.
    use_tools: Whether to use the tools available in the prompt to generate the queries and answers.
        In case the tools are given in the input, they will be added to the prompt.
    number: The number of queries to generate. It can be a list, where each number will be
        chosen randomly, or a dictionary with the number of queries and the probability of each.
        I.e: `number=1`, `number=[1, 2, 3]`, `number={1: 0.5, 2: 0.3, 3: 0.2}` are all valid inputs.
        It corresponds to the number of parallel queries to generate.
    use_default_structured_output: Whether to use the default structured output or not.

Input columns:
    - examples (`str`): Examples used as few shots to guide the model.
    - func_name (`str`): Name for the function to generate.
    - func_desc (`str`): Description of what the function should do.
    - tools (`str`): JSON formatted string containing the tool representation of the function.

Output columns:
    - query (`str`): The list of queries.
    - answers (`str`): JSON formatted string with the list of answers, containing the info as
        a dictionary to be passed to the functions.

Categories:
    - text-generation

References:
    - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)
    - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)

Examples:
    Generate without structured output (original implementation):

    ```python
    from distilabel.steps.tasks import ApiGenGenerator
    from distilabel.models import InferenceEndpointsLLM

    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        generation_kwargs={
            "temperature": 0.7,
            "max_new_tokens": 1024,
        },
    )
    apigen = ApiGenGenerator(
        use_default_structured_output=False,
        llm=llm
    )
    apigen.load()

    res = next(
        apigen.process(
            [
                {
                    "examples": 'QUERY:

10010 和 11101 的二进制和是多少?答案:[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]', "func_name": "getrandommovie", "func_desc": "通过调用外部 API 从数据库返回随机电影列表。" } ] ) ) res # [{'examples': '查询:10010 和 11101 的二进制和是多少?答案:[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]', # 'number': 1, # 'func_name': 'getrandommovie', # 'func_desc': '通过调用外部 API 从数据库返回随机电影列表。', # 'queries': ['我今晚想看电影,你能从你的数据库中推荐一部随机电影吗?', # '给我 5 个来自你的数据库的随机电影建议,以便计划我的周末。'], # 'answers': [[{'name': 'getrandommovie', 'arguments': {}}], # [{'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}, # {'name': 'getrandommovie', 'arguments': {}}]], # 'raw_input_api_gen_generator_0': [{'role': 'system', # 'content': "你是一名数据标注员。你的职责是以 JSON 格式为给定函数生成一组多样化的查询和相应的答案。

构建查询和答案,以示例说明如何在实际场景中使用这些函数。在每个查询中包含每个参数的特定、合理的数值。例如,如果函数需要日期,请使用典型且合理的日期。

确保查询:- 清晰简洁 - 展示典型的用例 - 以有意义的方式包含所有必要的参数。对于数值参数,可以是数字或单词 - 涵盖各种难度级别,从初学者到高级用例 - 相应的参数类型和范围与函数描述匹配

确保答案:- 是 JSON 格式的函数调用列表 - 答案列表的长度应等于查询中的请求数量 - 可以有效地解决查询中的所有请求"}, # {'role': 'user', # 'content': '以下是类似函数的查询和相应答案的示例:查询:10010 和 11101 的二进制和是多少?答案:[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]

请注意,查询可以解释为几个独立请求的组合。基于这些示例,为函数 getrandommovie 生成 2 个不同的查询和答案对。详细的函数描述如下:通过调用外部 API 从数据库返回随机电影列表。

输出必须严格遵守以下 JSON 格式,并且不得包含任何其他文本

[
   {
       "query": "The generated query.",
       "answers": [
           {
               "name": "api_name",
               "arguments": {
                   "arg_name": "value"
                   ... (more arguments as required)
               }
           },
           ... (more API calls as required)
       ]
   }
]

现在请按照上述格式生成 2 个不同的查询和答案对。'}]}, # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}] ```

    Generate with structured output:

    ```python
    from distilabel.steps.tasks import ApiGenGenerator
    from distilabel.models import InferenceEndpointsLLM

    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer="meta-llama/Meta-Llama-3.1-70B-Instruct",
        generation_kwargs={
            "temperature": 0.7,
            "max_new_tokens": 1024,
        },
    )
    apigen = ApiGenGenerator(
        use_default_structured_output=True,
        llm=llm
    )
    apigen.load()

    res_struct = next(
        apigen.process(
            [
                {
                    "examples": 'QUERY:

10010 和 11101 的二进制和是多少?答案:[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]', "func_name": "getrandommovie", "func_desc": "通过调用外部 API 从数据库返回随机电影列表。" } ] ) ) res_struct # [{'examples': '查询:10010 和 11101 的二进制和是多少?答案:[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]', # 'number': 1, # 'func_name': 'getrandommovie', # 'func_desc': '通过调用外部 API 从数据库返回随机电影列表。', # 'queries': ["我很无聊,想看电影。你能推荐一些电影吗?", # "我和我的家人正在计划一个电影之夜。我们无法决定看什么。你能推荐一些随机电影名称吗?"], # 'answers': [[{'arguments': {}, 'name': 'getrandommovie'}], # [{'arguments': {}, 'name': 'getrandommovie'}]], # 'raw_input_api_gen_generator_0': [{'role': 'system', # 'content': "你是一名数据标注员。你的职责是以 JSON 格式为给定函数生成一组多样化的查询和相应的答案。

构建查询和答案,以示例说明如何在实际场景中使用这些函数。在每个查询中包含每个参数的特定、合理的数值。例如,如果函数需要日期,请使用典型且合理的日期。

确保查询:- 清晰简洁 - 展示典型的用例 - 以有意义的方式包含所有必要的参数。对于数值参数,可以是数字或单词 - 涵盖各种难度级别,从初学者到高级用例 - 相应的参数类型和范围与函数描述匹配

确保答案:- 是 JSON 格式的函数调用列表 - 答案列表的长度应等于查询中的请求数量 - 可以有效地解决查询中的所有请求"}, # {'role': 'user', # 'content': '以下是类似函数的查询和相应答案的示例:查询:10010 和 11101 的二进制和是多少?答案:[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]

请注意,查询可以解释为几个独立请求的组合。基于这些示例,为函数 getrandommovie 生成 2 个不同的查询和答案对。详细的函数描述如下:通过调用外部 API 从数据库返回随机电影列表。

现在请按照上述格式生成 2 个不同的查询和答案对。'}]}, # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}] ```

Source code in src/distilabel/steps/tasks/apigen/generator.py
class APIGenGenerator(Task):
    """Generate queries and answers for the given functions in JSON format.

    The `APIGenGenerator` is inspired by the APIGen pipeline, which was designed to generate
    verifiable and diverse function-calling datasets. The task generates a set of diverse queries
    and corresponding answers for the given functions in JSON format.

    Attributes:
        system_prompt: The system prompt to guide the user in the generation of queries and answers.
        use_tools: Whether to use the tools available in the prompt to generate the queries and answers.
            In case the tools are given in the input, they will be added to the prompt.
        number: The number of queries to generate. It can be a list, where each number will be
            chosen randomly, or a dictionary with the number of queries and the probability of each.
            I.e: `number=1`, `number=[1, 2, 3]`, `number={1: 0.5, 2: 0.3, 3: 0.2}` are all valid inputs.
            It corresponds to the number of parallel queries to generate.
        use_default_structured_output: Whether to use the default structured output or not.

    Input columns:
        - examples (`str`): Examples used as few shots to guide the model.
        - func_name (`str`): Name for the function to generate.
        - func_desc (`str`): Description of what the function should do.
        - tools (`str`): JSON formatted string containing the tool representation of the function.

    Output columns:
        - query (`str`): The list of queries.
        - answers (`str`): JSON formatted string with the list of answers, containing the info as
            a dictionary to be passed to the functions.

    Categories:
        - text-generation

    References:
        - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)
        - [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)

    Examples:
        Generate without structured output (original implementation):

        ```python
        from distilabel.steps.tasks import ApiGenGenerator
        from distilabel.models import InferenceEndpointsLLM

        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            generation_kwargs={
                "temperature": 0.7,
                "max_new_tokens": 1024,
            },
        )
        apigen = ApiGenGenerator(
            use_default_structured_output=False,
            llm=llm
        )
        apigen.load()

        res = next(
            apigen.process(
                [
                    {
                        "examples": 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
                        "func_name": "getrandommovie",
                        "func_desc": "Returns a list of random movies from a database by calling an external API."
                    }
                ]
            )
        )
        res
        # [{'examples': 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
        # 'number': 1,
        # 'func_name': 'getrandommovie',
        # 'func_desc': 'Returns a list of random movies from a database by calling an external API.',
        # 'queries': ['I want to watch a movie tonight, can you recommend a random one from your database?',
        # 'Give me 5 random movie suggestions from your database to plan my weekend.'],
        # 'answers': [[{'name': 'getrandommovie', 'arguments': {}}],
        # [{'name': 'getrandommovie', 'arguments': {}},
        #     {'name': 'getrandommovie', 'arguments': {}},
        #     {'name': 'getrandommovie', 'arguments': {}},
        #     {'name': 'getrandommovie', 'arguments': {}},
        #     {'name': 'getrandommovie', 'arguments': {}}]],
        # 'raw_input_api_gen_generator_0': [{'role': 'system',
        #     'content': "You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\n\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\n\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\n\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively"},
        #     {'role': 'user',
        #     'content': 'Here are examples of queries and the corresponding answers for similar functions:\nQUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]\n\nNote that the query could be interpreted as a combination of several independent requests.\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\nThe detailed function description is the following:\nReturns a list of random movies from a database by calling an external API.\n\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\n```json\n[\n   {\n       "query": "The generated query.",\n       "answers": [\n           {\n               "name": "api_name",\n               "arguments": {\n                   "arg_name": "value"\n                   ... (more arguments as required)\n               }\n           },\n           ... (more API calls as required)\n       ]\n   }\n]\n```\n\nNow please generate 2 diverse query and answer pairs following the above format.'}]},
        # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
        ```

        Generate with structured output:

        ```python
        from distilabel.steps.tasks import ApiGenGenerator
        from distilabel.models import InferenceEndpointsLLM

        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            tokenizer="meta-llama/Meta-Llama-3.1-70B-Instruct",
            generation_kwargs={
                "temperature": 0.7,
                "max_new_tokens": 1024,
            },
        )
        apigen = ApiGenGenerator(
            use_default_structured_output=True,
            llm=llm
        )
        apigen.load()

        res_struct = next(
            apigen.process(
                [
                    {
                        "examples": 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
                        "func_name": "getrandommovie",
                        "func_desc": "Returns a list of random movies from a database by calling an external API."
                    }
                ]
            )
        )
        res_struct
        # [{'examples': 'QUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]',
        # 'number': 1,
        # 'func_name': 'getrandommovie',
        # 'func_desc': 'Returns a list of random movies from a database by calling an external API.',
        # 'queries': ["I'm bored and want to watch a movie. Can you suggest some movies?",
        # "My family and I are planning a movie night. We can't decide on what to watch. Can you suggest some random movie titles?"],
        # 'answers': [[{'arguments': {}, 'name': 'getrandommovie'}],
        # [{'arguments': {}, 'name': 'getrandommovie'}]],
        # 'raw_input_api_gen_generator_0': [{'role': 'system',
        #     'content': "You are a data labeler. Your responsibility is to generate a set of diverse queries and corresponding answers for the given functions in JSON format.\n\nConstruct queries and answers that exemplify how to use these functions in a practical scenario. Include in each query specific, plausible values for each parameter. For instance, if the function requires a date, use a typical and reasonable date.\n\nEnsure the query:\n- Is clear and concise\n- Demonstrates typical use cases\n- Includes all necessary parameters in a meaningful way. For numerical parameters, it could be either numbers or words\n- Across a variety level of difficulties, ranging from beginner and advanced use cases\n- The corresponding result's parameter types and ranges match with the function's descriptions\n\nEnsure the answer:\n- Is a list of function calls in JSON format\n- The length of the answer list should be equal to the number of requests in the query\n- Can solve all the requests in the query effectively"},
        #     {'role': 'user',
        #     'content': 'Here are examples of queries and the corresponding answers for similar functions:\nQUERY:\nWhat is the binary sum of 10010 and 11101?\nANSWER:\n[{"name": "binary_addition", "arguments": {"a": "10010", "b": "11101"}}]\n\nNote that the query could be interpreted as a combination of several independent requests.\nBased on these examples, generate 2 diverse query and answer pairs for the function `getrandommovie`\nThe detailed function description is the following:\nReturns a list of random movies from a database by calling an external API.\n\nNow please generate 2 diverse query and answer pairs following the above format.'}]},
        # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
        ```
    """

    system_prompt: str = SYSTEM_PROMPT_API_GEN
    use_default_structured_output: bool = False
    number: Union[int, List[int], Dict[int, float]] = 1
    use_tools: bool = True

    _number: Union[int, None] = PrivateAttr(None)
    _fn_parallel_queries: Union[Callable[[], str], None] = PrivateAttr(None)
    _format_inst: Union[str, None] = PrivateAttr(None)

    def load(self) -> None:
        """Loads the template for the generator prompt."""
        super().load()
        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "apigen"
            / "generator.jinja2"
        )
        self._template = Template(open(_path).read())
        self._format_inst = self._set_format_inst()

    def _parallel_queries(self, number: int) -> Callable[[int], str]:
        """Prepares the function to update the parallel queries guide in the prompt.

        Raises:
            ValueError: if `is_parallel` is not a boolean or a list of floats.

        Returns:
            The function to generate the parallel queries guide.
        """
        if number > 1:
            return (
                "It can contain multiple parallel queries in natural language for the given functions. "
                "They could use either the same function with different arguments or different functions.\n"
            )
        return ""

    def _get_number(self) -> int:
        """Generates the number of queries to generate in a single call.
        The number must be set to `_number` to avoid changing the original value
        when calling `_default_error`.
        """
        if isinstance(self.number, list):
            self._number = random.choice(self.number)
        elif isinstance(self.number, dict):
            self._number = random.choices(
                list(self.number.keys()), list(self.number.values())
            )[0]
        else:
            self._number = self.number
        return self._number

    def _set_format_inst(self) -> str:
        """Prepares the function to generate the formatted instructions for the prompt.

        If the default structured output is used, returns an empty string because nothing
        else is needed, otherwise, returns the original addition to the prompt to guide the model
        to generate a formatted JSON.
        """
        return (
            "\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\n"
            "```\n"
            "[\n"
            "   {\n"
            '       "query": "The generated query.",\n'
            '       "answers": [\n'
            "           {\n"
            '               "name": "api_name",\n'
            '               "arguments": {\n'
            '                   "arg_name": "value"\n'
            "                   ... (more arguments as required)\n"
            "               }\n"
            "           },\n"
            "           ... (more API calls as required)\n"
            "       ]\n"
            "   }\n"
            "]\n"
            "```\n"
        )

    def _get_func_desc(self, input: Dict[str, Any]) -> str:
        """If available and required, will use the info from the tools in the
        prompt for extra information. Otherwise will use jut the function description.
        """
        if not self.use_tools:
            return input["func_desc"]
        extra = ""  # Extra information from the tools (if available will be added)
        if "tools" in input:
            extra = f"\n\nThis is the available tool to guide you (respect the order of the parameters):\n{input['tools']}"
        return input["func_desc"] + extra

    @property
    def inputs(self) -> "StepColumns":
        """The inputs for the task."""
        return {
            "examples": True,
            "func_name": True,
            "func_desc": True,
            "tools": False,
        }

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType`."""
        number = self._get_number()
        parallel_queries = self._parallel_queries(number)
        return [
            {"role": "system", "content": self.system_prompt},
            {
                "role": "user",
                "content": self._template.render(
                    examples=input["examples"],
                    parallel_queries=parallel_queries,
                    number=number,
                    func_name=input["func_name"],
                    func_desc=self._get_func_desc(input),
                    format_inst=self._format_inst,
                ),
            },
        ]

    @property
    def outputs(self) -> "StepColumns":
        """The output for the task are the queries and corresponding answers."""
        return ["query", "answers", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the score of each instruction.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict with the queries and answers pairs.
            The answers are an array of answers corresponding to the query.
            Each answer is represented as an object with the following properties:
                - name (string): The name of the tool used to generate the answer.
                - arguments (object): An object representing the arguments passed to the tool to generate the answer.
            Each argument is represented as a key-value pair, where the key is the parameter name and the
            value is the corresponding value.
        """
        if output is None:
            return self._default_error(input)

        if not self.use_default_structured_output:
            output = remove_fences(output)

        try:
            pairs = orjson.loads(output)
        except orjson.JSONDecodeError:
            return self._default_error(input)

        pairs = pairs["pairs"] if self.use_default_structured_output else pairs

        return self._format_output(pairs, input)

    def _format_output(
        self, pairs: Dict[str, Any], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """Parses the response, returning a dictionary with queries and answers.

        Args:
            pairs: The parsed dictionary from the LLM's output.
            input: The input from the `LLM`.

        Returns:
            Formatted output, where the `queries` are a list of strings, and the `answers`
            are a list of objects.
        """
        try:
            input.update(
                **{
                    "query": pairs[0]["query"],
                    "answers": json.dumps(pairs[0]["answers"]),
                }
            )
            return input
        except Exception as e:
            self._logger.error(f"Error formatting output: {e}, pairs: '{pairs}'")
            return self._default_error(input)

    def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:
        """Returns a default error output, to fill the responses in case of failure."""
        input.update(
            **{
                "query": None,
                "answers": json.dumps([None] * self._number),
            }
        )
        return input

    @override
    def get_structured_output(self) -> Dict[str, Any]:
        """Creates the json schema to be passed to the LLM, to enforce generating
        a dictionary with the output which can be directly parsed as a python dictionary.

        The schema corresponds to the following:

        ```python
        from typing import Dict, List
        from pydantic import BaseModel


        class Answer(BaseModel):
            name: str
            arguments: Dict[str, str]

        class QueryAnswer(BaseModel):
            query: str
            answers: List[Answer]

        class QueryAnswerPairs(BaseModel):
            pairs: List[QueryAnswer]

        json.dumps(QueryAnswerPairs.model_json_schema(), indent=4)
        ```

        Returns:
            JSON Schema of the response to enforce.
        """
        return {
            "$defs": {
                "Answer": {
                    "properties": {
                        "name": {"title": "Name", "type": "string"},
                        "arguments": {
                            "additionalProperties": {"type": "string"},
                            "title": "Arguments",
                            "type": "object",
                        },
                    },
                    "required": ["name", "arguments"],
                    "title": "Answer",
                    "type": "object",
                },
                "QueryAnswer": {
                    "properties": {
                        "query": {"title": "Query", "type": "string"},
                        "answers": {
                            "items": {"$ref": "#/$defs/Answer"},
                            "title": "Answers",
                            "type": "array",
                        },
                    },
                    "required": ["query", "answers"],
                    "title": "QueryAnswer",
                    "type": "object",
                },
            },
            "properties": {
                "pairs": {
                    "items": {"$ref": "#/$defs/QueryAnswer"},
                    "title": "Pairs",
                    "type": "array",
                }
            },
            "required": ["pairs"],
            "title": "QueryAnswerPairs",
            "type": "object",
        }
inputs property

The inputs for the task.

outputs property

The output for the task are the queries and corresponding answers.

load()

Loads the template for the generator prompt.

Source code in src/distilabel/steps/tasks/apigen/generator.py
def load(self) -> None:
    """Loads the template for the generator prompt."""
    super().load()
    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "apigen"
        / "generator.jinja2"
    )
    self._template = Template(open(_path).read())
    self._format_inst = self._set_format_inst()
_parallel_queries(number)

Prepares the function to update the parallel queries guide in the prompt.

Raises

Type  Description
ValueError

If is_parallel is not a boolean or a list of floats.

Returns

Type  Description
Callable[[int], str]

The function to generate the parallel queries guide.

Source code in src/distilabel/steps/tasks/apigen/generator.py
def _parallel_queries(self, number: int) -> Callable[[int], str]:
    """Prepares the function to update the parallel queries guide in the prompt.

    Raises:
        ValueError: if `is_parallel` is not a boolean or a list of floats.

    Returns:
        The function to generate the parallel queries guide.
    """
    if number > 1:
        return (
            "It can contain multiple parallel queries in natural language for the given functions. "
            "They could use either the same function with different arguments or different functions.\n"
        )
    return ""
_get_number()

Generates the number of queries to generate in a single call. The number must be set to _number to avoid changing the original value when calling _default_error.
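For instance, when number is given as a dictionary, the value is drawn with random.choices, using the keys as candidates and the values as weights. A standalone sketch mirroring the code below:

import random

number = {1: 0.5, 2: 0.3, 3: 0.2}  # number of queries -> probability
sampled = random.choices(list(number.keys()), list(number.values()))[0]
print(sampled)  # 1, 2 or 3, drawn with the given weights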

Source code in src/distilabel/steps/tasks/apigen/generator.py
def _get_number(self) -> int:
    """Generates the number of queries to generate in a single call.
    The number must be set to `_number` to avoid changing the original value
    when calling `_default_error`.
    """
    if isinstance(self.number, list):
        self._number = random.choice(self.number)
    elif isinstance(self.number, dict):
        self._number = random.choices(
            list(self.number.keys()), list(self.number.values())
        )[0]
    else:
        self._number = self.number
    return self._number
_set_format_inst()

Prepares the function to generate the formatted instructions for the prompt.

If the default structured output is used, an empty string is returned because nothing else is needed; otherwise, the original addition to the prompt is returned to guide the model to generate formatted JSON.

Source code in src/distilabel/steps/tasks/apigen/generator.py
def _set_format_inst(self) -> str:
    """Prepares the function to generate the formatted instructions for the prompt.

    If the default structured output is used, returns an empty string because nothing
    else is needed, otherwise, returns the original addition to the prompt to guide the model
    to generate a formatted JSON.
    """
    return (
        "\nThe output MUST strictly adhere to the following JSON format, and NO other text MUST be included:\n"
        "```\n"
        "[\n"
        "   {\n"
        '       "query": "The generated query.",\n'
        '       "answers": [\n'
        "           {\n"
        '               "name": "api_name",\n'
        '               "arguments": {\n'
        '                   "arg_name": "value"\n'
        "                   ... (more arguments as required)\n"
        "               }\n"
        "           },\n"
        "           ... (more API calls as required)\n"
        "       ]\n"
        "   }\n"
        "]\n"
        "```\n"
    )
_get_func_desc(input)

If available and required, the info from the tools will be used in the prompt for extra information. Otherwise just the function description will be used.
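For example, with use_tools=True and a "tools" value present in the input row, the description that ends up in the prompt is extended as follows (hypothetical input row, mirroring the code below):

input_row = {
    "func_desc": "Returns a list of random movies from a database by calling an external API.",
    "tools": '[{"name": "getrandommovie", "parameters": {}}]',  # hypothetical tool JSON
}
func_desc = input_row["func_desc"] + (
    "\n\nThis is the available tool to guide you (respect the order of the parameters):\n"
    + input_row["tools"]
)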

Source code in src/distilabel/steps/tasks/apigen/generator.py
def _get_func_desc(self, input: Dict[str, Any]) -> str:
    """If available and required, will use the info from the tools in the
    prompt for extra information. Otherwise will use jut the function description.
    """
    if not self.use_tools:
        return input["func_desc"]
    extra = ""  # Extra information from the tools (if available will be added)
    if "tools" in input:
        extra = f"\n\nThis is the available tool to guide you (respect the order of the parameters):\n{input['tools']}"
    return input["func_desc"] + extra
format_input(input)

The input is formatted as a ChatType.

Source code in src/distilabel/steps/tasks/apigen/generator.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType`."""
    number = self._get_number()
    parallel_queries = self._parallel_queries(number)
    return [
        {"role": "system", "content": self.system_prompt},
        {
            "role": "user",
            "content": self._template.render(
                examples=input["examples"],
                parallel_queries=parallel_queries,
                number=number,
                func_name=input["func_name"],
                func_desc=self._get_func_desc(input),
                format_inst=self._format_inst,
            ),
        },
    ]
format_output(output, input)

The output is formatted as a list with the score of each instruction.

Parameters

Name  Type  Description  Default
output  Union[str, None]

The raw output of the LLM.

required

input  Dict[str, Any]

The input to the task. Used for obtaining the number of responses.

required

Returns

Type  Description
Dict[str, Any]

A dict with the queries and answers pairs. The answers are an array of answers corresponding to the query. Each answer is represented as an object with the following properties: name (string), the name of the tool used to generate the answer; arguments (object), an object representing the arguments passed to the tool to generate the answer. Each argument is represented as a key-value pair, where the key is the parameter name and the value is the corresponding value.

Source code in src/distilabel/steps/tasks/apigen/generator.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a list with the score of each instruction.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict with the queries and answers pairs.
        The answers are an array of answers corresponding to the query.
        Each answer is represented as an object with the following properties:
            - name (string): The name of the tool used to generate the answer.
            - arguments (object): An object representing the arguments passed to the tool to generate the answer.
        Each argument is represented as a key-value pair, where the key is the parameter name and the
        value is the corresponding value.
    """
    if output is None:
        return self._default_error(input)

    if not self.use_default_structured_output:
        output = remove_fences(output)

    try:
        pairs = orjson.loads(output)
    except orjson.JSONDecodeError:
        return self._default_error(input)

    pairs = pairs["pairs"] if self.use_default_structured_output else pairs

    return self._format_output(pairs, input)
_format_output(pairs, input)

Parses the response, returning a dictionary with queries and answers.

Parameters

Name  Type  Description  Default
pairs  Dict[str, Any]

The parsed dictionary from the LLM's output.

required

input  Dict[str, Any]

The input from the LLM.

required

Returns

Type  Description
Dict[str, Any]

Formatted output, where the queries are a list of strings, and the answers are a list of objects.

Source code in src/distilabel/steps/tasks/apigen/generator.py
def _format_output(
    self, pairs: Dict[str, Any], input: Dict[str, Any]
) -> Dict[str, Any]:
    """Parses the response, returning a dictionary with queries and answers.

    Args:
        pairs: The parsed dictionary from the LLM's output.
        input: The input from the `LLM`.

    Returns:
        Formatted output, where the `queries` are a list of strings, and the `answers`
        are a list of objects.
    """
    try:
        input.update(
            **{
                "query": pairs[0]["query"],
                "answers": json.dumps(pairs[0]["answers"]),
            }
        )
        return input
    except Exception as e:
        self._logger.error(f"Error formatting output: {e}, pairs: '{pairs}'")
        return self._default_error(input)
_default_error(input)

Returns a default error output, to fill the responses in case of failure.

Source code in src/distilabel/steps/tasks/apigen/generator.py
def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:
    """Returns a default error output, to fill the responses in case of failure."""
    input.update(
        **{
            "query": None,
            "answers": json.dumps([None] * self._number),
        }
    )
    return input
get_structured_output()

Creates the JSON schema to be passed to the LLM, to enforce generating a dictionary with an output that can be directly parsed as a Python dictionary.

The schema corresponds to the following:

from typing import Dict, List
from pydantic import BaseModel


class Answer(BaseModel):
    name: str
    arguments: Dict[str, str]

class QueryAnswer(BaseModel):
    query: str
    answers: List[Answer]

class QueryAnswerPairs(BaseModel):
    pairs: List[QueryAnswer]

json.dumps(QueryAnswerPairs.model_json_schema(), indent=4)
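A quick way to check that a structured response matches this schema is to validate it against the Pydantic models shown above (a sketch assuming Pydantic v2):

raw = '{"pairs": [{"query": "Recommend a random movie.", "answers": [{"name": "getrandommovie", "arguments": {}}]}]}'
pairs = QueryAnswerPairs.model_validate_json(raw)
print(pairs.pairs[0].query)  # "Recommend a random movie."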

Returns

Type  Description
Dict[str, Any]

JSON Schema of the response to enforce.

Source code in src/distilabel/steps/tasks/apigen/generator.py
@override
def get_structured_output(self) -> Dict[str, Any]:
    """Creates the json schema to be passed to the LLM, to enforce generating
    a dictionary with the output which can be directly parsed as a python dictionary.

    The schema corresponds to the following:

    ```python
    from typing import Dict, List
    from pydantic import BaseModel


    class Answer(BaseModel):
        name: str
        arguments: Dict[str, str]

    class QueryAnswer(BaseModel):
        query: str
        answers: List[Answer]

    class QueryAnswerPairs(BaseModel):
        pairs: List[QueryAnswer]

    json.dumps(QueryAnswerPairs.model_json_schema(), indent=4)
    ```

    Returns:
        JSON Schema of the response to enforce.
    """
    return {
        "$defs": {
            "Answer": {
                "properties": {
                    "name": {"title": "Name", "type": "string"},
                    "arguments": {
                        "additionalProperties": {"type": "string"},
                        "title": "Arguments",
                        "type": "object",
                    },
                },
                "required": ["name", "arguments"],
                "title": "Answer",
                "type": "object",
            },
            "QueryAnswer": {
                "properties": {
                    "query": {"title": "Query", "type": "string"},
                    "answers": {
                        "items": {"$ref": "#/$defs/Answer"},
                        "title": "Answers",
                        "type": "array",
                    },
                },
                "required": ["query", "answers"],
                "title": "QueryAnswer",
                "type": "object",
            },
        },
        "properties": {
            "pairs": {
                "items": {"$ref": "#/$defs/QueryAnswer"},
                "title": "Pairs",
                "type": "array",
            }
        },
        "required": ["pairs"],
        "title": "QueryAnswerPairs",
        "type": "object",
    }

APIGenSemanticChecker

Bases: Task

Generate queries and answers for the given functions in JSON format.

The APIGenGenerator is inspired by the APIGen pipeline, which was designed to generate verifiable and diverse function-calling datasets. The task generates a set of diverse queries and corresponding answers for the given functions in JSON format.

Attributes

Name  Type  Description
system_prompt  str

System prompt for the task. Has a default one.

exclude_failed_execution  str

Whether to exclude failed executions (it won't run on rows with False in the keep_row_after_execution_check column, which comes from running APIGenExecutionChecker). Defaults to True.

Input columns
  • func_desc (str): Description of what the function should do.
  • query (str): Instruction from the user.
  • answers (str): JSON encoded list with the arguments to be passed to the function/API. Should be loaded using json.loads.
  • execution_result (str): Result of the function/API executed.
Output columns
  • thought (str): Reasoning about whether to keep this output or not.
  • keep_row_after_semantic_check (bool): True or False, can be used to filter afterwards.
Categories
  • filtering
  • text-generation
References
  • APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets (https://arxiv.org/abs/2406.18518)
  • Salesforce/xlam-function-calling-60k (https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)

Examples

Semantic checker for generated function calls (original implementation):

```python
from distilabel.steps.tasks import APIGenSemanticChecker
from distilabel.models import InferenceEndpointsLLM

llm=InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 1024,
    },
)
semantic_checker = APIGenSemanticChecker(
    use_default_structured_output=False,
    llm=llm
)
semantic_checker.load()

res = next(
    semantic_checker.process(
        [
            {
                "func_desc": "Fetch information about a specific cat breed from the Cat Breeds API.",
                "query": "What information can be obtained about the Maine Coon cat breed?",
                "answers": json.dumps([{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]),
                "execution_result": "The Maine Coon is a big and hairy breed of cat",
            }
        ]
    )
)
res
# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',
# 'query': 'What information can be obtained about the Maine Coon cat breed?',
# 'answers': [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}],
# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',
# 'thought': '',
# 'keep_row_after_semantic_check': True,
# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',
#     'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user’s intentions.\n\nDo not pass if:\n1. The function call does not align with the query’s objective, or the input arguments appear incorrect.\n2. The function call and arguments are not properly chosen from the available functions.\n3. The number of function calls does not correspond to the user’s intentions.\n4. The execution results are irrelevant and do not match the function’s purpose.\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\n'},
#     {'role': 'user',
#     'content': 'Given Information:\n- All Available Functions:\nFetch information about a specific cat breed from the Cat Breeds API.\n- User Query: What information can be obtained about the Maine Coon cat breed?\n- Generated Function Calls: [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]\n- Execution Results: The Maine Coon is a big and hairy breed of cat\n\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\n\nThe main decision factor is wheather the function calls accurately reflect the query\'s intentions and the function descriptions.\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\n\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\n```\n{\n   "thought": "Concisely describe your reasoning here",\n   "pass": "yes" or "no"\n}\n```\n'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
```

Semantic checker for generated function calls (structured output):

```python
import json

from distilabel.steps.tasks import APIGenSemanticChecker
from distilabel.models import InferenceEndpointsLLM

llm=InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 1024,
    },
)
semantic_checker = APIGenSemanticChecker(
    use_default_structured_output=True,
    llm=llm
)
semantic_checker.load()

res = next(
    semantic_checker.process(
        [
            {
                "func_desc": "Fetch information about a specific cat breed from the Cat Breeds API.",
                "query": "What information can be obtained about the Maine Coon cat breed?",
                "answers": json.dumps([{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]),
                "execution_result": "The Maine Coon is a big and hairy breed of cat",
            }
        ]
    )
)
res
# [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',
# 'query': 'What information can be obtained about the Maine Coon cat breed?',
# 'answers': [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}],
# 'execution_result': 'The Maine Coon is a big and hairy breed of cat',
# 'keep_row_after_semantic_check': True,
# 'thought': '',
# 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',
#     'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user’s intentions.\n\nDo not pass if:\n1. The function call does not align with the query’s objective, or the input arguments appear incorrect.\n2. The function call and arguments are not properly chosen from the available functions.\n3. The number of function calls does not correspond to the user’s intentions.\n4. The execution results are irrelevant and do not match the function’s purpose.\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\n'},
#     {'role': 'user',
#     'content': 'Given Information:\n- All Available Functions:\nFetch information about a specific cat breed from the Cat Breeds API.\n- User Query: What information can be obtained about the Maine Coon cat breed?\n- Generated Function Calls: [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]\n- Execution Results: The Maine Coon is a big and hairy breed of cat\n\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\n\nThe main decision factor is wheather the function calls accurately reflect the query\'s intentions and the function descriptions.\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\n'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
```
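下面是一个最小示意（并非库中提供的实现），演示如何利用输出列 `keep_row_after_semantic_check` 对结果做后续过滤；其中 `res` 沿用上面示例中 `process` 返回的行列表：

```python
# 假设性示意：res 为上面示例中 next(semantic_checker.process(...)) 返回的行列表
kept_rows = [
    row for row in res
    if row.get("keep_row_after_semantic_check")  # 仅保留通过语义检查的行
]
print(f"kept {len(kept_rows)} of {len(res)} rows")
```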
源代码位于 src/distilabel/steps/tasks/apigen/semantic_checker.py
class APIGenSemanticChecker(Task):
    r"""Generate queries and answers for the given functions in JSON format.

    The `APIGenGenerator` is inspired by the APIGen pipeline, which was designed to generate
    verifiable and diverse function-calling datasets. The task generates a set of diverse queries
    and corresponding answers for the given functions in JSON format.

    Attributes:
        system_prompt: System prompt for the task. Has a default one.
        exclude_failed_execution: Whether to exclude failed executions (won't run on those
            rows that have a False in `keep_row_after_execution_check` column, which
            comes from running `APIGenExecutionChecker`). Defaults to True.

    Input columns:
        - func_desc (`str`): Description of what the function should do.
        - query (`str`): Instruction from the user.
        - answers (`str`): JSON encoded list with arguments to be passed to the function/API.
            Should be loaded using `json.loads`.
        - execution_result (`str`): Result of the function/API executed.

    Output columns:
        - thought (`str`): Reasoning for the output on whether to keep this output or not.
        - keep_row_after_semantic_check (`bool`): True or False, can be used to filter
            afterwards.

    Categories:
        - filtering
        - text-generation

    References:
        - [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518)
        - [Salesforce/xlam-function-calling-60k](https://hugging-face.cn/datasets/Salesforce/xlam-function-calling-60k)

    Examples:

        Semantic checker for generated function calls (original implementation):

        ```python
        from distilabel.steps.tasks import APIGenSemanticChecker
        from distilabel.models import InferenceEndpointsLLM

        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            generation_kwargs={
                "temperature": 0.7,
                "max_new_tokens": 1024,
            },
        )
        semantic_checker = APIGenSemanticChecker(
            use_default_structured_output=False,
            llm=llm
        )
        semantic_checker.load()

        res = next(
            semantic_checker.process(
                [
                    {
                        "func_desc": "Fetch information about a specific cat breed from the Cat Breeds API.",
                        "query": "What information can be obtained about the Maine Coon cat breed?",
                        "answers": json.dumps([{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]),
                        "execution_result": "The Maine Coon is a big and hairy breed of cat",
                    }
                ]
            )
        )
        res
        # [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',
        # 'query': 'What information can be obtained about the Maine Coon cat breed?',
        # 'answers': [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}],
        # 'execution_result': 'The Maine Coon is a big and hairy breed of cat',
        # 'thought': '',
        # 'keep_row_after_semantic_check': True,
        # 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',
        #     'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user’s intentions.\n\nDo not pass if:\n1. The function call does not align with the query’s objective, or the input arguments appear incorrect.\n2. The function call and arguments are not properly chosen from the available functions.\n3. The number of function calls does not correspond to the user’s intentions.\n4. The execution results are irrelevant and do not match the function’s purpose.\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\n'},
        #     {'role': 'user',
        #     'content': 'Given Information:\n- All Available Functions:\nFetch information about a specific cat breed from the Cat Breeds API.\n- User Query: What information can be obtained about the Maine Coon cat breed?\n- Generated Function Calls: [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]\n- Execution Results: The Maine Coon is a big and hairy breed of cat\n\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\n\nThe main decision factor is wheather the function calls accurately reflect the query\'s intentions and the function descriptions.\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\n\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\n```\n{\n   "thought": "Concisely describe your reasoning here",\n   "pass": "yes" or "no"\n}\n```\n'}]},
        # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
        ```

        Semantic checker for generated function calls (structured output):

        ```python
        from distilabel.steps.tasks import APIGenSemanticChecker
        from distilabel.models import InferenceEndpointsLLM

        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            generation_kwargs={
                "temperature": 0.7,
                "max_new_tokens": 1024,
            },
        )
        semantic_checker = APIGenSemanticChecker(
            use_default_structured_output=True,
            llm=llm
        )
        semantic_checker.load()

        res = next(
            semantic_checker.process(
                [
                    {
                        "func_desc": "Fetch information about a specific cat breed from the Cat Breeds API.",
                        "query": "What information can be obtained about the Maine Coon cat breed?",
                        "answers": json.dumps([{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]),
                        "execution_result": "The Maine Coon is a big and hairy breed of cat",
                    }
                ]
            )
        )
        res
        # [{'func_desc': 'Fetch information about a specific cat breed from the Cat Breeds API.',
        # 'query': 'What information can be obtained about the Maine Coon cat breed?',
        # 'answers': [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}],
        # 'execution_result': 'The Maine Coon is a big and hairy breed of cat',
        # 'keep_row_after_semantic_check': True,
        # 'thought': '',
        # 'raw_input_a_p_i_gen_semantic_checker_0': [{'role': 'system',
        #     'content': 'As a data quality evaluator, you must assess the alignment between a user query, corresponding function calls, and their execution results.\nThese function calls and results are generated by other models, and your task is to ensure these results accurately reflect the user’s intentions.\n\nDo not pass if:\n1. The function call does not align with the query’s objective, or the input arguments appear incorrect.\n2. The function call and arguments are not properly chosen from the available functions.\n3. The number of function calls does not correspond to the user’s intentions.\n4. The execution results are irrelevant and do not match the function’s purpose.\n5. The execution results contain errors or reflect that the function calls were not executed successfully.\n'},
        #     {'role': 'user',
        #     'content': 'Given Information:\n- All Available Functions:\nFetch information about a specific cat breed from the Cat Breeds API.\n- User Query: What information can be obtained about the Maine Coon cat breed?\n- Generated Function Calls: [{"name": "get_breed_information", "arguments": {"breed": "Maine Coon"}}]\n- Execution Results: The Maine Coon is a big and hairy breed of cat\n\nNote: The query may have multiple intentions. Functions may be placeholders, and execution results may be truncated due to length, which is acceptable and should not cause a failure.\n\nThe main decision factor is wheather the function calls accurately reflect the query\'s intentions and the function descriptions.\nProvide your reasoning in the thought section and decide if the data passes (answer yes or no).\nIf not passing, concisely explain your reasons in the thought section; otherwise, leave this section blank.\n'}]},
        # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
        ```
    """

    system_prompt: str = SYSTEM_PROMPT_SEMANTIC_CHECKER
    use_default_structured_output: bool = False

    _format_inst: Union[str, None] = PrivateAttr(None)

    def load(self) -> None:
        """Loads the template for the generator prompt."""
        super().load()
        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "apigen"
            / "semantic_checker.jinja2"
        )

        self._template = Template(open(_path).read())
        self._format_inst = self._set_format_inst()

    def _set_format_inst(self) -> str:
        """Prepares the function to generate the formatted instructions for the prompt.

        If the default structured output is used, returns an empty string because nothing
        else is needed, otherwise, returns the original addition to the prompt to guide the model
        to generate a formatted JSON.
        """
        return (
            "\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\n"
            "```\n"
            "{\n"
            '   "thought": "Concisely describe your reasoning here",\n'
            '   "passes": "yes" or "no"\n'
            "}\n"
            "```\n"
        )

    @property
    def inputs(self) -> "StepColumns":
        """The inputs for the task."""
        return {
            "func_desc": True,
            "query": True,
            "answers": True,
            "execution_result": True,
            "keep_row_after_execution_check": True,
        }

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType`."""
        return [
            {"role": "system", "content": self.system_prompt},
            {
                "role": "user",
                "content": self._template.render(
                    func_desc=input["func_desc"],
                    query=input["query"] or "",
                    func_call=input["answers"] or "",
                    execution_result=input["execution_result"],
                    format_inst=self._format_inst,
                ),
            },
        ]

    @property
    def outputs(self) -> "StepColumns":
        """The output for the task are the queries and corresponding answers."""
        return ["keep_row_after_semantic_check", "thought"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the score of each instruction.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict with the queries and answers pairs.
            The answers are an array of answers corresponding to the query.
            Each answer is represented as an object with the following properties:
                - name (string): The name of the tool used to generate the answer.
                - arguments (object): An object representing the arguments passed to the tool to generate the answer.
            Each argument is represented as a key-value pair, where the key is the parameter name and the
            value is the corresponding value.
        """
        if output is None:
            return self._default_error(input)

        output = remove_fences(output)

        try:
            result = orjson.loads(output)
            # Update the column name and change to bool
            result["keep_row_after_semantic_check"] = (
                result.pop("passes").lower() == "yes"
            )
            input.update(**result)
            return input
        except orjson.JSONDecodeError:
            return self._default_error(input)

    def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:
        """Default error message for the task."""
        input.update({"thought": None, "keep_row_after_semantic_check": None})
        return input

    @override
    def get_structured_output(self) -> Dict[str, Any]:
        """Creates the json schema to be passed to the LLM, to enforce generating
        a dictionary with the output which can be directly parsed as a python dictionary.

        The schema corresponds to the following:

        ```python
        from typing import Literal
        from pydantic import BaseModel
        import json

        class Checker(BaseModel):
            thought: str
            passes: Literal["yes", "no"]

        json.dumps(Checker.model_json_schema(), indent=4)
        ```

        Returns:
            JSON Schema of the response to enforce.
        """
        return {
            "properties": {
                "thought": {"title": "Thought", "type": "string"},
                "passes": {"enum": ["yes", "no"], "title": "Passes", "type": "string"},
            },
            "required": ["thought", "passes"],
            "title": "Checker",
            "type": "object",
        }
inputs property

任务的输入。

outputs property

任务的输出是查询和相应的答案。

load()

加载生成器提示的模板。

源代码位于 src/distilabel/steps/tasks/apigen/semantic_checker.py
def load(self) -> None:
    """Loads the template for the generator prompt."""
    super().load()
    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "apigen"
        / "semantic_checker.jinja2"
    )

    self._template = Template(open(_path).read())
    self._format_inst = self._set_format_inst()
_set_format_inst()

准备函数以生成提示的格式化指令。

如果使用默认的结构化输出,则返回一个空字符串,因为不需要其他任何操作,否则,返回添加到提示的原始内容,以指导模型生成格式化的 JSON。

源代码位于 src/distilabel/steps/tasks/apigen/semantic_checker.py
def _set_format_inst(self) -> str:
    """Prepares the function to generate the formatted instructions for the prompt.

    If the default structured output is used, returns an empty string because nothing
    else is needed, otherwise, returns the original addition to the prompt to guide the model
    to generate a formatted JSON.
    """
    return (
        "\nYour response MUST strictly adhere to the following JSON format, and NO other text MUST be included.\n"
        "```\n"
        "{\n"
        '   "thought": "Concisely describe your reasoning here",\n'
        '   "passes": "yes" or "no"\n'
        "}\n"
        "```\n"
    )
format_input(input)

输入格式化为 ChatType

源代码位于 src/distilabel/steps/tasks/apigen/semantic_checker.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType`."""
    return [
        {"role": "system", "content": self.system_prompt},
        {
            "role": "user",
            "content": self._template.render(
                func_desc=input["func_desc"],
                query=input["query"] or "",
                func_call=input["answers"] or "",
                execution_result=input["execution_result"],
                format_inst=self._format_inst,
            ),
        },
    ]
format_output(output, input)

输出格式化为包含每个指令分数的列表。

参数

名称 类型 描述 默认值
output Union[str, None]

LLM 的原始输出。

必需
input Dict[str, Any]

任务的输入。用于获取响应数量。

必需

返回

类型 描述
Dict[str, Any]

包含查询和答案对的字典。

Dict[str, Any]

答案是与查询对应的答案数组。

Dict[str, Any]

每个答案都表示为一个对象,具有以下属性:- name (string):用于生成答案的工具的名称。- arguments (object):表示传递给工具以生成答案的参数的对象。

Dict[str, Any]

每个参数都表示为键值对,其中键是参数名称,

Dict[str, Any]

值是对应的值。

源代码位于 src/distilabel/steps/tasks/apigen/semantic_checker.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a list with the score of each instruction.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict with the queries and answers pairs.
        The answers are an array of answers corresponding to the query.
        Each answer is represented as an object with the following properties:
            - name (string): The name of the tool used to generate the answer.
            - arguments (object): An object representing the arguments passed to the tool to generate the answer.
        Each argument is represented as a key-value pair, where the key is the parameter name and the
        value is the corresponding value.
    """
    if output is None:
        return self._default_error(input)

    output = remove_fences(output)

    try:
        result = orjson.loads(output)
        # Update the column name and change to bool
        result["keep_row_after_semantic_check"] = (
            result.pop("passes").lower() == "yes"
        )
        input.update(**result)
        return input
    except orjson.JSONDecodeError:
        return self._default_error(input)
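下面是一个假设性的最小示意，说明 `format_output` 期望的原始 LLM 输出形式以及解析后的结果；源码中使用 orjson，这里用标准库 json 演示同样的逻辑：

```python
import json

# 假设性示意：原始输出为符合提示要求的 JSON 字符串
raw_output = '{"thought": "The function call matches the query.", "passes": "yes"}'

result = json.loads(raw_output)
# 将 "passes" 重命名为输出列 "keep_row_after_semantic_check" 并转换为布尔值
result["keep_row_after_semantic_check"] = result.pop("passes").lower() == "yes"
print(result)
# {'thought': 'The function call matches the query.', 'keep_row_after_semantic_check': True}
```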
_default_error(input)

任务的默认错误消息。

源代码位于 src/distilabel/steps/tasks/apigen/semantic_checker.py
def _default_error(self, input: Dict[str, Any]) -> Dict[str, Any]:
    """Default error message for the task."""
    input.update({"thought": None, "keep_row_after_semantic_check": None})
    return input
get_structured_output()

创建要传递给 LLM 的 json 模式,以强制生成一个字典,该字典的输出可以直接解析为 python 字典。

该模式对应于以下内容

from typing import Literal
from pydantic import BaseModel
import json

class Checker(BaseModel):
    thought: str
    passes: Literal["yes", "no"]

json.dumps(Checker.model_json_schema(), indent=4)

返回

类型 描述
Dict[str, Any]

强制执行的响应的 JSON 模式。

源代码位于 src/distilabel/steps/tasks/apigen/semantic_checker.py
@override
def get_structured_output(self) -> Dict[str, Any]:
    """Creates the json schema to be passed to the LLM, to enforce generating
    a dictionary with the output which can be directly parsed as a python dictionary.

    The schema corresponds to the following:

    ```python
    from typing import Literal
    from pydantic import BaseModel
    import json

    class Checker(BaseModel):
        thought: str
        passes: Literal["yes", "no"]

    json.dumps(Checker.model_json_schema(), indent=4)
    ```

    Returns:
        JSON Schema of the response to enforce.
    """
    return {
        "properties": {
            "thought": {"title": "Thought", "type": "string"},
            "passes": {"enum": ["yes", "no"], "title": "Passes", "type": "string"},
        },
        "required": ["thought", "passes"],
        "title": "Checker",
        "type": "object",
    }

ArgillaLabeller

基类:Task

根据输入字段、示例记录和问题设置注释 Argilla 记录。

此任务旨在通过利用预训练的 LLM 来促进 Argilla 记录的注释。它使用系统提示来引导 LLM 理解输入字段、问题类型和问题设置。然后,该任务格式化输入数据并根据问题生成响应。响应会根据问题的值模型进行验证,并准备最终的建议以进行注释。

属性

名称 类型 描述
_template Union[Template, None]

用于格式化 LLM 输入的 Jinja2 模板。

输入列
  • record (argilla.Record): 要注释的记录。
  • fields (Optional[List[Dict[str, Any]]]): 输入字段的字段设置列表。
  • question (Optional[Dict[str, Any]]): 要回答的问题的问题设置。
  • example_records (Optional[List[Dict[str, Any]]]): 包含用于回答问题的响应的少量示例记录。
  • guidelines (Optional[str]): 注释任务的指南。
输出列
  • suggestion (Dict[str, Any]): 用于注释的最终建议。
类别
  • 文本分类
  • 评分器
  • 文本生成
参考

示例

使用相同的数据集和问题注释记录

import argilla as rg
from argilla import Suggestion
from distilabel.steps.tasks import ArgillaLabeller
from distilabel.models import InferenceEndpointsLLM

# Get information from Argilla dataset definition
dataset = rg.Dataset("my_dataset")
pending_records_filter = rg.Filter(("status", "==", "pending"))
completed_records_filter = rg.Filter(("status", "==", "completed"))
pending_records = list(
    dataset.records(
        query=rg.Query(filter=pending_records_filter),
        limit=5,
    )
)
example_records = list(
    dataset.records(
        query=rg.Query(filter=completed_records_filter),
        limit=5,
    )
)
field = dataset.settings.fields["text"]
question = dataset.settings.questions["label"]

# Initialize the labeller with the model and fields
labeller = ArgillaLabeller(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    fields=[field],
    question=question,
    example_records=example_records,
    guidelines=dataset.guidelines
)
labeller.load()

# Process the pending records
result = next(
    labeller.process(
        [
            {
                "record": record
            } for record in pending_records
        ]
    )
)

# Add the suggestions to the records
for record, suggestion in zip(pending_records, result):
    record.suggestions.add(Suggestion(**suggestion["suggestion"]))

# Log the updated records
dataset.records.log(pending_records)
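根据 `format_output` 的实现，每个结果中的 `suggestion` 字典包含 `value`、`question_name`、`type` 与 `agent` 四个键，大致形如下面的示意（取值仅为假设性示例，实际内容取决于问题与所用模型）：

```python
# 假设性示意：result 中每个元素的 "suggestion" 字段的大致结构
example_suggestion = {
    "value": "positive",            # LLM 给出的答案值（此处假设为 label_selection 问题）
    "question_name": "label",       # 对应的问题名称
    "type": "model",                # 建议类型，由任务固定设置为 "model"
    "agent": "mistralai/Mistral-7B-Instruct-v0.2",  # 生成建议的模型名称
}
```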

使用交替的数据集和问题注释记录

import argilla as rg
from distilabel.steps.tasks import ArgillaLabeller
from distilabel.models import InferenceEndpointsLLM

# Get information from Argilla dataset definition
dataset = rg.Dataset("my_dataset")
field = dataset.settings.fields["text"]
question = dataset.settings.questions["label"]
question2 = dataset.settings.questions["label2"]

# Initialize the labeller with the model and fields
labeller = ArgillaLabeller(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)
labeller.load()

# Process the record
record = next(dataset.records())
result = next(
    labeller.process(
        [
            {
                "record": record,
                "fields": [field],
                "question": question,
            },
            {
                "record": record,
                "fields": [field],
                "question": question2,
            }
        ]
    )
)

# Add the suggestions to the record
for suggestion in result:
    record.suggestions.add(rg.Suggestion(**suggestion["suggestion"]))

# Log the updated record
dataset.records.log([record])

覆盖默认提示和指令

import argilla as rg
from distilabel.steps.tasks import ArgillaLabeller
from distilabel.models import InferenceEndpointsLLM

# Overwrite default prompts and instructions
labeller = ArgillaLabeller(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    system_prompt="You are an expert annotator and labelling assistant that understands complex domains and natural language processing.",
    question_to_label_instruction={
        "label_selection": "Select the appropriate label from the list of provided labels.",
        "multi_label_selection": "Select none, one or multiple labels from the list of provided labels.",
        "text": "Provide a text response to the question.",
        "rating": "Provide a rating for the question.",
    },
)
labeller.load()
源代码位于 src/distilabel/steps/tasks/argilla_labeller.py
class ArgillaLabeller(Task):
    """
    Annotate Argilla records based on input fields, example records and question settings.

    This task is designed to facilitate the annotation of Argilla records by leveraging a pre-trained LLM.
    It uses a system prompt that guides the LLM to understand the input fields, the question type,
    and the question settings. The task then formats the input data and generates a response based on the question.
    The response is validated against the question's value model, and the final suggestion is prepared for annotation.

    Attributes:
        _template: a Jinja2 template used to format the input for the LLM.

    Input columns:
        - record (`argilla.Record`): The record to be annotated.
        - fields (`Optional[List[Dict[str, Any]]]`): The list of field settings for the input fields.
        - question (`Optional[Dict[str, Any]]`): The question settings for the question to be answered.
        - example_records (`Optional[List[Dict[str, Any]]]`): The few shot example records with responses to be used to answer the question.
        - guidelines (`Optional[str]`): The guidelines for the annotation task.

    Output columns:
        - suggestion (`Dict[str, Any]`): The final suggestion for annotation.

    Categories:
        - text-classification
        - scorer
        - text-generation

    References:
        - [`Argilla: Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets`](https://github.com/argilla-io/argilla/)

    Examples:
        Annotate a record with the same dataset and question:

        ```python
        import argilla as rg
        from argilla import Suggestion
        from distilabel.steps.tasks import ArgillaLabeller
        from distilabel.models import InferenceEndpointsLLM

        # Get information from Argilla dataset definition
        dataset = rg.Dataset("my_dataset")
        pending_records_filter = rg.Filter(("status", "==", "pending"))
        completed_records_filter = rg.Filter(("status", "==", "completed"))
        pending_records = list(
            dataset.records(
                query=rg.Query(filter=pending_records_filter),
                limit=5,
            )
        )
        example_records = list(
            dataset.records(
                query=rg.Query(filter=completed_records_filter),
                limit=5,
            )
        )
        field = dataset.settings.fields["text"]
        question = dataset.settings.questions["label"]

        # Initialize the labeller with the model and fields
        labeller = ArgillaLabeller(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            fields=[field],
            question=question,
            example_records=example_records,
            guidelines=dataset.guidelines
        )
        labeller.load()

        # Process the pending records
        result = next(
            labeller.process(
                [
                    {
                        "record": record
                    } for record in pending_records
                ]
            )
        )

        # Add the suggestions to the records
        for record, suggestion in zip(pending_records, result):
            record.suggestions.add(Suggestion(**suggestion["suggestion"]))

        # Log the updated records
        dataset.records.log(pending_records)
        ```

        Annotate a record with alternating datasets and questions:

        ```python
        import argilla as rg
        from distilabel.steps.tasks import ArgillaLabeller
        from distilabel.models import InferenceEndpointsLLM

        # Get information from Argilla dataset definition
        dataset = rg.Dataset("my_dataset")
        field = dataset.settings.fields["text"]
        question = dataset.settings.questions["label"]
        question2 = dataset.settings.questions["label2"]

        # Initialize the labeller with the model and fields
        labeller = ArgillaLabeller(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            )
        )
        labeller.load()

        # Process the record
        record = next(dataset.records())
        result = next(
            labeller.process(
                [
                    {
                        "record": record,
                        "fields": [field],
                        "question": question,
                    },
                    {
                        "record": record,
                        "fields": [field],
                        "question": question2,
                    }
                ]
            )
        )

        # Add the suggestions to the record
        for suggestion in result:
            record.suggestions.add(rg.Suggestion(**suggestion["suggestion"]))

        # Log the updated record
        dataset.records.log([record])
        ```

        Overwrite default prompts and instructions:

        ```python
        import argilla as rg
        from distilabel.steps.tasks import ArgillaLabeller
        from distilabel.models import InferenceEndpointsLLM

        # Overwrite default prompts and instructions
        labeller = ArgillaLabeller(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            system_prompt="You are an expert annotator and labelling assistant that understands complex domains and natural language processing.",
            question_to_label_instruction={
                "label_selection": "Select the appropriate label from the list of provided labels.",
                "multi_label_selection": "Select none, one or multiple labels from the list of provided labels.",
                "text": "Provide a text response to the question.",
                "rating": "Provide a rating for the question.",
            },
        )
        labeller.load()
        ```
    """

    system_prompt: str = (
        "You are an expert annotator and labelling assistant that understands complex domains and natural language processing. "
        "You are given input fields and a question. "
        "You should create a valid JSON object as an response to the question based on the input fields. "
    )
    question_to_label_instruction: Dict[str, str] = {
        "label_selection": "Select the appropriate label for the fields from the list of optional labels.",
        "multi_label_selection": "Select none, one or multiple labels for the fields from the list of optional labels.",
        "text": "Provide a response to the question based on the fields.",
        "rating": "Provide a rating for the question based on the fields.",
    }
    example_records: Optional[
        RuntimeParameter[Union[List[Union[Dict[str, Any], BaseModel]], None]]
    ] = Field(
        default=None,
        description="The few shot serialized example records or `BaseModel`s with responses to be used to answer the question.",
    )
    fields: Optional[
        RuntimeParameter[Union[List[Union[BaseModel, Dict[str, Any]]], None]]
    ] = Field(
        default=None,
        description="The field serialized field settings or `BaseModel` for the fields to be used to answer the question.",
    )
    question: Optional[
        RuntimeParameter[
            Union[
                Dict[str, Any],
                BaseModel,
                None,
            ]
        ]
    ] = Field(
        default=None,
        description="The question serialized question settings or `BaseModel` for the question to be answered.",
    )
    guidelines: Optional[RuntimeParameter[str]] = Field(
        default=None,
        description="The guidelines for the annotation task.",
    )

    _template: Union[Template, None] = PrivateAttr(...)
    _client: Optional[Any] = PrivateAttr(None)

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "argillalabeller.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> Dict[str, bool]:
        return {
            "record": True,
            "fields": False,
            "question": False,
            "example_records": False,
            "guidelines": False,
        }

    def _format_record(
        self, record: Dict[str, Any], fields: List[Dict[str, Any]]
    ) -> str:
        """Format the record fields into a string.

        Args:
            record (Dict[str, Any]): The record to format.
            fields (List[Dict[str, Any]]): The fields to format.

        Returns:
            str: The formatted record fields.
        """
        output = []
        for field in fields:
            output.append(record.get("fields", {}).get(field.get("name", "")))
        return "fields: " + "\n".join(output)

    def _get_label_instruction(self, question: Dict[str, Any]) -> str:
        """Get the label instruction for the question.

        Args:
            question (Dict[str, Any]): The question to get the label instruction for.

        Returns:
            str: The label instruction for the question.
        """
        question_type = question["settings"]["type"]
        return self.question_to_label_instruction[question_type]

    def _format_question(self, question: Dict[str, Any]) -> str:
        """Format the question settings into a string.

        Args:
            question (Dict[str, Any]): The question to format.

        Returns:
            str: The formatted question.
        """
        output = []
        output.append(f"question: {self._get_label_instruction(question)}")
        if "options" in question.get("settings", {}):
            output.append(
                f"optional labels: {[option['value'] for option in question.get('settings', {}).get('options', [])]}"
            )
        return "\n".join(output)

    def _format_example_records(
        self,
        records: List[Dict[str, Any]],
        fields: List[Dict[str, Any]],
        question: Dict[str, Any],
    ) -> str:
        """Format the example records into a string.

        Args:
            records (List[Dict[str, Any]]): The records to format.
            fields (List[Dict[str, Any]]): The fields to format.
            question (Dict[str, Any]): The question to format.

        Returns:
            str: The formatted example records.
        """
        base = []
        for record in records:
            responses = record.get("responses", {})
            if responses.get(question["name"]):
                base.append(self._format_record(record, fields))
                value = responses[question["name"]][0]["value"]
                formatted_value = self._assign_value_to_question_value_model(
                    value, question
                )
                base.append(f"response: {formatted_value}")
                base.append("")
            else:
                warnings.warn(
                    f"Record {record} has no response for question {question['name']}. Skipping example record.",
                    stacklevel=2,
                )
        return "\n".join(base)

    def format_input(
        self,
        input: Dict[
            str,
            Union[
                Dict[str, Any],
                "Record",
                "TextField",
                "MultiLabelQuestion",
                "LabelQuestion",
                "RatingQuestion",
                "TextQuestion",
            ],
        ],
    ) -> "ChatType":
        """Format the input into a chat message.

        Args:
            input: The input to format.

        Returns:
            The formatted chat message.

        Raises:
            ValueError: If question or fields are not provided.
        """
        input_keys = list(self.inputs.keys())
        record = input[input_keys[0]]
        fields = input.get(input_keys[1], self.fields)
        question = input.get(input_keys[2], self.question)
        examples = input.get(input_keys[3], self.example_records)
        guidelines = input.get(input_keys[4], self.guidelines)

        if question is None:
            raise ValueError("Question must be provided.")
        if fields is None or any(field is None for field in fields):
            raise ValueError("Fields must be provided.")

        record = record.to_dict() if not isinstance(record, dict) else record
        question = question.serialize() if not isinstance(question, dict) else question
        fields = [
            field.serialize() if not isinstance(field, dict) else field
            for field in fields
        ]
        examples = (
            [
                example.to_dict() if not isinstance(example, dict) else example
                for example in examples
            ]
            if examples
            else None
        )

        formatted_fields = self._format_record(record, fields)
        formatted_question = self._format_question(question)
        formatted_examples = (
            self._format_example_records(examples, fields, question)
            if examples
            else False
        )

        prompt = self._template.render(
            fields=formatted_fields,
            question=formatted_question,
            examples=formatted_examples,
            guidelines=guidelines,
        )

        messages = []
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt})
        messages.append({"role": "user", "content": prompt})
        return messages

    @property
    def outputs(self) -> List[str]:
        return ["suggestion"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """Format the output into a dictionary.

        Args:
            output (Union[str, None]): The output to format.
            input (Dict[str, Any]): The input to format.

        Returns:
            Dict[str, Any]: The formatted output.
        """
        from argilla import Suggestion

        question: Union[
            Any,
            Dict[str, Any],
            LabelQuestion,
            MultiLabelQuestion,
            RatingQuestion,
            TextQuestion,
            None,
        ] = input.get(list(self.inputs.keys())[2], self.question) or self.question
        question = question.serialize() if not isinstance(question, dict) else question
        model = self._get_pydantic_model_of_structured_output(question)
        validated_output = model(**json.loads(output))
        value = self._get_value_from_question_value_model(validated_output)
        suggestion = Suggestion(
            value=value,
            question_name=question["name"],
            type="model",
            agent=self.llm.model_name,
        ).serialize()
        return {
            self.outputs[0]: {
                k: v
                for k, v in suggestion.items()
                if k in ["value", "question_name", "type", "agent"]
            }
        }

    def _set_llm_structured_output_for_question(self, question: Dict[str, Any]) -> None:
        runtime_parameters = self.llm._runtime_parameters
        runtime_parameters.update(
            {
                "structured_output": {
                    "format": "json",
                    "schema": self._get_pydantic_model_of_structured_output(question),
                },
            }
        )
        self.llm.set_runtime_parameters(runtime_parameters)

    @override
    def process(self, inputs: StepInput) -> "StepOutput":
        """Process the input through the task.

        Args:
            inputs (StepInput): The input to process.

        Returns:
            StepOutput: The output of the task.
        """

        question_list = [input.get("question", self.question) for input in inputs]
        fields_list = [input.get("fields", self.fields) for input in inputs]
        # check if any field for the field in fields is None
        for fields in fields_list:
            if any(field is None for field in fields):
                raise ValueError(
                    "Fields must be provided during init or through `process` method."
                )
        # check if any question is None
        if any(question is None for question in question_list):
            raise ValueError(
                "Question must be provided during init or through `process` method."
            )
        question_list = [
            question.serialize() if not isinstance(question, dict) else question
            for question in question_list
        ]
        if not all(question == question_list[0] for question in question_list):
            warnings.warn(
                "Not all questions are the same. Processing each question separately by setting the structured output for each question. This may impact performance.",
                stacklevel=2,
            )
            for input, question in zip(inputs, question_list):
                self._set_llm_structured_output_for_question(question)
                yield from super().process([input])
        else:
            question = question_list[0]
            self._set_llm_structured_output_for_question(question)
            yield from super().process(inputs)

    def _get_value_from_question_value_model(
        self, question_value_model: BaseModel
    ) -> Any:
        """Get the value from the question value model.

        Args:
            question_value_model (BaseModel): The question value model to get the value from.

        Returns:
            Any: The value from the question value model.
        """
        for attr in ["label", "labels", "rating", "text"]:
            if hasattr(question_value_model, attr):
                return getattr(question_value_model, attr)
        raise ValueError(f"Unsupported question type: {question_value_model}")

    def _assign_value_to_question_value_model(
        self, value: Any, question: Dict[str, Any]
    ) -> BaseModel:
        """Assign the value to the question value model.

        Args:
            value (Any): The value to assign.
            question (Dict[str, Any]): The question to assign the value to.

        Returns:
            BaseModel: The question value model with the assigned value.
        """
        question_value_model = self._get_pydantic_model_of_structured_output(question)
        for attr in ["label", "labels", "rating", "text"]:
            try:
                model_dict = {attr: value}
                question_value_model = question_value_model(**model_dict)
                return question_value_model.model_dump_json()
            except AttributeError:
                pass
        return value

    def _get_pydantic_model_of_structured_output(
        self,
        question: Dict[str, Any],
    ) -> BaseModel:
        """Get the Pydantic model of the structured output.

        Args:
            question (Dict[str, Any]): The question to get the Pydantic model of the structured output for.

        Returns:
            BaseModel: The Pydantic model of the structured output.
        """

        question_type = question["settings"]["type"]

        if question_type == "multi_label_selection":

            class QuestionValueModel(BaseModel):
                labels: Optional[List[str]] = Field(default_factory=list)

        elif question_type == "label_selection":

            class QuestionValueModel(BaseModel):
                label: str

        elif question_type == "text":

            class QuestionValueModel(BaseModel):
                text: str

        elif question_type == "rating":

            class QuestionValueModel(BaseModel):
                rating: int
        else:
            raise ValueError(f"Unsupported question type: {question}")

        return QuestionValueModel
load()

加载 Jinja2 模板。

源代码位于 src/distilabel/steps/tasks/argilla_labeller.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "argillalabeller.jinja2"
    )

    self._template = Template(open(_path).read())
_format_record(record, fields)

将记录字段格式化为字符串。

参数

名称 类型 描述 默认值
record Dict[str, Any]

要格式化的记录。

必需
fields List[Dict[str, Any]]

要格式化的字段。

必需

返回

名称 类型 描述
str str

格式化的记录字段。

源代码位于 src/distilabel/steps/tasks/argilla_labeller.py
def _format_record(
    self, record: Dict[str, Any], fields: List[Dict[str, Any]]
) -> str:
    """Format the record fields into a string.

    Args:
        record (Dict[str, Any]): The record to format.
        fields (List[Dict[str, Any]]): The fields to format.

    Returns:
        str: The formatted record fields.
    """
    output = []
    for field in fields:
        output.append(record.get("fields", {}).get(field.get("name", "")))
    return "fields: " + "\n".join(output)
_get_label_instruction(question)

获取问题标签指令。

参数

名称 类型 描述 默认值
question Dict[str, Any]

要获取标签指令的问题。

必需

返回

名称 类型 描述
str str

问题的标签指令。

源代码位于 src/distilabel/steps/tasks/argilla_labeller.py
def _get_label_instruction(self, question: Dict[str, Any]) -> str:
    """Get the label instruction for the question.

    Args:
        question (Dict[str, Any]): The question to get the label instruction for.

    Returns:
        str: The label instruction for the question.
    """
    question_type = question["settings"]["type"]
    return self.question_to_label_instruction[question_type]
_format_question(question)

将问题设置格式化为字符串。

参数

名称 类型 描述 默认值
question Dict[str, Any]

要格式化的问题。

必需

返回

名称 类型 描述
str str

格式化后的问题。

源代码位于 src/distilabel/steps/tasks/argilla_labeller.py
def _format_question(self, question: Dict[str, Any]) -> str:
    """Format the question settings into a string.

    Args:
        question (Dict[str, Any]): The question to format.

    Returns:
        str: The formatted question.
    """
    output = []
    output.append(f"question: {self._get_label_instruction(question)}")
    if "options" in question.get("settings", {}):
        output.append(
            f"optional labels: {[option['value'] for option in question.get('settings', {}).get('options', [])]}"
        )
    return "\n".join(output)
_format_example_records(records, fields, question)

将示例记录格式化为字符串。

参数

名称 类型 描述 默认值
records List[Dict[str, Any]]

要格式化的记录。

必需
fields List[Dict[str, Any]]

要格式化的字段。

必需
question Dict[str, Any]

要格式化的问题。

必需

返回

名称 类型 描述
str str

格式化后的示例记录。

源代码位于 src/distilabel/steps/tasks/argilla_labeller.py
def _format_example_records(
    self,
    records: List[Dict[str, Any]],
    fields: List[Dict[str, Any]],
    question: Dict[str, Any],
) -> str:
    """Format the example records into a string.

    Args:
        records (List[Dict[str, Any]]): The records to format.
        fields (List[Dict[str, Any]]): The fields to format.
        question (Dict[str, Any]): The question to format.

    Returns:
        str: The formatted example records.
    """
    base = []
    for record in records:
        responses = record.get("responses", {})
        if responses.get(question["name"]):
            base.append(self._format_record(record, fields))
            value = responses[question["name"]][0]["value"]
            formatted_value = self._assign_value_to_question_value_model(
                value, question
            )
            base.append(f"response: {formatted_value}")
            base.append("")
        else:
            warnings.warn(
                f"Record {record} has no response for question {question['name']}. Skipping example record.",
                stacklevel=2,
            )
    return "\n".join(base)
format_input(input)

将输入格式化为聊天消息。

参数

名称 类型 描述 默认值
input Dict[str, Union[Dict[str, Any], Record, TextField, MultiLabelQuestion, LabelQuestion, RatingQuestion, TextQuestion]]

要格式化的输入。

必需

返回

类型 描述
ChatType

格式化后的聊天消息。

引发

类型 描述
ValueError

如果未提供问题或字段。

源代码位于 src/distilabel/steps/tasks/argilla_labeller.py
def format_input(
    self,
    input: Dict[
        str,
        Union[
            Dict[str, Any],
            "Record",
            "TextField",
            "MultiLabelQuestion",
            "LabelQuestion",
            "RatingQuestion",
            "TextQuestion",
        ],
    ],
) -> "ChatType":
    """Format the input into a chat message.

    Args:
        input: The input to format.

    Returns:
        The formatted chat message.

    Raises:
        ValueError: If question or fields are not provided.
    """
    input_keys = list(self.inputs.keys())
    record = input[input_keys[0]]
    fields = input.get(input_keys[1], self.fields)
    question = input.get(input_keys[2], self.question)
    examples = input.get(input_keys[3], self.example_records)
    guidelines = input.get(input_keys[4], self.guidelines)

    if question is None:
        raise ValueError("Question must be provided.")
    if fields is None or any(field is None for field in fields):
        raise ValueError("Fields must be provided.")

    record = record.to_dict() if not isinstance(record, dict) else record
    question = question.serialize() if not isinstance(question, dict) else question
    fields = [
        field.serialize() if not isinstance(field, dict) else field
        for field in fields
    ]
    examples = (
        [
            example.to_dict() if not isinstance(example, dict) else example
            for example in examples
        ]
        if examples
        else None
    )

    formatted_fields = self._format_record(record, fields)
    formatted_question = self._format_question(question)
    formatted_examples = (
        self._format_example_records(examples, fields, question)
        if examples
        else False
    )

    prompt = self._template.render(
        fields=formatted_fields,
        question=formatted_question,
        examples=formatted_examples,
        guidelines=guidelines,
    )

    messages = []
    if self.system_prompt:
        messages.append({"role": "system", "content": self.system_prompt})
    messages.append({"role": "user", "content": prompt})
    return messages
format_output(output, input)

将输出格式化为字典。

参数

名称 类型 描述 默认值
output Union[str, None]

要格式化的输出。

必需
input Dict[str, Any]

要格式化的输入。

必需

返回

类型 描述
Dict[str, Any]

Dict[str, Any]: 格式化后的输出。

源代码位于 src/distilabel/steps/tasks/argilla_labeller.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """Format the output into a dictionary.

    Args:
        output (Union[str, None]): The output to format.
        input (Dict[str, Any]): The input to format.

    Returns:
        Dict[str, Any]: The formatted output.
    """
    from argilla import Suggestion

    question: Union[
        Any,
        Dict[str, Any],
        LabelQuestion,
        MultiLabelQuestion,
        RatingQuestion,
        TextQuestion,
        None,
    ] = input.get(list(self.inputs.keys())[2], self.question) or self.question
    question = question.serialize() if not isinstance(question, dict) else question
    model = self._get_pydantic_model_of_structured_output(question)
    validated_output = model(**json.loads(output))
    value = self._get_value_from_question_value_model(validated_output)
    suggestion = Suggestion(
        value=value,
        question_name=question["name"],
        type="model",
        agent=self.llm.model_name,
    ).serialize()
    return {
        self.outputs[0]: {
            k: v
            for k, v in suggestion.items()
            if k in ["value", "question_name", "type", "agent"]
        }
    }
process(inputs)

Process the input through the task.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| inputs | StepInput | The input to process. | required |

Returns

| Name | Type | Description |
| --- | --- | --- |
| StepOutput | StepOutput | The output of the task. |

Source code in src/distilabel/steps/tasks/argilla_labeller.py
@override
def process(self, inputs: StepInput) -> "StepOutput":
    """Process the input through the task.

    Args:
        inputs (StepInput): The input to process.

    Returns:
        StepOutput: The output of the task.
    """

    question_list = [input.get("question", self.question) for input in inputs]
    fields_list = [input.get("fields", self.fields) for input in inputs]
    # check if any field for the field in fields is None
    for fields in fields_list:
        if any(field is None for field in fields):
            raise ValueError(
                "Fields must be provided during init or through `process` method."
            )
    # check if any question is None
    if any(question is None for question in question_list):
        raise ValueError(
            "Question must be provided during init or through `process` method."
        )
    question_list = [
        question.serialize() if not isinstance(question, dict) else question
        for question in question_list
    ]
    if not all(question == question_list[0] for question in question_list):
        warnings.warn(
            "Not all questions are the same. Processing each question separately by setting the structured output for each question. This may impact performance.",
            stacklevel=2,
        )
        for input, question in zip(inputs, question_list):
            self._set_llm_structured_output_for_question(question)
            yield from super().process([input])
    else:
        question = question_list[0]
        self._set_llm_structured_output_for_question(question)
        yield from super().process(inputs)
_get_value_from_question_value_model(question_value_model)

Get the value from the question value model.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| question_value_model | BaseModel | The question value model to get the value from. | required |

Returns

| Name | Type | Description |
| --- | --- | --- |
| Any | Any | The value from the question value model. |

Source code in src/distilabel/steps/tasks/argilla_labeller.py
def _get_value_from_question_value_model(
    self, question_value_model: BaseModel
) -> Any:
    """Get the value from the question value model.

    Args:
        question_value_model (BaseModel): The question value model to get the value from.

    Returns:
        Any: The value from the question value model.
    """
    for attr in ["label", "labels", "rating", "text"]:
        if hasattr(question_value_model, attr):
            return getattr(question_value_model, attr)
    raise ValueError(f"Unsupported question type: {question_value_model}")
_assign_value_to_question_value_model(value, question)

Assign the value to the question value model.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| value | Any | The value to assign. | required |
| question | Dict[str, Any] | The question to assign the value to. | required |

Returns

| Name | Type | Description |
| --- | --- | --- |
| BaseModel | BaseModel | The question value model with the assigned value. |

Source code in src/distilabel/steps/tasks/argilla_labeller.py
def _assign_value_to_question_value_model(
    self, value: Any, question: Dict[str, Any]
) -> BaseModel:
    """Assign the value to the question value model.

    Args:
        value (Any): The value to assign.
        question (Dict[str, Any]): The question to assign the value to.

    Returns:
        BaseModel: The question value model with the assigned value.
    """
    question_value_model = self._get_pydantic_model_of_structured_output(question)
    for attr in ["label", "labels", "rating", "text"]:
        try:
            model_dict = {attr: value}
            question_value_model = question_value_model(**model_dict)
            return question_value_model.model_dump_json()
        except AttributeError:
            pass
    return value
_get_pydantic_model_of_structured_output(question)

Get the Pydantic model of the structured output.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| question | Dict[str, Any] | The question to get the Pydantic model of the structured output for. | required |

Returns

| Name | Type | Description |
| --- | --- | --- |
| BaseModel | BaseModel | The Pydantic model of the structured output. |

Source code in src/distilabel/steps/tasks/argilla_labeller.py
def _get_pydantic_model_of_structured_output(
    self,
    question: Dict[str, Any],
) -> BaseModel:
    """Get the Pydantic model of the structured output.

    Args:
        question (Dict[str, Any]): The question to get the Pydantic model of the structured output for.

    Returns:
        BaseModel: The Pydantic model of the structured output.
    """

    question_type = question["settings"]["type"]

    if question_type == "multi_label_selection":

        class QuestionValueModel(BaseModel):
            labels: Optional[List[str]] = Field(default_factory=list)

    elif question_type == "label_selection":

        class QuestionValueModel(BaseModel):
            label: str

    elif question_type == "text":

        class QuestionValueModel(BaseModel):
            text: str

    elif question_type == "rating":

        class QuestionValueModel(BaseModel):
            rating: int
    else:
        raise ValueError(f"Unsupported question type: {question}")

    return QuestionValueModel
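
To see how these per-question models validate the LLM's structured output, here is a minimal, self-contained sketch with equivalent models for two of the question types (the JSON payloads are invented for the example):

```python
import json
from typing import List, Optional

from pydantic import BaseModel, Field

# Equivalent to the models built above for "rating" and "multi_label_selection" questions.
class RatingValue(BaseModel):
    rating: int

class MultiLabelValue(BaseModel):
    labels: Optional[List[str]] = Field(default_factory=list)

# Invented raw LLM outputs being validated against the models.
rating = RatingValue(**json.loads('{"rating": 4}'))
labels = MultiLabelValue(**json.loads('{"labels": ["news", "sports"]}'))

print(rating.rating)  # 4
print(labels.labels)  # ['news', 'sports']
```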

CLAIR

Bases: Task

Contrastive Learning from AI Revisions (CLAIR).

CLAIR uses an AI system to minimally revise a solution A→A´ so that the resulting preference A preferred A´ is much more contrastive and precise.

Input columns
  • task (str): The task or instruction.
  • student_solution (str): An answer to the task that is to be revised.
Output columns
  • revision (str): The revised text.
  • rational (str): The rationale for the provided revision.
  • model_name (str): The name of the model used to generate the revision and rationale.
Categories
  • preference
  • text-generation
References

Examples

Create contrastive preference pairs:

from distilabel.steps.tasks import CLAIR
from distilabel.models import InferenceEndpointsLLM

llm=InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 4096,
    },
)
clair_task = CLAIR(llm=llm)

clair_task.load()

result = next(
    clair_task.process(
        [
            {
                "task": "How many gaps are there between the earth and the moon?",
                "student_solution": 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon's orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\n\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.'
            }
        ]
    )
)
# result
# [{'task': 'How many gaps are there between the earth and the moon?',
# 'student_solution': 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\n\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.',
# 'revision': 'There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\'s orbital path, not the presence of any gaps.\n\nIn summary, the Moon\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',
# 'rational': 'The student\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term "gaps" can be misleading in this context. The student should clarify what they mean by "gaps." Secondly, the student provides some additional information about the Moon\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\'s conclusion could be more concise.',
# 'distilabel_metadata': {'raw_output_c_l_a_i_r_0': '{teacher_reasoning}: The student\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term "gaps" can be misleading in this context. The student should clarify what they mean by "gaps." Secondly, the student provides some additional information about the Moon\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\'s conclusion could be more concise.\n\n{corrected_student_solution}: There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\'s orbital path, not the presence of any gaps.\n\nIn summary, the Moon\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',
# 'raw_input_c_l_a_i_r_0': [{'role': 'system',
#     'content': "You are a teacher and your task is to minimally improve a student's answer. I will give you a {task} and a {student_solution}. Your job is to revise the {student_solution} such that it is clearer, more correct, and more engaging. Copy all non-corrected parts of the student's answer. Do not allude to the {corrected_student_solution} being a revision or a correction in your final solution."},
#     {'role': 'user',
#     'content': '{task}: How many gaps are there between the earth and the moon?\n\n{student_solution}: There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\n\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.\n\n-----------------\n\nLet\'s first think step by step with a {teacher_reasoning} to decide how to improve the {student_solution}, then give the {corrected_student_solution}. Mention the {teacher_reasoning} and {corrected_student_solution} identifiers to structure your answer.'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]

Citations

```
@misc{doosterlinck2024anchoredpreferenceoptimizationcontrastive,
    title={Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment},
    author={Karel D'Oosterlinck and Winnie Xu and Chris Develder and Thomas Demeester and Amanpreet Singh and Christopher Potts and Douwe Kiela and Shikib Mehri},
    year={2024},
    eprint={2408.06266},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2408.06266},
}
```
Source code in src/distilabel/steps/tasks/clair.py
class CLAIR(Task):
    r"""Contrastive Learning from AI Revisions (CLAIR).

    CLAIR uses an AI system to minimally revise a solution A→A´ such that the resulting
    preference A `preferred` A’ is much more contrastive and precise.

    Input columns:
        - task (`str`): The task or instruction.
        - student_solution (`str`): An answer to the task that is to be revised.

    Output columns:
        - revision (`str`): The revised text.
        - rational (`str`): The rational for the provided revision.
        - model_name (`str`): The name of the model used to generate the revision and rational.

    Categories:
        - preference
        - text-generation

    References:
        - [`Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment`](https://arxiv.org/abs/2408.06266v1)
        - [`APO and CLAIR - GitHub Repository`](https://github.com/ContextualAI/CLAIR_and_APO)

    Examples:
        Create contrastive preference pairs:

        ```python
        from distilabel.steps.tasks import CLAIR
        from distilabel.models import InferenceEndpointsLLM

        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            generation_kwargs={
                "temperature": 0.7,
                "max_new_tokens": 4096,
            },
        )
        clair_task = CLAIR(llm=llm)

        clair_task.load()

        result = next(
            clair_task.process(
                [
                    {
                        "task": "How many gaps are there between the earth and the moon?",
                        "student_solution": 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon's orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\n\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.'
                    }
                ]
            )
        )
        # result
        # [{'task': 'How many gaps are there between the earth and the moon?',
        # 'student_solution': 'There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\n\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.',
        # 'revision': 'There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\'s orbital path, not the presence of any gaps.\n\nIn summary, the Moon\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',
        # 'rational': 'The student\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term "gaps" can be misleading in this context. The student should clarify what they mean by "gaps." Secondly, the student provides some additional information about the Moon\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\'s conclusion could be more concise.',
        # 'distilabel_metadata': {'raw_output_c_l_a_i_r_0': '{teacher_reasoning}: The student\'s solution provides a clear and concise answer to the question. However, there are a few areas where it can be improved. Firstly, the term "gaps" can be misleading in this context. The student should clarify what they mean by "gaps." Secondly, the student provides some additional information about the Moon\'s orbit, which is correct but could be more clearly connected to the main point. Lastly, the student\'s conclusion could be more concise.\n\n{corrected_student_solution}: There are no physical gaps or empty spaces between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a significant separation or gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range. This variation in distance is a result of the Moon\'s orbital path, not the presence of any gaps.\n\nIn summary, the Moon\'s orbit is continuous, with no intervening gaps, and its distance from the Earth varies due to the elliptical shape of its orbit.',
        # 'raw_input_c_l_a_i_r_0': [{'role': 'system',
        #     'content': "You are a teacher and your task is to minimally improve a student's answer. I will give you a {task} and a {student_solution}. Your job is to revise the {student_solution} such that it is clearer, more correct, and more engaging. Copy all non-corrected parts of the student's answer. Do not allude to the {corrected_student_solution} being a revision or a correction in your final solution."},
        #     {'role': 'user',
        #     'content': '{task}: How many gaps are there between the earth and the moon?\n\n{student_solution}: There are no gaps between the Earth and the Moon. The Moon is actually in a close orbit around the Earth, and it is held in place by gravity. The average distance between the Earth and the Moon is about 384,400 kilometers (238,900 miles), and this distance is known as the "lunar distance" or "lunar mean distance."\n\nThe Moon does not have a gap between it and the Earth because it is a natural satellite that is gravitationally bound to our planet. The Moon\'s orbit is elliptical, which means that its distance from the Earth varies slightly over the course of a month, but it always remains within a certain range.\n\nSo, to summarize, there are no gaps between the Earth and the Moon. The Moon is simply a satellite that orbits the Earth, and its distance from our planet varies slightly due to the elliptical shape of its orbit.\n\n-----------------\n\nLet\'s first think step by step with a {teacher_reasoning} to decide how to improve the {student_solution}, then give the {corrected_student_solution}. Mention the {teacher_reasoning} and {corrected_student_solution} identifiers to structure your answer.'}]},
        # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
        ```

    Citations:

        ```
        @misc{doosterlinck2024anchoredpreferenceoptimizationcontrastive,
            title={Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment},
            author={Karel D'Oosterlinck and Winnie Xu and Chris Develder and Thomas Demeester and Amanpreet Singh and Christopher Potts and Douwe Kiela and Shikib Mehri},
            year={2024},
            eprint={2408.06266},
            archivePrefix={arXiv},
            primaryClass={cs.LG},
            url={https://arxiv.org/abs/2408.06266},
        }
        ```
    """

    system_prompt: str = SYSTEM_PROMPT
    _template: Union[Template, None] = PrivateAttr(...)

    def load(self) -> None:
        super().load()
        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "clair.jinja2"
        )
        with open(_path, "r") as f:
            self._template = Template(f.read())

    @property
    def inputs(self) -> "StepColumns":
        return ["task", "student_solution"]

    @property
    def outputs(self) -> "StepColumns":
        return ["revision", "rational", "model_name"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {"role": "system", "content": self.system_prompt},
            {
                "role": "user",
                "content": self._template.render(
                    task=input["task"], student_solution=input["student_solution"]
                ),
            },
        ]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the score of each instruction-response pair.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict with the key `scores` containing the scores for each instruction-response pair.
        """
        if output is None:
            return self._default_error()

        return self._format_output(output)

    def _format_output(self, output: Union[str, None]) -> Dict[str, Any]:
        if "**Corrected Student Solution:**" in output:
            splits = output.split("**Corrected Student Solution:**")
        elif "{corrected_student_solution}:" in output:
            splits = output.split("{corrected_student_solution}:")
        elif "{corrected_student_solution}" in output:
            splits = output.split("{corrected_student_solution}")
        elif "**Worsened Student Solution:**" in output:
            splits = output.split("**Worsened Student Solution:**")
        elif "{worsened_student_solution}:" in output:
            splits = output.split("{worsened_student_solution}:")
        elif "{worsened_student_solution}" in output:
            splits = output.split("{worsened_student_solution}")
        else:
            splits = None

        # Safety check when the output doesn't follow the expected format
        if not splits:
            return self._default_error()

        if len(splits) >= 2:
            revision = splits[1]
            revision = revision.strip("\n\n").strip()  # noqa: B005

            rational = splits[0]
            if "{teacher_reasoning}" in rational:
                rational = rational.split("{teacher_reasoning}")[1].strip(":").strip()
            rational = rational.strip("\n\n").strip()  # noqa: B005
        else:
            return self._default_error()
        return {"revision": revision, "rational": rational}

    def _default_error(self) -> Dict[str, None]:
        return {"revision": None, "rational": None}
format_input(input)

The input is formatted as a ChatType, assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/clair.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {"role": "system", "content": self.system_prompt},
        {
            "role": "user",
            "content": self._template.render(
                task=input["task"], student_solution=input["student_solution"]
            ),
        },
    ]
format_output(output, input)

Formats the raw LLM output into a dictionary with the revision and the rationale extracted from it.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| output | Union[str, None] | The raw output of the LLM. | required |
| input | Dict[str, Any] | The input to the task. | required |

Returns

| Type | Description |
| --- | --- |
| Dict[str, Any] | A dict with the keys revision and rational. |

Source code in src/distilabel/steps/tasks/clair.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a list with the score of each instruction-response pair.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict with the key `scores` containing the scores for each instruction-response pair.
    """
    if output is None:
        return self._default_error()

    return self._format_output(output)

ComplexityScorer

Bases: Task

Score instructions based on their complexity using an LLM.

ComplexityScorer is a pre-defined task used to rank a list of instructions based on their complexity. It is an implementation of the complexity score task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| _template | Union[Template, None] | A Jinja2 template used to format the input for the LLM. |

Input columns
  • instructions (List[str]): The list of instructions to be scored.
Output columns
  • scores (List[float]): The score for each instruction.
  • model_name (str): The model name used to generate the scores.
Categories
  • scorer
  • complexity
  • instruction
References

Examples

Evaluate the complexity of your instructions:

from distilabel.steps.tasks import ComplexityScorer
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
scorer = ComplexityScorer(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

scorer.load()

result = next(
    scorer.process(
        [{"instructions": ["plain instruction", "highly complex instruction"]}]
    )
)
# result
# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]

Generate structured output with the default schema:

from distilabel.steps.tasks import ComplexityScorer
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
scorer = ComplexityScorer(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    use_default_structured_output=True
)

scorer.load()

result = next(
    scorer.process(
        [{"instructions": ["plain instruction", "highly complex instruction"]}]
    )
)
# result
# [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 2], 'distilabel_metadata': {'raw_output_complexity_scorer_0': '{ \n  "scores": [\n    1, \n    2\n  ]\n}'}}]
Citations
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
Source code in src/distilabel/steps/tasks/complexity_scorer.py
class ComplexityScorer(Task):
    """Score instructions based on their complexity using an `LLM`.

    `ComplexityScorer` is a pre-defined task used to rank a list of instructions based in
    their complexity. It's an implementation of the complexity score task from the paper
    'What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection
    in Instruction Tuning'.

    Attributes:
        _template: a Jinja2 template used to format the input for the LLM.

    Input columns:
        - instructions (`List[str]`): The list of instructions to be scored.

    Output columns:
        - scores (`List[float]`): The score for each instruction.
        - model_name (`str`): The model name used to generate the scores.

    Categories:
        - scorer
        - complexity
        - instruction

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)

    Examples:
        Evaluate the complexity of your instructions:

        ```python
        from distilabel.steps.tasks import ComplexityScorer
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        scorer = ComplexityScorer(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            )
        )

        scorer.load()

        result = next(
            scorer.process(
                [{"instructions": ["plain instruction", "highly complex instruction"]}]
            )
        )
        # result
        # [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 5], 'distilabel_metadata': {'raw_output_complexity_scorer_0': 'output'}}]
        ```

        Generate structured output with default schema:

        ```python
        from distilabel.steps.tasks import ComplexityScorer
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        scorer = ComplexityScorer(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            use_default_structured_output=use_default_structured_output
        )

        scorer.load()

        result = next(
            scorer.process(
                [{"instructions": ["plain instruction", "highly complex instruction"]}]
            )
        )
        # result
        # [{'instructions': ['plain instruction', 'highly complex instruction'], 'model_name': 'test', 'scores': [1, 2], 'distilabel_metadata': {'raw_output_complexity_scorer_0': '{ \\n  "scores": [\\n    1, \\n    2\\n  ]\\n}'}}]
        ```

    Citations:
        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```
    """

    _template: Union[Template, None] = PrivateAttr(...)
    _can_be_used_with_offline_batch_generation = True

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "complexity-scorer.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task are the `instructions`."""
        return ["instructions"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "user",
                "content": self._template.render(instructions=input["instructions"]),  # type: ignore
            }
        ]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are: a list of `scores` containing the complexity score for each
        instruction in `instructions`, and the `model_name`."""
        return ["scores", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the score of each instruction.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict with the key `scores` containing the scores for each instruction.
        """
        if output is None:
            return {"scores": [None] * len(input["instructions"])}

        if self.use_default_structured_output:
            return self._format_structured_output(output, input)

        scores = []
        score_lines = output.split("\n")
        for i, line in enumerate(score_lines):
            match = _PARSE_SCORE_LINE_REGEX.search(line)
            score = float(match.group(1)) if match else None
            scores.append(score)
            if i == len(input["instructions"]) - 1:
                break
        return {"scores": scores}

    @override
    def get_structured_output(self) -> Dict[str, Any]:
        """Creates the json schema to be passed to the LLM, to enforce generating
        a dictionary with the output which can be directly parsed as a python dictionary.

        The schema corresponds to the following:

        ```python
        from pydantic import BaseModel
        from typing import List

        class SchemaComplexityScorer(BaseModel):
            scores: List[int]
        ```

        Returns:
            JSON Schema of the response to enforce.
        """
        return {
            "properties": {
                "scores": {
                    "items": {"type": "integer"},
                    "title": "Scores",
                    "type": "array",
                }
            },
            "required": ["scores"],
            "title": "SchemaComplexityScorer",
            "type": "object",
        }

    def _format_structured_output(
        self, output: str, input: Dict[str, Any]
    ) -> Dict[str, str]:
        """Parses the structured response, which should correspond to a dictionary
        with either `positive`, or `positive` and `negative` keys.

        Args:
            output: The output from the `LLM`.

        Returns:
            Formatted output.
        """
        try:
            return orjson.loads(output)
        except orjson.JSONDecodeError:
            return {"scores": [None] * len(input["instructions"])}

    @override
    def _sample_input(self) -> "ChatType":
        """Returns a sample input to be used in the `print` method.
        Tasks that don't adhere to a format input that returns a map of the type
        str -> str should override this method to return a sample input.
        """
        return self.format_input(
            {
                "instructions": [
                    f"<PLACEHOLDER_{f'GENERATION_{i}'.upper()}>" for i in range(2)
                ],
            }
        )
inputs property

The inputs for the task are the instructions.

outputs property

The outputs for the task are: a list of scores containing the complexity score for each instruction in instructions, and the model_name.

load()

Loads the Jinja2 template.

Source code in src/distilabel/steps/tasks/complexity_scorer.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "complexity-scorer.jinja2"
    )

    self._template = Template(open(_path).read())
format_input(input)

The input is formatted as a ChatType, assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/complexity_scorer.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "user",
            "content": self._template.render(instructions=input["instructions"]),  # type: ignore
        }
    ]
format_output(output, input)

The output is formatted as a list with the score of each instruction.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| output | Union[str, None] | The raw output of the LLM. | required |
| input | Dict[str, Any] | The input to the task. Used for obtaining the number of responses. | required |

Returns

| Type | Description |
| --- | --- |
| Dict[str, Any] | A dict with the key scores containing the scores for each instruction. |

Source code in src/distilabel/steps/tasks/complexity_scorer.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a list with the score of each instruction.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict with the key `scores` containing the scores for each instruction.
    """
    if output is None:
        return {"scores": [None] * len(input["instructions"])}

    if self.use_default_structured_output:
        return self._format_structured_output(output, input)

    scores = []
    score_lines = output.split("\n")
    for i, line in enumerate(score_lines):
        match = _PARSE_SCORE_LINE_REGEX.search(line)
        score = float(match.group(1)) if match else None
        scores.append(score)
        if i == len(input["instructions"]) - 1:
            break
    return {"scores": scores}
get_structured_output()

Creates the JSON schema to be passed to the LLM, to enforce generating a dictionary whose output can be directly parsed as a Python dictionary.

The schema corresponds to the following:

from pydantic import BaseModel
from typing import List

class SchemaComplexityScorer(BaseModel):
    scores: List[int]

Returns

| Type | Description |
| --- | --- |
| Dict[str, Any] | JSON Schema of the response to enforce. |

Source code in src/distilabel/steps/tasks/complexity_scorer.py
@override
def get_structured_output(self) -> Dict[str, Any]:
    """Creates the json schema to be passed to the LLM, to enforce generating
    a dictionary with the output which can be directly parsed as a python dictionary.

    The schema corresponds to the following:

    ```python
    from pydantic import BaseModel
    from typing import List

    class SchemaComplexityScorer(BaseModel):
        scores: List[int]
    ```

    Returns:
        JSON Schema of the response to enforce.
    """
    return {
        "properties": {
            "scores": {
                "items": {"type": "integer"},
                "title": "Scores",
                "type": "array",
            }
        },
        "required": ["scores"],
        "title": "SchemaComplexityScorer",
        "type": "object",
    }
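
To double-check that this hand-written schema matches the Pydantic model shown in the docstring, a quick illustrative sketch (Pydantic v2 assumed):

```python
from typing import List

from pydantic import BaseModel

class SchemaComplexityScorer(BaseModel):
    scores: List[int]

# The generated JSON schema describes the same `scores` array of integers.
schema = SchemaComplexityScorer.model_json_schema()
print(schema["properties"]["scores"])
# {'items': {'type': 'integer'}, 'title': 'Scores', 'type': 'array'}
print(schema["required"])  # ['scores']
```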
_format_structured_output(output, input)

Parses the structured response, which should correspond to a dictionary with a scores key.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| output | str | The output from the LLM. | required |

Returns

| Type | Description |
| --- | --- |
| Dict[str, str] | Formatted output. |

Source code in src/distilabel/steps/tasks/complexity_scorer.py
def _format_structured_output(
    self, output: str, input: Dict[str, Any]
) -> Dict[str, str]:
    """Parses the structured response, which should correspond to a dictionary
    with either `positive`, or `positive` and `negative` keys.

    Args:
        output: The output from the `LLM`.

    Returns:
        Formatted output.
    """
    try:
        return orjson.loads(output)
    except orjson.JSONDecodeError:
        return {"scores": [None] * len(input["instructions"])}
_sample_input()

Returns a sample input to be used in the print method. Tasks that don't adhere to a format input that returns a map of the type str -> str should override this method to return a sample input.

Source code in src/distilabel/steps/tasks/complexity_scorer.py
@override
def _sample_input(self) -> "ChatType":
    """Returns a sample input to be used in the `print` method.
    Tasks that don't adhere to a format input that returns a map of the type
    str -> str should override this method to return a sample input.
    """
    return self.format_input(
        {
            "instructions": [
                f"<PLACEHOLDER_{f'GENERATION_{i}'.upper()}>" for i in range(2)
            ],
        }
    )

EvolInstruct

Bases: Task

Evolve instructions using an LLM.

WizardLM: Empowering Large Language Models to Follow Complex Instructions

Attributes

| Name | Type | Description |
| --- | --- | --- |
| num_evolutions | int | The number of evolutions to be performed. |
| store_evolutions | bool | Whether to store all the evolutions or just the last one. Defaults to False. |
| generate_answers | bool | Whether to generate answers for the evolved instructions. Defaults to False. |
| include_original_instruction | bool | Whether to include the original instruction in the evolved_instructions output column. Defaults to False. |
| mutation_templates | Dict[str, str] | The mutation templates to be used for evolving the instructions. Defaults to the ones provided in the utils.py file. |
| seed | RuntimeParameter[int] | The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42. |

Runtime parameters
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction to evolve.
Output columns
  • evolved_instruction (str): The evolved instruction if store_evolutions=False.
  • evolved_instructions (List[str]): The evolved instructions if store_evolutions=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
  • answer (str): The answer to the evolved instruction if generate_answers=True and store_evolutions=False.
  • answers (List[str]): The answers to the evolved instructions if generate_answers=True and store_evolutions=True.
Categories
  • evol
  • instruction
References

Examples

Evolve an instruction using an LLM:

from distilabel.steps.tasks import EvolInstruct
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
)

evol_instruct.load()

result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]

Keep the iterations of the evolutions:

from distilabel.steps.tasks import EvolInstruct
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
    store_evolutions=True,
)

evol_instruct.load()

result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [
#     {
#         'instruction': 'common instruction',
#         'evolved_instructions': ['initial evolution', 'final evolution'],
#         'model_name': 'model_name'
#     }
# ]

Generate answers for the instructions in a single step:

from distilabel.steps.tasks import EvolInstruct
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct = EvolInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
    generate_answers=True,
)

evol_instruct.load()

result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# result
# [
#     {
#         'instruction': 'common instruction',
#         'evolved_instruction': 'evolved instruction',
#         'answer': 'answer to the instruction',
#         'model_name': 'model_name'
#     }
# ]
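
The mutation_templates attribute can also be overridden with your own prompts. A minimal sketch, assuming each template is a prompt containing the literal <PROMPT> placeholder that gets replaced with the current instruction (the keys and template texts below are invented for the example):

```python
from distilabel.steps.tasks import EvolInstruct
from distilabel.models import InferenceEndpointsLLM

# Invented, simplified templates; each must contain the `<PROMPT>` placeholder.
custom_templates = {
    "CONSTRAINTS": "Rewrite the prompt below adding one more constraint:\n<PROMPT>",
    "CONCRETIZING": "Rewrite the prompt below replacing general concepts with more specific ones:\n<PROMPT>",
}

evol_instruct = EvolInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
    mutation_templates=custom_templates,
)
```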
Citations
@misc{xu2023wizardlmempoweringlargelanguage,
    title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
    author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
    year={2023},
    eprint={2304.12244},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2304.12244},
}
Source code in src/distilabel/steps/tasks/evol_instruct/base.py
class EvolInstruct(Task):
    """Evolve instructions using an `LLM`.

    WizardLM: Empowering Large Language Models to Follow Complex Instructions

    Attributes:
        num_evolutions: The number of evolutions to be performed.
        store_evolutions: Whether to store all the evolutions or just the last one. Defaults
            to `False`.
        generate_answers: Whether to generate answers for the evolved instructions. Defaults
            to `False`.
        include_original_instruction: Whether to include the original instruction in the
            `evolved_instructions` output column. Defaults to `False`.
        mutation_templates: The mutation templates to be used for evolving the instructions.
            Defaults to the ones provided in the `utils.py` file.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Input columns:
        - instruction (`str`): The instruction to evolve.

    Output columns:
        - evolved_instruction (`str`): The evolved instruction if `store_evolutions=False`.
        - evolved_instructions (`List[str]`): The evolved instructions if `store_evolutions=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.
        - answer (`str`): The answer to the evolved instruction if `generate_answers=True`
            and `store_evolutions=False`.
        - answers (`List[str]`): The answers to the evolved instructions if `generate_answers=True`
            and `store_evolutions=True`.

    Categories:
        - evol
        - instruction

    References:
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
        - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)

    Examples:
        Evolve an instruction using an LLM:

        ```python
        from distilabel.steps.tasks import EvolInstruct
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_instruct = EvolInstruct(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_evolutions=2,
        )

        evol_instruct.load()

        result = next(evol_instruct.process([{"instruction": "common instruction"}]))
        # result
        # [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
        ```

        Keep the iterations of the evolutions:

        ```python
        from distilabel.steps.tasks import EvolInstruct
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_instruct = EvolInstruct(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_evolutions=2,
            store_evolutions=True,
        )

        evol_instruct.load()

        result = next(evol_instruct.process([{"instruction": "common instruction"}]))
        # result
        # [
        #     {
        #         'instruction': 'common instruction',
        #         'evolved_instructions': ['initial evolution', 'final evolution'],
        #         'model_name': 'model_name'
        #     }
        # ]
        ```

        Generate answers for the instructions in a single step:

        ```python
        from distilabel.steps.tasks import EvolInstruct
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_instruct = EvolInstruct(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_evolutions=2,
            generate_answers=True,
        )

        evol_instruct.load()

        result = next(evol_instruct.process([{"instruction": "common instruction"}]))
        # result
        # [
        #     {
        #         'instruction': 'common instruction',
        #         'evolved_instruction': 'evolved instruction',
        #         'answer': 'answer to the instruction',
        #         'model_name': 'model_name'
        #     }
        # ]
        ```

    Citations:
        ```
        @misc{xu2023wizardlmempoweringlargelanguage,
            title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
            author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
            year={2023},
            eprint={2304.12244},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2304.12244},
        }
        ```
    """

    num_evolutions: int
    store_evolutions: bool = False
    generate_answers: bool = False
    include_original_instruction: bool = False
    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is being used in order to randomly pick a mutation method, then is nice to seed a random seed.",
    )

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`."""
        return ["instruction"]

    def format_input(self, input: str) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation. And the
        `system_prompt` is added as the first message if it exists."""
        return [{"role": "user", "content": input}]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are the `evolved_instruction/s`, the `answer` if `generate_answers=True`
        and the `model_name`."""
        # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,
        # this could be handled always and the value could be included within the DAG validation when
        # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.
        _outputs = [
            (
                "evolved_instruction"
                if not self.store_evolutions
                else "evolved_instructions"
            ),
            "model_name",
        ]
        if self.generate_answers:
            _outputs.append("answer" if not self.store_evolutions else "answers")
        return _outputs

    @override
    def format_output(  # type: ignore
        self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None
    ) -> Dict[str, Any]:  # type: ignore
        """The output for the task is a dict with: `evolved_instruction` or `evolved_instructions`,
        depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
        `answer` if `generate_answers=True`; and, finally, the `model_name`.

        Args:
            instructions: The instructions to be included within the output.
            answers: The answers to be included within the output if `generate_answers=True`.

        Returns:
            If `store_evolutions=False` and `generate_answers=True` return {"evolved_instruction": ..., "model_name": ..., "answer": ...};
            if `store_evolutions=True` and `generate_answers=True` return {"evolved_instructions": ..., "model_name": ..., "answer": ...};
            if `store_evolutions=False` and `generate_answers=False` return {"evolved_instruction": ..., "model_name": ...};
            if `store_evolutions=True` and `generate_answers=False` return {"evolved_instructions": ..., "model_name": ...}.
        """
        _output = {}
        if not self.store_evolutions:
            _output["evolved_instruction"] = instructions[-1]
        else:
            _output["evolved_instructions"] = instructions

        if self.generate_answers and answers:
            if not self.store_evolutions:
                _output["answer"] = answers[-1]
            else:
                _output["answers"] = answers

        _output["model_name"] = self.llm.model_name
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names i.e. keys of the provided `mutation_templates`."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, instruction: str) -> str:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        enum, and returns the provided instruction within the mutation prompt.

        Args:
            instruction: The instruction to be included within the mutation prompt.

        Returns:
            A random mutation prompt with the provided instruction.
        """
        mutation = np.random.choice(self.mutation_templates_names)
        return self.mutation_templates[mutation].replace("<PROMPT>", instruction)  # type: ignore

    def _evolve_instructions(self, inputs: "StepInput") -> List[List[str]]:
        """Evolves the instructions provided as part of the inputs of the task.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list where each item is a list with either the last evolved instruction if
            `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.
        """

        instructions: List[List[str]] = [[input["instruction"]] for input in inputs]
        statistics: "LLMStatistics" = defaultdict(list)

        for iter_no in range(self.num_evolutions):
            formatted_prompts = []
            for instruction in instructions:
                formatted_prompts.append(self._apply_random_mutation(instruction[-1]))

            formatted_prompts = [
                self.format_input(prompt) for prompt in formatted_prompts
            ]
            responses = self.llm.generate(
                formatted_prompts,
                **self.llm.generation_kwargs,  # type: ignore
            )
            generated_prompts = flatten_responses(
                [response["generations"] for response in responses]
            )
            for response in responses:
                for k, v in response["statistics"].items():
                    statistics[k].append(v[0])

            evolved_instructions = []
            for generated_prompt in generated_prompts:
                generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
                evolved_instructions.append(generated_prompt)

            if self.store_evolutions:
                instructions = [
                    instruction + [evolved_instruction]
                    for instruction, evolved_instruction in zip(
                        instructions, evolved_instructions
                    )
                ]
            else:
                instructions = [
                    [evolved_instruction]
                    for evolved_instruction in evolved_instructions
                ]

            self._logger.info(
                f"🔄 Ran iteration {iter_no} evolving {len(instructions)} instructions!"
            )
        return instructions, dict(statistics)

    def _generate_answers(
        self, evolved_instructions: List[List[str]]
    ) -> Tuple[List[List[str]], "LLMStatistics"]:
        """Generates the answer for the instructions in `instructions`.

        Args:
            evolved_instructions: A list of lists where each item is a list with either the last
                evolved instruction if `store_evolutions=False` or all the evolved instructions
                if `store_evolutions=True`.

        Returns:
            A list of answers for each instruction.
        """
        formatted_instructions = [
            self.format_input(instruction)
            for instructions in evolved_instructions
            for instruction in instructions
        ]

        responses = self.llm.generate(
            formatted_instructions,
            num_generations=1,
            **self.llm.generation_kwargs,  # type: ignore
        )
        generations = [response["generations"] for response in responses]

        statistics: Dict[str, Any] = defaultdict(list)
        for response in responses:
            for k, v in response["statistics"].items():
                statistics[k].append(v[0])

        step = (
            self.num_evolutions
            if not self.include_original_instruction
            else self.num_evolutions + 1
        )

        return [
            flatten_responses(generations[i : i + step])
            for i in range(0, len(responses), step)
        ], dict(statistics)

    @override
    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            A list of Python dictionaries with the outputs of the task.
        """

        evolved_instructions, statistics = self._evolve_instructions(inputs)

        if self.store_evolutions:
            # Remove the input instruction from the `evolved_instructions` list
            from_ = 1 if not self.include_original_instruction else 0
            evolved_instructions = [
                instruction[from_:] for instruction in evolved_instructions
            ]

        if not self.generate_answers:
            for input, instruction in zip(inputs, evolved_instructions):
                input.update(self.format_output(instruction))
                input.update(
                    {
                        "distilabel_metadata": {
                            f"statistics_instruction_{self.name}": statistics
                        }
                    }
                )
            yield inputs

        self._logger.info(
            f"🎉 Finished evolving {len(evolved_instructions)} instructions!"
        )

        if self.generate_answers:
            self._logger.info(
                f"🧠 Generating answers for the {len(evolved_instructions)} evolved instructions!"
            )

            answers, statistics = self._generate_answers(evolved_instructions)

            self._logger.info(
                f"🎉 Finished generating answers for the {len(evolved_instructions)} evolved"
                " instructions!"
            )

            for idx, (input, instruction) in enumerate(
                zip(inputs, evolved_instructions)
            ):
                input.update(self.format_output(instruction, answers[idx]))
                input.update(
                    {
                        "distilabel_metadata": {
                            f"statistics_answer_{self.name}": statistics
                        }
                    }
                )
            yield inputs

    @override
    def _sample_input(self) -> ChatType:
        return self.format_input(
            self._apply_random_mutation("<PLACEHOLDER_INSTRUCTION>")
        )
inputs property

The input for the task is the instruction.

outputs property

The outputs for the task are the evolved_instruction/s, the answer if generate_answers=True, and the model_name.

mutation_templates_names property

Returns the names, i.e. the keys, of the provided mutation_templates.

format_input(input)

The input is formatted as a ChatType, assuming that the instruction is the first interaction from the user within a conversation. The system_prompt is added as the first message if it exists.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
def format_input(self, input: str) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation. And the
    `system_prompt` is added as the first message if it exists."""
    return [{"role": "user", "content": input}]
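To make the shape of that ChatType concrete, here is a minimal usage sketch of what format_input returns, assuming task is an already-loaded EvolInstruct instance and the instruction text is an arbitrary placeholder:

```python
# Hypothetical usage sketch: `task` and the instruction text are placeholders.
chat = task.format_input("Summarize the following article in two sentences.")
# chat == [{"role": "user", "content": "Summarize the following article in two sentences."}]
```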
format_output(instructions, answers=None)

The output for the task is a dict with: evolved_instruction or evolved_instructions, depending on whether store_evolutions is False or True, respectively; answer if generate_answers=True; and, finally, the model_name.

Parameters

Name Type Description Default
instructions Union[str, List[str]]

The instructions to be included within the output.

required
answers Optional[List[str]]

The answers to be included within the output if generate_answers=True.

None

Returns

Type Description
Dict[str, Any]

If store_evolutions=False and generate_answers=True, returns {"evolved_instruction": ..., "model_name": ..., "answer": ...};

Dict[str, Any]

if store_evolutions=True and generate_answers=True, returns {"evolved_instructions": ..., "model_name": ..., "answer": ...};

Dict[str, Any]

if store_evolutions=False and generate_answers=False, returns {"evolved_instruction": ..., "model_name": ...};

Dict[str, Any]

if store_evolutions=True and generate_answers=False, returns {"evolved_instructions": ..., "model_name": ...}.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
@override
def format_output(  # type: ignore
    self, instructions: Union[str, List[str]], answers: Optional[List[str]] = None
) -> Dict[str, Any]:  # type: ignore
    """The output for the task is a dict with: `evolved_instruction` or `evolved_instructions`,
    depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
    `answer` if `generate_answers=True`; and, finally, the `model_name`.

    Args:
        instructions: The instructions to be included within the output.
        answers: The answers to be included within the output if `generate_answers=True`.

    Returns:
        If `store_evolutions=False` and `generate_answers=True` return {"evolved_instruction": ..., "model_name": ..., "answer": ...};
        if `store_evolutions=True` and `generate_answers=True` return {"evolved_instructions": ..., "model_name": ..., "answer": ...};
        if `store_evolutions=False` and `generate_answers=False` return {"evolved_instruction": ..., "model_name": ...};
        if `store_evolutions=True` and `generate_answers=False` return {"evolved_instructions": ..., "model_name": ...}.
    """
    _output = {}
    if not self.store_evolutions:
        _output["evolved_instruction"] = instructions[-1]
    else:
        _output["evolved_instructions"] = instructions

    if self.generate_answers and answers:
        if not self.store_evolutions:
            _output["answer"] = answers[-1]
        else:
            _output["answers"] = answers

    _output["model_name"] = self.llm.model_name
    return _output
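As a quick illustration of the return shapes described above, the following hedged sketch (placeholder strings, with task assumed to be a configured EvolInstruct instance) shows two of the four combinations:

```python
# With store_evolutions=False (and generate_answers=False), only the last evolution is kept.
task.format_output(["first evolution", "second evolution"])
# {"evolved_instruction": "second evolution", "model_name": "<name of the LLM>"}

# With store_evolutions=True and generate_answers=True, the full lists are kept instead.
task.format_output(["first evolution", "second evolution"], ["answer 1", "answer 2"])
# {"evolved_instructions": [...], "answers": [...], "model_name": "<name of the LLM>"}
```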
_apply_random_mutation(instruction)

Applies a random mutation from the ones provided as part of the mutation_templates enum, and returns the provided instruction within the mutation prompt.

Parameters

Name Type Description Default
instruction str

The instruction to be included within the mutation prompt.

required

Returns

Type Description
str

A random mutation prompt with the provided instruction.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
def _apply_random_mutation(self, instruction: str) -> str:
    """Applies a random mutation from the ones provided as part of the `mutation_templates`
    enum, and returns the provided instruction within the mutation prompt.

    Args:
        instruction: The instruction to be included within the mutation prompt.

    Returns:
        A random mutation prompt with the provided instruction.
    """
    mutation = np.random.choice(self.mutation_templates_names)
    return self.mutation_templates[mutation].replace("<PROMPT>", instruction)  # type: ignore
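The substitution itself is a plain string replacement of the <PROMPT> placeholder in the randomly chosen template. A small self-contained sketch of the same idea follows; the template texts are made up for illustration and are not the real entries of MUTATION_TEMPLATES:

```python
import numpy as np

# Hypothetical templates; the real ones are defined in distilabel's MUTATION_TEMPLATES.
mutation_templates = {
    "CONSTRAINTS": "Add one more constraint to the prompt below.\n#Prompt#:\n<PROMPT>",
    "DEEPEN": "Rewrite the prompt below so it requires deeper reasoning.\n#Prompt#:\n<PROMPT>",
}

def apply_random_mutation(instruction: str) -> str:
    # Pick a template name at random and inject the instruction in place of <PROMPT>.
    name = np.random.choice(list(mutation_templates.keys()))
    return mutation_templates[name].replace("<PROMPT>", instruction)

print(apply_random_mutation("Write a poem about the sea."))
```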
_evolve_instructions(inputs)

Evolves the instructions provided as part of the inputs of the task.

Parameters

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Returns

Type Description
List[List[str]]

A list where each item is a list with either the last evolved instruction if store_evolutions=False, or all the evolved instructions if store_evolutions=True.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
def _evolve_instructions(self, inputs: "StepInput") -> List[List[str]]:
    """Evolves the instructions provided as part of the inputs of the task.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Returns:
        A list where each item is a list with either the last evolved instruction if
        `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.
    """

    instructions: List[List[str]] = [[input["instruction"]] for input in inputs]
    statistics: "LLMStatistics" = defaultdict(list)

    for iter_no in range(self.num_evolutions):
        formatted_prompts = []
        for instruction in instructions:
            formatted_prompts.append(self._apply_random_mutation(instruction[-1]))

        formatted_prompts = [
            self.format_input(prompt) for prompt in formatted_prompts
        ]
        responses = self.llm.generate(
            formatted_prompts,
            **self.llm.generation_kwargs,  # type: ignore
        )
        generated_prompts = flatten_responses(
            [response["generations"] for response in responses]
        )
        for response in responses:
            for k, v in response["statistics"].items():
                statistics[k].append(v[0])

        evolved_instructions = []
        for generated_prompt in generated_prompts:
            generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
            evolved_instructions.append(generated_prompt)

        if self.store_evolutions:
            instructions = [
                instruction + [evolved_instruction]
                for instruction, evolved_instruction in zip(
                    instructions, evolved_instructions
                )
            ]
        else:
            instructions = [
                [evolved_instruction]
                for evolved_instruction in evolved_instructions
            ]

        self._logger.info(
            f"🔄 Ran iteration {iter_no} evolving {len(instructions)} instructions!"
        )
    return instructions, dict(statistics)
_generate_answers(evolved_instructions)

Generates the answers for the instructions in evolved_instructions.

Parameters

Name Type Description Default
evolved_instructions List[List[str]]

A list of lists where each item is a list with either the last evolved instruction if store_evolutions=False, or all the evolved instructions if store_evolutions=True.

required

Returns

Type Description
Tuple[List[List[str]], LLMStatistics]

A list of answers for each instruction, together with the generation statistics.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
def _generate_answers(
    self, evolved_instructions: List[List[str]]
) -> Tuple[List[List[str]], "LLMStatistics"]:
    """Generates the answer for the instructions in `instructions`.

    Args:
        evolved_instructions: A list of lists where each item is a list with either the last
            evolved instruction if `store_evolutions=False` or all the evolved instructions
            if `store_evolutions=True`.

    Returns:
        A list of answers for each instruction.
    """
    formatted_instructions = [
        self.format_input(instruction)
        for instructions in evolved_instructions
        for instruction in instructions
    ]

    responses = self.llm.generate(
        formatted_instructions,
        num_generations=1,
        **self.llm.generation_kwargs,  # type: ignore
    )
    generations = [response["generations"] for response in responses]

    statistics: Dict[str, Any] = defaultdict(list)
    for response in responses:
        for k, v in response["statistics"].items():
            statistics[k].append(v[0])

    step = (
        self.num_evolutions
        if not self.include_original_instruction
        else self.num_evolutions + 1
    )

    return [
        flatten_responses(generations[i : i + step])
        for i in range(0, len(responses), step)
    ], dict(statistics)
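The LLM is called once over a flat list of formatted instructions, so the final comprehension regroups the answers per original input using a stride of num_evolutions (plus one when include_original_instruction=True). A small self-contained sketch of that regrouping logic with made-up values:

```python
# Toy illustration of the regrouping at the end of _generate_answers.
num_evolutions = 2
include_original_instruction = False

# One answer per evolved instruction, for 2 inputs with 2 evolutions each.
flat_answers = ["a1-input1", "a2-input1", "a1-input2", "a2-input2"]

step = num_evolutions + 1 if include_original_instruction else num_evolutions
grouped = [flat_answers[i : i + step] for i in range(0, len(flat_answers), step)]
print(grouped)  # [['a1-input1', 'a2-input1'], ['a1-input2', 'a2-input2']]
```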
process(inputs)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters

Name Type Description Default
inputs StepInput

A list of Python dictionaries with the inputs of the task.

required

Yields

Type Description
StepOutput

A list of Python dictionaries with the outputs of the task.

Source code in src/distilabel/steps/tasks/evol_instruct/base.py
@override
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        A list of Python dictionaries with the outputs of the task.
    """

    evolved_instructions, statistics = self._evolve_instructions(inputs)

    if self.store_evolutions:
        # Remove the input instruction from the `evolved_instructions` list
        from_ = 1 if not self.include_original_instruction else 0
        evolved_instructions = [
            instruction[from_:] for instruction in evolved_instructions
        ]

    if not self.generate_answers:
        for input, instruction in zip(inputs, evolved_instructions):
            input.update(self.format_output(instruction))
            input.update(
                {
                    "distilabel_metadata": {
                        f"statistics_instruction_{self.name}": statistics
                    }
                }
            )
        yield inputs

    self._logger.info(
        f"🎉 Finished evolving {len(evolved_instructions)} instructions!"
    )

    if self.generate_answers:
        self._logger.info(
            f"🧠 Generating answers for the {len(evolved_instructions)} evolved instructions!"
        )

        answers, statistics = self._generate_answers(evolved_instructions)

        self._logger.info(
            f"🎉 Finished generating answers for the {len(evolved_instructions)} evolved"
            " instructions!"
        )

        for idx, (input, instruction) in enumerate(
            zip(inputs, evolved_instructions)
        ):
            input.update(self.format_output(instruction, answers[idx]))
            input.update(
                {
                    "distilabel_metadata": {
                        f"statistics_answer_{self.name}": statistics
                    }
                }
            )
        yield inputs
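Putting the pieces above together, here is a hedged end-to-end sketch of running the task with both store_evolutions and generate_answers enabled; the LLM, the instruction and the output values are placeholders, and the column names follow the format_output description above:

```python
from distilabel.steps.tasks import EvolInstruct
from distilabel.models import InferenceEndpointsLLM

# Placeholder LLM; swap in whichever model you actually use.
evol_instruct = EvolInstruct(
    llm=InferenceEndpointsLLM(model_id="mistralai/Mistral-7B-Instruct-v0.2"),
    num_evolutions=2,
    store_evolutions=True,
    generate_answers=True,
)
evol_instruct.load()

result = next(evol_instruct.process([{"instruction": "common instruction"}]))
# Expected shape (actual values depend on the LLM):
# [{'instruction': 'common instruction',
#   'evolved_instructions': ['evolution 1', 'evolution 2'],
#   'answers': ['answer 1', 'answer 2'],
#   'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#   'distilabel_metadata': {...}}]
```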

EvolComplexity

Bases: EvolInstruct

Evolve instructions to make them more complex using an LLM.

EvolComplexity is a task that evolves instructions to make them more complex. It is based on the EvolInstruct task, using slightly different prompts but the exact same evolutionary approach.

Attributes

Name Type Description
num_evolutions int

The number of evolutions to be performed on the instructions.

generate_answers bool

Whether to generate answers for the instructions or not. Defaults to False.

mutation_templates Dict[str, str]

The mutation templates to be used for the generation of the instructions.

min_length RuntimeParameter[int]

Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512.

max_length RuntimeParameter[int]

Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024.

seed RuntimeParameter[int]

The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction to evolve.
Output columns
  • evolved_instruction (str): The evolved instruction.
  • answer (str, optional): The answer to the instruction if generate_answers=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
Categories
  • evol
  • instruction
  • deita
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions: https://arxiv.org/abs/2304.12244

Examples

Evolve an instruction using an LLM:

from distilabel.steps.tasks import EvolComplexity
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_complexity = EvolComplexity(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
)

evol_complexity.load()

result = next(evol_complexity.process([{"instruction": "common instruction"}]))
# result
# [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
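If you also need answers for the evolved instructions, the same task can be configured with generate_answers=True. A hedged sketch of that configuration and the expected output shape (placeholder values):

```python
from distilabel.steps.tasks import EvolComplexity
from distilabel.models import InferenceEndpointsLLM

evol_complexity = EvolComplexity(
    llm=InferenceEndpointsLLM(model_id="mistralai/Mistral-7B-Instruct-v0.2"),
    num_evolutions=2,
    generate_answers=True,
)
evol_complexity.load()

result = next(evol_complexity.process([{"instruction": "common instruction"}]))
# [{'instruction': 'common instruction',
#   'evolved_instruction': 'evolved instruction',
#   'answer': 'answer to the evolved instruction',
#   'model_name': 'mistralai/Mistral-7B-Instruct-v0.2'}]
```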
Citations
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
@misc{xu2023wizardlmempoweringlargelanguage,
    title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
    author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
    year={2023},
    eprint={2304.12244},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2304.12244},
}
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/base.py
class EvolComplexity(EvolInstruct):
    """Evolve instructions to make them more complex using an `LLM`.

    `EvolComplexity` is a task that evolves instructions to make them more complex,
    and it is based in the EvolInstruct task, using slight different prompts, but the
    exact same evolutionary approach.

    Attributes:
        num_instructions: The number of instructions to be generated.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
        - `seed`: The number of evolutions to be run.

    Input columns:
        - instruction (`str`): The instruction to evolve.

    Output columns:
        - evolved_instruction (`str`): The evolved instruction.
        - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.

    Categories:
        - evol
        - instruction
        - deita

    References:
        - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)

    Examples:
        Evolve an instruction using an LLM:

        ```python
        from distilabel.steps.tasks import EvolComplexity
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_complexity = EvolComplexity(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_evolutions=2,
        )

        evol_complexity.load()

        result = next(evol_complexity.process([{"instruction": "common instruction"}]))
        # result
        # [{'instruction': 'common instruction', 'evolved_instruction': 'evolved instruction', 'model_name': 'model_name'}]
        ```

    Citations:
        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```

        ```
        @misc{xu2023wizardlmempoweringlargelanguage,
            title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
            author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
            year={2023},
            eprint={2304.12244},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2304.12244},
        }
        ```
    """

    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES

EvolComplexityGenerator

Bases: EvolInstructGenerator

Generate evolved instructions with increased complexity using an LLM.

EvolComplexityGenerator is a generation task that evolves instructions to make them more complex. It is based on the EvolInstruct task, using slightly different prompts but the exact same evolutionary approach.

Attributes

Name Type Description
num_instructions int

The number of instructions to be generated.

generate_answers bool

Whether to generate answers for the instructions or not. Defaults to False.

mutation_templates Dict[str, str]

The mutation templates to be used for the generation of the instructions.

min_length RuntimeParameter[int]

Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512.

max_length RuntimeParameter[int]

Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024.

seed RuntimeParameter[int]

The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Output columns
  • instruction (str): The evolved instruction.
  • answer (str, optional): The answer to the instruction if generate_answers=True.
  • model_name (str): The name of the LLM used to evolve the instructions.
Categories
  • evol
  • instruction
  • generation
  • deita
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions: https://arxiv.org/abs/2304.12244

Examples

Generate evolved instructions without initial instructions:

from distilabel.steps.tasks import EvolComplexityGenerator
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_complexity_generator = EvolComplexityGenerator(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_instructions=2,
)

evol_complexity_generator.load()

result = next(evol_complexity_generator.process())
# result
# [{'instruction': 'generated instruction', 'model_name': 'test'}]
Citations
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
@misc{xu2023wizardlmempoweringlargelanguage,
    title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
    author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
    year={2023},
    eprint={2304.12244},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2304.12244},
}
Source code in src/distilabel/steps/tasks/evol_instruct/evol_complexity/generator.py
class EvolComplexityGenerator(EvolInstructGenerator):
    """Generate evolved instructions with increased complexity using an `LLM`.

    `EvolComplexityGenerator` is a generation task that evolves instructions to make
    them more complex, and it is based in the EvolInstruct task, but using slight different
    prompts, but the exact same evolutionary approach.

    Attributes:
        num_instructions: The number of instructions to be generated.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
        - `seed`: The number of evolutions to be run.

    Output columns:
        - instruction (`str`): The evolved instruction.
        - answer (`str`, optional): The answer to the instruction if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to evolve the instructions.

    Categories:
        - evol
        - instruction
        - generation
        - deita

    References:
        - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)

    Examples:
        Generate evolved instructions without initial instructions:

        ```python
        from distilabel.steps.tasks import EvolComplexityGenerator
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_complexity_generator = EvolComplexityGenerator(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_instructions=2,
        )

        evol_complexity_generator.load()

        result = next(scorer.process())
        # result
        # [{'instruction': 'generated instruction', 'model_name': 'test'}]
        ```

    Citations:
        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```

        ```
        @misc{xu2023wizardlmempoweringlargelanguage,
            title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
            author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
            year={2023},
            eprint={2304.12244},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2304.12244},
        }
        ```
    """

    mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES

EvolInstructGenerator

Bases: GeneratorTask

Generate evolved instructions using an LLM.

WizardLM: Empowering Large Language Models to Follow Complex Instructions

Attributes

Name Type Description
num_instructions int

The number of instructions to be generated.

generate_answers bool

Whether to generate answers for the instructions or not. Defaults to False.

mutation_templates Dict[str, str]

The mutation templates to be used for the generation of the instructions.

min_length RuntimeParameter[int]

Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid. Defaults to 512.

max_length RuntimeParameter[int]

Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid. Defaults to 1024.

seed RuntimeParameter[int]

The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • min_length: Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.
  • max_length: Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Output columns
  • instruction (str): The generated instruction if generate_answers=False.
  • answer (str): The generated answer if generate_answers=True.
  • instructions (List[str]): The generated instructions if generate_answers=True.
  • model_name (str): The name of the LLM used to generate and evolve the instructions.
Categories
  • evol
  • instruction
  • generation
References
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions: https://arxiv.org/abs/2304.12244
  • GitHub: h2oai/h2o-wizardlm: https://github.com/h2oai/h2o-wizardlm

Examples

Generate evolved instructions without initial instructions:

from distilabel.steps.tasks import EvolInstructGenerator
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_instruct_generator = EvolInstructGenerator(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_instructions=2,
)

evol_instruct_generator.load()

result = next(evol_instruct_generator.process())
# result
# [{'instruction': 'generated instruction', 'model_name': 'test'}]
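min_length, max_length and seed are runtime parameters, so besides being set at construction time they can also be supplied when a pipeline is run. A hedged sketch, assuming a pipeline in which the generator is the only step and with illustrative parameter values:

```python
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EvolInstructGenerator
from distilabel.models import InferenceEndpointsLLM

with Pipeline(name="evol-instruct-generation") as pipeline:
    generator = EvolInstructGenerator(
        name="evol_instruct_generator",
        llm=InferenceEndpointsLLM(model_id="mistralai/Mistral-7B-Instruct-v0.2"),
        num_instructions=5,
    )

# Runtime parameters are passed per step name when the pipeline is run.
distiset = pipeline.run(
    parameters={
        "evol_instruct_generator": {"min_length": 256, "max_length": 2048, "seed": 1234}
    }
)
```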
Citations
@misc{xu2023wizardlmempoweringlargelanguage,
    title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
    author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
    year={2023},
    eprint={2304.12244},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2304.12244},
}
Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
class EvolInstructGenerator(GeneratorTask):
    """Generate evolved instructions using an `LLM`.

    WizardLM: Empowering Large Language Models to Follow Complex Instructions

    Attributes:
        num_instructions: The number of instructions to be generated.
        generate_answers: Whether to generate answers for the instructions or not. Defaults
            to `False`.
        mutation_templates: The mutation templates to be used for the generation of the
            instructions.
        min_length: Defines the length (in bytes) that the generated instruction needs to
            be higher than, to be considered valid. Defaults to `512`.
        max_length: Defines the length (in bytes) that the generated instruction needs to
            be lower than, to be considered valid. Defaults to `1024`.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `min_length`: Defines the length (in bytes) that the generated instruction needs
            to be higher than, to be considered valid.
        - `max_length`: Defines the length (in bytes) that the generated instruction needs
            to be lower than, to be considered valid.
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Output columns:
        - instruction (`str`): The generated instruction if `generate_answers=False`.
        - answer (`str`): The generated answer if `generate_answers=True`.
        - instructions (`List[str]`): The generated instructions if `generate_answers=True`.
        - model_name (`str`): The name of the LLM used to generate and evolve the instructions.

    Categories:
        - evol
        - instruction
        - generation

    References:
        - [WizardLM: Empowering Large Language Models to Follow Complex Instructions](https://arxiv.org/abs/2304.12244)
        - [GitHub: h2oai/h2o-wizardlm](https://github.com/h2oai/h2o-wizardlm)

    Examples:
        Generate evolved instructions without initial instructions:

        ```python
        from distilabel.steps.tasks import EvolInstructGenerator
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_instruct_generator = EvolInstructGenerator(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_instructions=2,
        )

        evol_instruct_generator.load()

        result = next(scorer.process())
        # result
        # [{'instruction': 'generated instruction', 'model_name': 'test'}]
        ```

    Citations:
        ```
        @misc{xu2023wizardlmempoweringlargelanguage,
            title={WizardLM: Empowering Large Language Models to Follow Complex Instructions},
            author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
            year={2023},
            eprint={2304.12244},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2304.12244},
        }
        ```
    """

    num_instructions: int
    generate_answers: bool = False
    mutation_templates: Dict[str, str] = GENERATION_MUTATION_TEMPLATES

    min_length: RuntimeParameter[int] = Field(
        default=512,
        description="Defines the length (in bytes) that the generated instruction needs to be higher than, to be considered valid.",
    )
    max_length: RuntimeParameter[int] = Field(
        default=1024,
        description="Defines the length (in bytes) that the generated instruction needs to be lower than, to be considered valid.",
    )

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is being used in order to randomly pick a mutation method, then is nice to seed a random seed.",
    )
    _seed_texts: Optional[List[str]] = PrivateAttr(default_factory=list)
    _prompts: Optional[List[str]] = PrivateAttr(default_factory=list)

    def _generate_seed_texts(self) -> List[str]:
        """Generates a list of seed texts to be used as part of the starting prompts for the task.

        It will use the `FRESH_START` mutation template, as it needs to generate text from scratch; and
        a list of English words will be used to generate the seed texts that will be provided to the
        mutation method and included within the prompt.

        Returns:
            A list of seed texts to be used as part of the starting prompts for the task.
        """
        seed_texts = []
        for _ in range(self.num_instructions * 10):
            num_words = np.random.choice([1, 2, 3, 4])
            seed_texts.append(
                self.mutation_templates["FRESH_START"].replace(  # type: ignore
                    "<PROMPT>",
                    ", ".join(
                        [
                            np.random.choice(self._english_nouns).strip()
                            for _ in range(num_words)
                        ]
                    ),
                )
            )
        return seed_texts

    @override
    def model_post_init(self, __context: Any) -> None:
        """Override this method to perform additional initialization after `__init__` and `model_construct`.
        This is useful if you want to do some validation that requires the entire model to be initialized.
        """
        super().model_post_init(__context)

        np.random.seed(self.seed)

        self._seed_texts = self._generate_seed_texts()
        self._prompts = [
            np.random.choice(self._seed_texts) for _ in range(self.num_instructions)
        ]

    @cached_property
    def _english_nouns(self) -> List[str]:
        """A list of English nouns to be used as part of the starting prompts for the task.

        References:
            - https://github.com/h2oai/h2o-wizardlm
        """
        _path = str(
            importlib_resources.files("distilabel")
            / "steps/tasks/evol_instruct/english_nouns.txt"
        )
        with open(_path, mode="r") as f:
            return [line.strip() for line in f.readlines()]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are the `instruction`, the `answer` if `generate_answers=True`
        and the `model_name`."""
        _outputs = ["instruction", "model_name"]
        if self.generate_answers:
            _outputs.append("answer")
        return _outputs

    def format_output(  # type: ignore
        self, instruction: str, answer: Optional[str] = None
    ) -> Dict[str, Any]:
        """The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;
        and, finally, the `model_name`.

        Args:
            instruction: The instruction to be included within the output.
            answer: The answer to be included within the output if `generate_answers=True`.

        Returns:
            If `generate_answers=True` return {"instruction": ..., "answer": ..., "model_name": ...};
            if `generate_answers=False` return {"instruction": ..., "model_name": ...};
        """
        _output = {
            "instruction": instruction,
            "model_name": self.llm.model_name,
        }
        if self.generate_answers and answer is not None:
            _output["answer"] = answer
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names i.e. keys of the provided `mutation_templates`."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, iter_no: int) -> List["ChatType"]:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        enum, and returns the provided instruction within the mutation prompt.

        Args:
            iter_no: The iteration number to be used to check whether the iteration is the
                first one i.e. FRESH_START, or not.

        Returns:
            A random mutation prompt with the provided instruction formatted as an OpenAI conversation.
        """
        prompts = []
        for idx in range(self.num_instructions):
            if (
                iter_no == 0
                or "Write one question or request containing" in self._prompts[idx]  # type: ignore
            ):
                mutation = "FRESH_START"
            else:
                mutation = np.random.choice(self.mutation_templates_names)
                if mutation == "FRESH_START":
                    self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore

            prompt_with_template = (
                self.mutation_templates[mutation].replace(  # type: ignore
                    "<PROMPT>",
                    self._prompts[idx],  # type: ignore
                )  # type: ignore
                if iter_no != 0
                else self._prompts[idx]  # type: ignore
            )
            prompts.append([{"role": "user", "content": prompt_with_template}])
        return prompts

    def _generate_answers(
        self, instructions: List[List[str]]
    ) -> Tuple[List[str], "LLMStatistics"]:
        """Generates the answer for the last instruction in `instructions`.

        Args:
            instructions: A list of lists where each item is a list with either the last
                evolved instruction if `store_evolutions=False` or all the evolved instructions
                if `store_evolutions=True`.

        Returns:
            A list of answers for the last instruction in `instructions`.
        """
        # TODO: update to generate answers for all the instructions
        _formatted_instructions = [
            [{"role": "user", "content": instruction[-1]}]
            for instruction in instructions
        ]
        responses = self.llm.generate(
            _formatted_instructions,
            **self.llm.generation_kwargs,  # type: ignore
        )
        statistics: Dict[str, Any] = defaultdict(list)
        for response in responses:
            for k, v in response["statistics"].items():
                statistics[k].append(v[0])

        return flatten_responses(
            [response["generations"] for response in responses]
        ), dict(statistics)

    @override
    def process(self, offset: int = 0) -> "GeneratorStepOutput":  # NOQA: C901, type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            offset: The offset to start the generation from. Defaults to 0.

        Yields:
            A list of Python dictionaries with the outputs of the task, and a boolean
            flag indicating whether the task has finished or not i.e. is the last batch.
        """
        instructions = []
        mutation_no = 0

        # TODO: update to take into account `offset`
        iter_no = 0
        while len(instructions) < self.num_instructions:
            prompts = self._apply_random_mutation(iter_no=iter_no)

            # TODO: Update the function to extract from the dict
            responses = self.llm.generate(prompts, **self.llm.generation_kwargs)  # type: ignore

            generated_prompts = flatten_responses(
                [response["generations"] for response in responses]
            )
            statistics: "LLMStatistics" = defaultdict(list)
            for response in responses:
                for k, v in response["statistics"].items():
                    statistics[k].append(v[0])

            for idx, generated_prompt in enumerate(generated_prompts):
                generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
                if self.max_length >= len(generated_prompt) >= self.min_length:  # type: ignore
                    instructions.append(generated_prompt)
                    self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore
                else:
                    self._prompts[idx] = generated_prompt  # type: ignore

            self._logger.info(
                f"🔄 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!"
            )
            iter_no += 1

            if len(instructions) > self.num_instructions:
                instructions = instructions[: self.num_instructions]
            if len(instructions) > mutation_no:
                mutation_no = len(instructions) - mutation_no

            if not self.generate_answers and len(instructions[-mutation_no:]) > 0:
                formatted_generations = []
                for mutated_instruction in instructions[-mutation_no:]:
                    mutated_instruction = self.format_output(mutated_instruction)
                    mutated_instruction["distilabel_metadata"] = {
                        f"statistics_instruction_{self.name}": dict(statistics)
                    }
                    formatted_generations.append(mutated_instruction)
                yield (
                    formatted_generations,
                    len(instructions) >= self.num_instructions,
                )

        self._logger.info(f"🎉 Finished evolving {len(instructions)} instructions!")

        if self.generate_answers:
            self._logger.info(
                f"🧠 Generating answers for the {len(instructions)} evolved instructions!"
            )

            answers, statistics = self._generate_answers(instructions)

            self._logger.info(
                f"🎉 Finished generating answers for the {len(instructions)} evolved instructions!"
            )

            formatted_outputs = []
            for instruction, answer in zip(instructions, answers):
                formatted_output = self.format_output(instruction, answer)
                formatted_output["distilabel_metadata"] = {
                    f"statistics_answer_{self.name}": dict(statistics)
                }
                formatted_outputs.append(formatted_output)

            yield (
                formatted_outputs,
                True,
            )

    @override
    def _sample_input(self) -> "ChatType":
        return self._apply_random_mutation(iter_no=0)[0]
_english_nouns cached property

A list of English nouns to be used as part of the starting prompts for the task.

References
  • https://github.com/h2oai/h2o-wizardlm
outputs property

The outputs for the task are the instruction, the answer if generate_answers=True, and the model_name.

mutation_templates_names property

Returns the names, i.e. the keys, of the provided mutation_templates.

_generate_seed_texts()

Generates a list of seed texts to be used as part of the starting prompts for the task.

It will use the FRESH_START mutation template, as it needs to generate text from scratch, and a list of English words will be used to generate the seed texts that are provided to the mutation method and included within the prompt.

Returns

Type Description
List[str]

A list of seed texts to be used as part of the starting prompts for the task.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
def _generate_seed_texts(self) -> List[str]:
    """Generates a list of seed texts to be used as part of the starting prompts for the task.

    It will use the `FRESH_START` mutation template, as it needs to generate text from scratch; and
    a list of English words will be used to generate the seed texts that will be provided to the
    mutation method and included within the prompt.

    Returns:
        A list of seed texts to be used as part of the starting prompts for the task.
    """
    seed_texts = []
    for _ in range(self.num_instructions * 10):
        num_words = np.random.choice([1, 2, 3, 4])
        seed_texts.append(
            self.mutation_templates["FRESH_START"].replace(  # type: ignore
                "<PROMPT>",
                ", ".join(
                    [
                        np.random.choice(self._english_nouns).strip()
                        for _ in range(num_words)
                    ]
                ),
            )
        )
    return seed_texts
model_post_init(__context)

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
@override
def model_post_init(self, __context: Any) -> None:
    """Override this method to perform additional initialization after `__init__` and `model_construct`.
    This is useful if you want to do some validation that requires the entire model to be initialized.
    """
    super().model_post_init(__context)

    np.random.seed(self.seed)

    self._seed_texts = self._generate_seed_texts()
    self._prompts = [
        np.random.choice(self._seed_texts) for _ in range(self.num_instructions)
    ]
format_output(instruction, answer=None)

The output for the task is a dict with: instruction; answer if generate_answers=True; and, finally, the model_name.

Parameters

Name Type Description Default
instruction str

The instruction to be included within the output.

required
answer Optional[str]

The answer to be included within the output if generate_answers=True.

None

Returns

Type Description
Dict[str, Any]

If generate_answers=True, returns {"instruction": ..., "answer": ..., "model_name": ...};

Dict[str, Any]

if generate_answers=False, returns {"instruction": ..., "model_name": ...}.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
def format_output(  # type: ignore
    self, instruction: str, answer: Optional[str] = None
) -> Dict[str, Any]:
    """The output for the task is a dict with: `instruction`; `answer` if `generate_answers=True`;
    and, finally, the `model_name`.

    Args:
        instruction: The instruction to be included within the output.
        answer: The answer to be included within the output if `generate_answers=True`.

    Returns:
        If `generate_answers=True` return {"instruction": ..., "answer": ..., "model_name": ...};
        if `generate_answers=False` return {"instruction": ..., "model_name": ...};
    """
    _output = {
        "instruction": instruction,
        "model_name": self.llm.model_name,
    }
    if self.generate_answers and answer is not None:
        _output["answer"] = answer
    return _output
_apply_random_mutation(iter_no)

Applies a random mutation from the ones provided as part of the mutation_templates enum, and returns the provided instruction within the mutation prompt.

Parameters

Name Type Description Default
iter_no int

The iteration number, used to check whether the iteration is the first one, i.e. FRESH_START, or not.

required

Returns

Type Description
List[ChatType]

A random mutation prompt with the provided instruction, formatted as an OpenAI conversation.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
def _apply_random_mutation(self, iter_no: int) -> List["ChatType"]:
    """Applies a random mutation from the ones provided as part of the `mutation_templates`
    enum, and returns the provided instruction within the mutation prompt.

    Args:
        iter_no: The iteration number to be used to check whether the iteration is the
            first one i.e. FRESH_START, or not.

    Returns:
        A random mutation prompt with the provided instruction formatted as an OpenAI conversation.
    """
    prompts = []
    for idx in range(self.num_instructions):
        if (
            iter_no == 0
            or "Write one question or request containing" in self._prompts[idx]  # type: ignore
        ):
            mutation = "FRESH_START"
        else:
            mutation = np.random.choice(self.mutation_templates_names)
            if mutation == "FRESH_START":
                self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore

        prompt_with_template = (
            self.mutation_templates[mutation].replace(  # type: ignore
                "<PROMPT>",
                self._prompts[idx],  # type: ignore
            )  # type: ignore
            if iter_no != 0
            else self._prompts[idx]  # type: ignore
        )
        prompts.append([{"role": "user", "content": prompt_with_template}])
    return prompts
_generate_answers(instructions)

Generates the answer for the last instruction in instructions.

Parameters

Name Type Description Default
instructions List[List[str]]

A list of lists where each item is a list with either the last evolved instruction if store_evolutions=False or all the evolved instructions if store_evolutions=True.

required

Returns

Type Description
Tuple[List[str], LLMStatistics]

A list of answers for the last instruction in instructions, together with the generation statistics.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
def _generate_answers(
    self, instructions: List[List[str]]
) -> Tuple[List[str], "LLMStatistics"]:
    """Generates the answer for the last instruction in `instructions`.

    Args:
        instructions: A list of lists where each item is a list with either the last
            evolved instruction if `store_evolutions=False` or all the evolved instructions
            if `store_evolutions=True`.

    Returns:
        A list of answers for the last instruction in `instructions`.
    """
    # TODO: update to generate answers for all the instructions
    _formatted_instructions = [
        [{"role": "user", "content": instruction[-1]}]
        for instruction in instructions
    ]
    responses = self.llm.generate(
        _formatted_instructions,
        **self.llm.generation_kwargs,  # type: ignore
    )
    statistics: Dict[str, Any] = defaultdict(list)
    for response in responses:
        for k, v in response["statistics"].items():
            statistics[k].append(v[0])

    return flatten_responses(
        [response["generations"] for response in responses]
    ), dict(statistics)
process(offset=0)

Processes the inputs of the task and generates the outputs using the LLM.

Parameters

Name Type Description Default
offset int

The offset to start the generation from. Defaults to 0.

0

Yields

Type Description
GeneratorStepOutput

A list of Python dictionaries with the outputs of the task, and a boolean flag indicating whether the task has finished or not, i.e. whether it is the last batch.

Source code in src/distilabel/steps/tasks/evol_instruct/generator.py
@override
def process(self, offset: int = 0) -> "GeneratorStepOutput":  # NOQA: C901, type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        offset: The offset to start the generation from. Defaults to 0.

    Yields:
        A list of Python dictionaries with the outputs of the task, and a boolean
        flag indicating whether the task has finished or not i.e. is the last batch.
    """
    instructions = []
    mutation_no = 0

    # TODO: update to take into account `offset`
    iter_no = 0
    while len(instructions) < self.num_instructions:
        prompts = self._apply_random_mutation(iter_no=iter_no)

        # TODO: Update the function to extract from the dict
        responses = self.llm.generate(prompts, **self.llm.generation_kwargs)  # type: ignore

        generated_prompts = flatten_responses(
            [response["generations"] for response in responses]
        )
        statistics: "LLMStatistics" = defaultdict(list)
        for response in responses:
            for k, v in response["statistics"].items():
                statistics[k].append(v[0])

        for idx, generated_prompt in enumerate(generated_prompts):
            generated_prompt = generated_prompt.split("Prompt#:")[-1].strip()
            if self.max_length >= len(generated_prompt) >= self.min_length:  # type: ignore
                instructions.append(generated_prompt)
                self._prompts[idx] = np.random.choice(self._seed_texts)  # type: ignore
            else:
                self._prompts[idx] = generated_prompt  # type: ignore

        self._logger.info(
            f"🔄 Ran iteration {iter_no} with {len(instructions)} instructions already evolved!"
        )
        iter_no += 1

        if len(instructions) > self.num_instructions:
            instructions = instructions[: self.num_instructions]
        if len(instructions) > mutation_no:
            mutation_no = len(instructions) - mutation_no

        if not self.generate_answers and len(instructions[-mutation_no:]) > 0:
            formatted_generations = []
            for mutated_instruction in instructions[-mutation_no:]:
                mutated_instruction = self.format_output(mutated_instruction)
                mutated_instruction["distilabel_metadata"] = {
                    f"statistics_instruction_{self.name}": dict(statistics)
                }
                formatted_generations.append(mutated_instruction)
            yield (
                formatted_generations,
                len(instructions) >= self.num_instructions,
            )

    self._logger.info(f"🎉 Finished evolving {len(instructions)} instructions!")

    if self.generate_answers:
        self._logger.info(
            f"🧠 Generating answers for the {len(instructions)} evolved instructions!"
        )

        answers, statistics = self._generate_answers(instructions)

        self._logger.info(
            f"🎉 Finished generating answers for the {len(instructions)} evolved instructions!"
        )

        formatted_outputs = []
        for instruction, answer in zip(instructions, answers):
            formatted_output = self.format_output(instruction, answer)
            formatted_output["distilabel_metadata"] = {
                f"statistics_answer_{self.name}": dict(statistics)
            }
            formatted_outputs.append(formatted_output)

        yield (
            formatted_outputs,
            True,
        )

EvolQuality

Bases: Task

Evolve the quality of the responses using an LLM.

The EvolQuality task is used to evolve the quality of the responses given a prompt, by generating a new response with a language model. This step implements the evolution quality task from the paper "What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning".

Attributes

Name Type Description
num_evolutions int

The number of evolutions to be performed on the responses.

store_evolutions bool

Whether to store all the evolved responses or just the last one. Defaults to False.

include_original_response bool

Whether to include the original response within the evolved responses. Defaults to False.

mutation_templates Dict[str, str]

The mutation templates to be used to evolve the responses.

seed RuntimeParameter[int]

The seed to be set for numpy in order to randomly pick a mutation method. Defaults to 42.

Runtime parameters
  • seed: The seed to be set for numpy in order to randomly pick a mutation method.
Input columns
  • instruction (str): The instruction that was used to generate the responses.
  • response (str): The response to be rewritten.
Output columns
  • evolved_response (str): The evolved response if store_evolutions=False.
  • evolved_responses (List[str]): The evolved responses if store_evolutions=True.
  • model_name (str): The name of the LLM used to evolve the responses.
Categories
  • evol
  • response
  • deita
References
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning: https://arxiv.org/abs/2312.15685

Examples

Evolve the quality of the responses given a prompt:

from distilabel.steps.tasks import EvolQuality
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
evol_quality = EvolQuality(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_evolutions=2,
)

evol_quality.load()

result = next(
    evol_quality.process(
        [
            {"instruction": "common instruction", "response": "a response"},
        ]
    )
)
# result
# [
#     {
#         'instruction': 'common instruction',
#         'response': 'a response',
#         'evolved_response': 'evolved response',
#         'model_name': '"mistralai/Mistral-7B-Instruct-v0.2"'
#     }
# ]
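When store_evolutions=True the task keeps every intermediate rewrite instead of only the last one. A hedged sketch of that configuration and the expected column (placeholder values):

```python
from distilabel.steps.tasks import EvolQuality
from distilabel.models import InferenceEndpointsLLM

evol_quality = EvolQuality(
    llm=InferenceEndpointsLLM(model_id="mistralai/Mistral-7B-Instruct-v0.2"),
    num_evolutions=2,
    store_evolutions=True,
)
evol_quality.load()

result = next(
    evol_quality.process([{"instruction": "common instruction", "response": "a response"}])
)
# [{'instruction': 'common instruction',
#   'response': 'a response',
#   'evolved_responses': ['first rewrite', 'second rewrite'],
#   'model_name': 'mistralai/Mistral-7B-Instruct-v0.2'}]
```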
Citations
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
Source code in src/distilabel/steps/tasks/evol_quality/base.py
class EvolQuality(Task):
    """Evolve the quality of the responses using an `LLM`.

    `EvolQuality` task is used to evolve the quality of the responses given a prompt,
    by generating a new response with a language model. This step implements the evolution
    quality task from the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
    Automatic Data Selection in Instruction Tuning'.

    Attributes:
        num_evolutions: The number of evolutions to be performed on the responses.
        store_evolutions: Whether to store all the evolved responses or just the last one.
            Defaults to `False`.
        include_original_response: Whether to include the original response within the evolved
            responses. Defaults to `False`.
        mutation_templates: The mutation templates to be used to evolve the responses.
        seed: The seed to be set for `numpy` in order to randomly pick a mutation method.
            Defaults to `42`.

    Runtime parameters:
        - `seed`: The seed to be set for `numpy` in order to randomly pick a mutation method.

    Input columns:
        - instruction (`str`): The instruction that was used to generate the `responses`.
        - response (`str`): The responses to be rewritten.

    Output columns:
        - evolved_response (`str`): The evolved response if `store_evolutions=False`.
        - evolved_responses (`List[str]`): The evolved responses if `store_evolutions=True`.
        - model_name (`str`): The name of the LLM used to evolve the responses.

    Categories:
        - evol
        - response
        - deita

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)

    Examples:
        Evolve the quality of the responses given a prompt:

        ```python
        from distilabel.steps.tasks import EvolQuality
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        evol_quality = EvolQuality(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_evolutions=2,
        )

        evol_quality.load()

        result = next(
            evol_quality.process(
                [
                    {"instruction": "common instruction", "response": "a response"},
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'common instruction',
        #         'response': 'a response',
        #         'evolved_response': 'evolved response',
        #         'model_name': '"mistralai/Mistral-7B-Instruct-v0.2"'
        #     }
        # ]
        ```

    Citations:
        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```
    """

    num_evolutions: int
    store_evolutions: bool = False
    include_original_response: bool = False
    mutation_templates: Dict[str, str] = MUTATION_TEMPLATES

    seed: RuntimeParameter[int] = Field(
        default=42,
        description="As `numpy` is being used in order to randomly pick a mutation method, then is nice to set a random seed.",
    )

    @override
    def model_post_init(self, __context: Any) -> None:
        """Override this method to perform additional initialization after `__init__` and `model_construct`.
        This is useful if you want to do some validation that requires the entire model to be initialized.
        """
        super().model_post_init(__context)

    @property
    def inputs(self) -> List[str]:
        """The input for the task are the `instruction` and `response`."""
        return ["instruction", "response"]

    def format_input(self, input: str) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation. And the
        `system_prompt` is added as the first message if it exists."""
        return [{"role": "user", "content": input}]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are the `evolved_response/s` and the `model_name`."""
        # TODO: having to define a `model_name` column every time as the `Task.outputs` is not ideal,
        # this could be handled always and the value could be included within the DAG validation when
        # a `Task` is used, since all the `Task` subclasses will have an `llm` with a `model_name` attr.
        _outputs = [
            ("evolved_response" if not self.store_evolutions else "evolved_responses"),
            "model_name",
        ]

        return _outputs

    def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]:  # type: ignore
        """The output for the task is a dict with: `evolved_response` or `evolved_responses`,
        depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
        and, finally, the `model_name`.

        Args:
            responses: The responses to be included within the output.

        Returns:
            if `store_evolutions=False` return {"evolved_response": ..., "model_name": ...};
            if `store_evolutions=True` return {"evolved_responses": ..., "model_name": ...}.
        """
        _output = {}

        if not self.store_evolutions:
            _output["evolved_response"] = responses[-1]
        else:
            _output["evolved_responses"] = responses

        _output["model_name"] = self.llm.model_name
        return _output

    @property
    def mutation_templates_names(self) -> List[str]:
        """Returns the names i.e. keys of the provided `mutation_templates` enum."""
        return list(self.mutation_templates.keys())

    def _apply_random_mutation(self, instruction: str, response: str) -> str:
        """Applies a random mutation from the ones provided as part of the `mutation_templates`
        enum, and returns the provided instruction within the mutation prompt.

        Args:
            instruction: The instruction to be included within the mutation prompt.

        Returns:
            A random mutation prompt with the provided instruction.
        """
        mutation = np.random.choice(self.mutation_templates_names)
        return (
            self.mutation_templates[mutation]
            .replace("<PROMPT>", instruction)
            .replace("<RESPONSE>", response)
        )

    def _evolve_reponses(
        self, inputs: "StepInput"
    ) -> Tuple[List[List[str]], Dict[str, Any]]:
        """Evolves the instructions provided as part of the inputs of the task.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list where each item is a list with either the last evolved instruction if
            `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.
        """
        np.random.seed(self.seed)
        instructions: List[List[str]] = [[input["instruction"]] for input in inputs]
        responses: List[List[str]] = [[input["response"]] for input in inputs]
        statistics: Dict[str, Any] = defaultdict(list)

        for iter_no in range(self.num_evolutions):
            formatted_prompts = []
            for instruction, response in zip(instructions, responses):
                formatted_prompts.append(
                    self._apply_random_mutation(instruction[-1], response[-1])
                )

            formatted_prompts = [
                self.format_input(prompt) for prompt in formatted_prompts
            ]

            generated_responses = self.llm.generate(
                formatted_prompts,
                **self.llm.generation_kwargs,  # type: ignore
            )
            for response in generated_responses:
                for k, v in response["statistics"].items():
                    statistics[k].append(v[0])

            if self.store_evolutions:
                responses = [
                    response + [evolved_response["generations"][0]]
                    for response, evolved_response in zip(
                        responses, generated_responses
                    )
                ]
            else:
                responses = [
                    [evolved_response["generations"][0]]
                    for evolved_response in generated_responses
                ]

            self._logger.info(
                f"🔄 Ran iteration {iter_no} evolving {len(responses)} responses!"
            )

        return responses, dict(statistics)

    @override
    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Processes the inputs of the task and generates the outputs using the LLM.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Returns:
            A list of Python dictionaries with the outputs of the task.
        """

        responses, statistics = self._evolve_reponses(inputs)

        if self.store_evolutions:
            # Remove the input instruction from the `evolved_responses` list
            from_ = 1 if not self.include_original_response else 0
            responses = [response[from_:] for response in responses]

        for input, response in zip(inputs, responses):
            input.update(self.format_output(response))
            input.update(
                {"distilabel_metadata": {f"statistics_{self.name}": statistics}}
            )
        yield inputs

        self._logger.info(f"🎉 Finished evolving {len(responses)} instructions!")

    @override
    def _sample_input(self) -> ChatType:
        return self.format_input("<PLACEHOLDER_INSTRUCTION>")
inputs property

任务的输入是 instruction 和 response。

outputs property

任务的输出是 evolved_response/s 和 model_name。

mutation_templates_names property

返回提供的 mutation_templates 枚举的名称,即键。

model_post_init(__context)

覆盖此方法以在 __init__model_construct 之后执行其他初始化。如果您想执行一些需要整个模型初始化的验证,这将非常有用。

源代码位于 src/distilabel/steps/tasks/evol_quality/base.py
@override
def model_post_init(self, __context: Any) -> None:
    """Override this method to perform additional initialization after `__init__` and `model_construct`.
    This is useful if you want to do some validation that requires the entire model to be initialized.
    """
    super().model_post_init(__context)
format_input(input)

输入被格式化为 ChatType,假设指令是用户在对话中的首次互动。如果存在 system_prompt,则将其添加为第一条消息。

源代码位于 src/distilabel/steps/tasks/evol_quality/base.py
def format_input(self, input: str) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation. And the
    `system_prompt` is added as the first message if it exists."""
    return [{"role": "user", "content": input}]
format_output(responses)

任务的输出是一个字典,其中包含 evolved_response 或 evolved_responses(具体取决于 store_evolutions 的值为 False 还是 True),以及 model_name。

参数
  • responses (Union[str, List[str]],必需): 要包含在输出中的响应。

返回
  • Dict[str, Any]: 如果 store_evolutions=False,则返回 {"evolved_response": ..., "model_name": ...};如果 store_evolutions=True,则返回 {"evolved_responses": ..., "model_name": ...}。

源代码位于 src/distilabel/steps/tasks/evol_quality/base.py
def format_output(self, responses: Union[str, List[str]]) -> Dict[str, Any]:  # type: ignore
    """The output for the task is a dict with: `evolved_response` or `evolved_responses`,
    depending whether the value is either `False` or `True` for `store_evolutions`, respectively;
    and, finally, the `model_name`.

    Args:
        responses: The responses to be included within the output.

    Returns:
        if `store_evolutions=False` return {"evolved_response": ..., "model_name": ...};
        if `store_evolutions=True` return {"evolved_responses": ..., "model_name": ...}.
    """
    _output = {}

    if not self.store_evolutions:
        _output["evolved_response"] = responses[-1]
    else:
        _output["evolved_responses"] = responses

    _output["model_name"] = self.llm.model_name
    return _output
_apply_random_mutation(instruction, response)

应用从作为 mutation_templates 枚举一部分提供的突变中随机选择的突变,并在突变提示中返回提供的指令。

参数
  • instruction (str,必需): 要包含在突变提示中的指令。
  • response (str,必需): 要包含在突变提示中的响应。

返回
  • str: 带有提供的指令与响应的随机突变提示。
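
下面用一个最小示意说明 <PROMPT> 与 <RESPONSE> 占位符是如何被替换进突变提示的;其中的模板文本只是假设的示例,真实模板以 MUTATION_TEMPLATES 中的内容为准:

```python
# 假设的突变模板,仅用于演示占位符替换;并非 MUTATION_TEMPLATES 的真实内容
template = (
    "Rewrite the following response to make it more helpful.\n"
    "#Prompt#:\n<PROMPT>\n"
    "#Response#:\n<RESPONSE>"
)

prompt = template.replace("<PROMPT>", "common instruction").replace(
    "<RESPONSE>", "a response"
)
print(prompt)
# Rewrite the following response to make it more helpful.
# #Prompt#:
# common instruction
# #Response#:
# a response
```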

源代码位于 src/distilabel/steps/tasks/evol_quality/base.py
def _apply_random_mutation(self, instruction: str, response: str) -> str:
    """Applies a random mutation from the ones provided as part of the `mutation_templates`
    enum, and returns the provided instruction within the mutation prompt.

    Args:
        instruction: The instruction to be included within the mutation prompt.

    Returns:
        A random mutation prompt with the provided instruction.
    """
    mutation = np.random.choice(self.mutation_templates_names)
    return (
        self.mutation_templates[mutation]
        .replace("<PROMPT>", instruction)
        .replace("<RESPONSE>", response)
    )
_evolve_reponses(inputs)

进化作为任务输入一部分提供的响应。

参数
  • inputs (StepInput,必需): 包含任务输入的 Python 字典列表。

返回
  • List[List[str]]: 一个列表,其中每个项目都是一个列表,包含最后一个进化后的响应(如果 store_evolutions=False)或所有进化后的响应(如果 store_evolutions=True)。
  • Dict[str, Any]: 生成过程中收集的统计信息。

源代码位于 src/distilabel/steps/tasks/evol_quality/base.py
def _evolve_reponses(
    self, inputs: "StepInput"
) -> Tuple[List[List[str]], Dict[str, Any]]:
    """Evolves the instructions provided as part of the inputs of the task.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Returns:
        A list where each item is a list with either the last evolved instruction if
        `store_evolutions=False` or all the evolved instructions if `store_evolutions=True`.
    """
    np.random.seed(self.seed)
    instructions: List[List[str]] = [[input["instruction"]] for input in inputs]
    responses: List[List[str]] = [[input["response"]] for input in inputs]
    statistics: Dict[str, Any] = defaultdict(list)

    for iter_no in range(self.num_evolutions):
        formatted_prompts = []
        for instruction, response in zip(instructions, responses):
            formatted_prompts.append(
                self._apply_random_mutation(instruction[-1], response[-1])
            )

        formatted_prompts = [
            self.format_input(prompt) for prompt in formatted_prompts
        ]

        generated_responses = self.llm.generate(
            formatted_prompts,
            **self.llm.generation_kwargs,  # type: ignore
        )
        for response in generated_responses:
            for k, v in response["statistics"].items():
                statistics[k].append(v[0])

        if self.store_evolutions:
            responses = [
                response + [evolved_response["generations"][0]]
                for response, evolved_response in zip(
                    responses, generated_responses
                )
            ]
        else:
            responses = [
                [evolved_response["generations"][0]]
                for evolved_response in generated_responses
            ]

        self._logger.info(
            f"🔄 Ran iteration {iter_no} evolving {len(responses)} responses!"
        )

    return responses, dict(statistics)
process(inputs)

处理任务的输入,并使用 LLM 生成输出。

参数
  • inputs (StepInput,必需): 包含任务输入的 Python 字典列表。

返回
  • StepOutput: 包含任务输出的 Python 字典列表。
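
除 evolved_response/evolved_responses 与 model_name 外,process 还会为每一行附加一个 distilabel_metadata 字段,其中的统计信息以步骤名称作为键;下面的结构仅为示意,其中的步骤名 evol_quality_0 为假设值,实际取决于步骤的 name:

```python
# 每行输出的大致结构(示意):
# {
#     "instruction": "common instruction",
#     "response": "a response",
#     "evolved_response": "evolved response",
#     "model_name": "...",
#     "distilabel_metadata": {"statistics_evol_quality_0": {...}},
# }
```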

源代码位于 src/distilabel/steps/tasks/evol_quality/base.py
@override
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Processes the inputs of the task and generates the outputs using the LLM.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Returns:
        A list of Python dictionaries with the outputs of the task.
    """

    responses, statistics = self._evolve_reponses(inputs)

    if self.store_evolutions:
        # Remove the input instruction from the `evolved_responses` list
        from_ = 1 if not self.include_original_response else 0
        responses = [response[from_:] for response in responses]

    for input, response in zip(inputs, responses):
        input.update(self.format_output(response))
        input.update(
            {"distilabel_metadata": {f"statistics_{self.name}": statistics}}
        )
    yield inputs

    self._logger.info(f"🎉 Finished evolving {len(responses)} instructions!")

GenerateEmbeddings

基类:Step

使用 LLM 的最后隐藏状态生成嵌入。

使用 LLM 的最后隐藏状态为文本输入生成嵌入,如论文“什么使对齐数据良好?指令调优中自动数据选择的综合研究”中所述。

属性

名称 类型 描述
llm LLM

用于生成嵌入的 LLM

输入列
  • text (str, List[Dict[str, str]]): 要为其生成嵌入的输入文本或对话。
输出列
  • embedding (List[float]): 输入文本或对话的嵌入。
  • model_name (str): 用于生成嵌入的模型名称。
类别
  • embedding
  • llm
参考

示例

为输入文本生成嵌入

from distilabel.steps.tasks import GenerateEmbeddings
from distilabel.models.llms.huggingface import TransformersLLM

# Consider this as a placeholder for your actual LLM.
embedder = GenerateEmbeddings(
    llm=TransformersLLM(
        model="TaylorAI/bge-micro-v2",
        model_kwargs={"is_decoder": True},
        cuda_devices=[],
    )
)
embedder.load()

result = next(
    embedder.process(
        [
            {"text": "Hello, how are you?"},
        ]
    )
)
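
result 的大致结构如下(仅为示意;embedding 的具体数值与长度取决于所用模型):

```python
# [
#     {
#         'text': 'Hello, how are you?',
#         'embedding': [-0.012, 0.087, ...],  # List[float],来自 LLM 的最后隐藏状态
#         'model_name': 'TaylorAI/bge-micro-v2'
#     }
# ]
```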
引用
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
源代码位于 src/distilabel/steps/tasks/generate_embeddings.py
class GenerateEmbeddings(Step):
    """Generate embeddings using the last hidden state of an `LLM`.

    Generate embeddings for a text input using the last hidden state of an `LLM`, as
    described in the paper 'What Makes Good Data for Alignment? A Comprehensive Study of
    Automatic Data Selection in Instruction Tuning'.

    Attributes:
        llm: The `LLM` to use to generate the embeddings.

    Input columns:
        - text (`str`, `List[Dict[str, str]]`): The input text or conversation to generate
            embeddings for.

    Output columns:
        - embedding (`List[float]`): The embedding of the input text or conversation.
        - model_name (`str`): The model name used to generate the embeddings.

    Categories:
        - embedding
        - llm

    References:
        - [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https://arxiv.org/abs/2312.15685)

    Examples:
        Rank LLM candidates:

        ```python
        from distilabel.steps.tasks import GenerateEmbeddings
        from distilabel.models.llms.huggingface import TransformersLLM

        # Consider this as a placeholder for your actual LLM.
        embedder = GenerateEmbeddings(
            llm=TransformersLLM(
                model="TaylorAI/bge-micro-v2",
                model_kwargs={"is_decoder": True},
                cuda_devices=[],
            )
        )
        embedder.load()

        result = next(
            embedder.process(
                [
                    {"text": "Hello, how are you?"},
                ]
            )
        )
        ```

    Citations:
        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```
    """

    llm: LLM

    def load(self) -> None:
        """Loads the `LLM` used to generate the embeddings."""
        super().load()

        self.llm.load()

    @property
    def inputs(self) -> "StepColumns":
        """The inputs for the task is a `text` column containing either a string or a
        list of dictionaries in OpenAI chat-like format."""
        return ["text"]

    @property
    def outputs(self) -> "StepColumns":
        """The outputs for the task is an `embedding` column containing the embedding of
        the `text` input."""
        return ["embedding", "model_name"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """Formats the input to be used by the LLM to generate the embeddings. The input
        can be in `ChatType` format or a string. If a string, it will be converted to a
        list of dictionaries in OpenAI chat-like format.

        Args:
            input: The input to format.

        Returns:
            The OpenAI chat-like format of the input.
        """
        text = input["text"] = input["text"]

        # input is in `ChatType` format
        if isinstance(text, str):
            return [{"role": "user", "content": text}]

        if is_openai_format(text):
            return text

        raise DistilabelUserError(
            f"Couldn't format input for step {self.name}. The `text` input column has to"
            " be a string or a list of dictionaries in OpenAI chat-like format.",
            page="components-gallery/tasks/generateembeddings/",
        )

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Generates an embedding for each input using the last hidden state of the `LLM`.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            A list of Python dictionaries with the outputs of the task.
        """
        formatted_inputs = [self.format_input(input) for input in inputs]
        last_hidden_states = self.llm.get_last_hidden_states(formatted_inputs)
        for input, hidden_state in zip(inputs, last_hidden_states):
            input["embedding"] = hidden_state[-1].tolist()
            input["model_name"] = self.llm.model_name
        yield inputs
inputs property

任务的输入是一个 text 列,其中包含字符串或 OpenAI 类聊天格式的字典列表。

outputs property

任务的输出是一个 embedding 列,其中包含 text 输入的嵌入。

load()

加载用于生成嵌入的 LLM

源代码位于 src/distilabel/steps/tasks/generate_embeddings.py
def load(self) -> None:
    """Loads the `LLM` used to generate the embeddings."""
    super().load()

    self.llm.load()
format_input(input)

格式化输入以供 LLM 用于生成嵌入。输入可以是 ChatType 格式或字符串。如果是字符串,则将其转换为 OpenAI 类聊天格式的字典列表。

参数
  • input (Dict[str, Any],必需): 要格式化的输入。

返回
  • ChatType: 输入的 OpenAI 类聊天格式。
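
下面是一个最小示意,展示两种可接受的 text 输入形式(纯字符串与 OpenAI 类聊天格式)分别如何被格式化;其中的 embedder 沿用上文示例中的任务实例:

```python
# 字符串输入会被包装成单条用户消息
embedder.format_input({"text": "Hello, how are you?"})
# -> [{'role': 'user', 'content': 'Hello, how are you?'}]

# 已经是 OpenAI 类聊天格式的输入会原样返回
embedder.format_input({"text": [{"role": "user", "content": "Hello, how are you?"}]})
# -> [{'role': 'user', 'content': 'Hello, how are you?'}]
```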

源代码位于 src/distilabel/steps/tasks/generate_embeddings.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """Formats the input to be used by the LLM to generate the embeddings. The input
    can be in `ChatType` format or a string. If a string, it will be converted to a
    list of dictionaries in OpenAI chat-like format.

    Args:
        input: The input to format.

    Returns:
        The OpenAI chat-like format of the input.
    """
    text = input["text"] = input["text"]

    # input is in `ChatType` format
    if isinstance(text, str):
        return [{"role": "user", "content": text}]

    if is_openai_format(text):
        return text

    raise DistilabelUserError(
        f"Couldn't format input for step {self.name}. The `text` input column has to"
        " be a string or a list of dictionaries in OpenAI chat-like format.",
        page="components-gallery/tasks/generateembeddings/",
    )
process(inputs)

使用 LLM 的最后隐藏状态为每个输入生成嵌入。

参数
  • inputs (StepInput,必需): 包含任务输入的 Python 字典列表。

产生
  • StepOutput: 包含任务输出的 Python 字典列表。

源代码位于 src/distilabel/steps/tasks/generate_embeddings.py
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Generates an embedding for each input using the last hidden state of the `LLM`.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        A list of Python dictionaries with the outputs of the task.
    """
    formatted_inputs = [self.format_input(input) for input in inputs]
    last_hidden_states = self.llm.get_last_hidden_states(formatted_inputs)
    for input, hidden_state in zip(inputs, last_hidden_states):
        input["embedding"] = hidden_state[-1].tolist()
        input["model_name"] = self.llm.model_name
    yield inputs

Genstruct

基类:Task

使用 LLM 从文档中生成指令-响应对。

Genstruct 是一个预定义的任务,旨在从给定的原始文档(包含标题和内容)中生成有效的指令,从而能够从任何原始文本语料库中创建新的、部分合成的指令微调数据集。此任务基于 Nous Research 的 Genstruct 7B 模型,该模型灵感来源于 Ada-Instruct 论文。

注意

Genstruct 提示,即任务本身,实际上可以与任何模型一起使用,但最安全/推荐的选择是使用 NousResearch/Genstruct-7B 作为提供给任务的 LLM,因为它专门为这个特定任务而训练。

属性

名称 类型 描述
_template Union[Template, None]

用于格式化 LLM 输入的 Jinja2 模板。

输入列
  • title (str): 文档的标题。
  • content (str): 文档的内容。
输出列
  • user (str): 用户基于文档的指令。
  • assistant (str): 助手基于用户指令的回复。
  • model_name (str): 用于生成 user 和 assistant 输出的模型名称。
类别
  • 文本生成
  • instruction
  • response
参考

示例

使用标题和内容从原始文档生成指令

from distilabel.steps.tasks import Genstruct
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
genstruct = Genstruct(
    llm=InferenceEndpointsLLM(
        model_id="NousResearch/Genstruct-7B",
    ),
)

genstruct.load()

result = next(
    genstruct.process(
        [
            {"title": "common instruction", "content": "content of the document"},
        ]
    )
)
# result
# [
#     {
#         'title': 'An instruction',
#         'content': 'content of the document',
#         'model_name': 'test',
#         'user': 'An instruction',
#         'assistant': 'content of the document',
#     }
# ]
引用
@misc{cui2023adainstructadaptinginstructiongenerators,
    title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning},
    author={Wanyun Cui and Qianle Wang},
    year={2023},
    eprint={2310.04484},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2310.04484},
}
源代码位于 src/distilabel/steps/tasks/genstruct.py
class Genstruct(Task):
    """Generate a pair of instruction-response from a document using an `LLM`.

    `Genstruct` is a pre-defined task designed to generate valid instructions from a given raw document,
    with the title and the content, enabling the creation of new, partially synthetic instruction finetuning
    datasets from any raw-text corpus. The task is based on the Genstruct 7B model by Nous Research, which is
    inspired in the Ada-Instruct paper.

    Note:
        The Genstruct prompt i.e. the task, can be used with any model really, but the safest / recommended
        option is to use `NousResearch/Genstruct-7B` as the LLM provided to the task, since it was trained
        for this specific task.

    Attributes:
        _template: a Jinja2 template used to format the input for the LLM.

    Input columns:
        - title (`str`): The title of the document.
        - content (`str`): The content of the document.

    Output columns:
        - user (`str`): The user's instruction based on the document.
        - assistant (`str`): The assistant's response based on the user's instruction.
        - model_name (`str`): The model name used to generate the `feedback` and `result`.

    Categories:
        - text-generation
        - instruction
        - response

    References:
        - [Genstruct 7B by Nous Research](https://hugging-face.cn/NousResearch/Genstruct-7B)
        - [Ada-Instruct: Adapting Instruction Generators for Complex Reasoning](https://arxiv.org/abs/2310.04484)

    Examples:
        Generate instructions from raw documents using the title and content:

        ```python
        from distilabel.steps.tasks import Genstruct
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        genstruct = Genstruct(
            llm=InferenceEndpointsLLM(
                model_id="NousResearch/Genstruct-7B",
            ),
        )

        genstruct.load()

        result = next(
            genstruct.process(
                [
                    {"title": "common instruction", "content": "content of the document"},
                ]
            )
        )
        # result
        # [
        #     {
        #         'title': 'An instruction',
        #         'content': 'content of the document',
        #         'model_name': 'test',
        #         'user': 'An instruction',
        #         'assistant': 'content of the document',
        #     }
        # ]
        ```

    Citations:
        ```
        @misc{cui2023adainstructadaptinginstructiongenerators,
            title={Ada-Instruct: Adapting Instruction Generators for Complex Reasoning},
            author={Wanyun Cui and Qianle Wang},
            year={2023},
            eprint={2310.04484},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2310.04484},
        }
        ```
    """

    _template: Union[Template, None] = PrivateAttr(...)

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "genstruct.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task are the `title` and the `content`."""
        return ["title", "content"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    title=input["title"], content=input["content"]
                ),
            }
        ]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are the `user` instruction based on the provided document
        and the `assistant` response based on the user's instruction."""
        return ["user", "assistant", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted so that both the user and the assistant messages are
        captured.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict with the keys `user` and `assistant` containing the content for each role.
        """
        if output is None:
            return {"user": None, "assistant": None}

        matches = re.search(_PARSE_GENSTRUCT_OUTPUT_REGEX, output, re.DOTALL)
        if not matches:
            return {"user": None, "assistant": None}

        return {
            "user": matches.group(1).strip(),
            "assistant": matches.group(2).strip(),
        }
inputs property

任务的输入是 title 和 content。

outputs property

任务的输出是基于提供的文档的 user 指令和基于用户指令的 assistant 回复。

load()

加载 Jinja2 模板。

源代码位于 src/distilabel/steps/tasks/genstruct.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "genstruct.jinja2"
    )

    self._template = Template(open(_path).read())
format_input(input)

输入被格式化为 ChatType,假设指令是用户在对话中的首次互动。

源代码位于 src/distilabel/steps/tasks/genstruct.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                title=input["title"], content=input["content"]
            ),
        }
    ]
format_output(output, input)

输出被格式化,以便用户和助手的消息都被捕获。

参数
  • output (Union[str, None],必需): LLM 的原始输出。
  • input (Dict[str, Any],必需): 任务的输入。

返回
  • Dict[str, Any]: 一个字典,包含键 user 和 assistant,分别对应每个角色的内容。

源代码位于 src/distilabel/steps/tasks/genstruct.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted so that both the user and the assistant messages are
    captured.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict with the keys `user` and `assistant` containing the content for each role.
    """
    if output is None:
        return {"user": None, "assistant": None}

    matches = re.search(_PARSE_GENSTRUCT_OUTPUT_REGEX, output, re.DOTALL)
    if not matches:
        return {"user": None, "assistant": None}

    return {
        "user": matches.group(1).strip(),
        "assistant": matches.group(2).strip(),
    }

ImageGeneration

Bases: ImageTask

使用文本到图像模型根据提示生成图像。

ImageGeneration 是一个预定义的任务,允许从提示生成图像。它适用于 distilabel.models.image_generation 下定义的任何 image_generation 模型,即任何已实现图像生成功能的模型。默认情况下,图像以 base64 字符串格式生成;数据集生成后,可以调用 Distiset.transform_columns_to_image 将其自动转换为 PIL.Image.Image。更多信息请查看文档中的 Image Generation with distilabel 示例。通过 save_artifacts 属性,可以将图像保存在 Hugging Face Hub 存储库的 artifacts 文件夹中。

属性

名称 类型 描述
save_artifacts bool

布尔值,用于将图像 artifacts 保存在其文件夹中。否则,图像的 base64 表示将作为字符串保存。默认为 False。

image_format str

PIL 支持的任何格式。默认为 JPEG

输入列
  • prompt (str): 列名为 prompt 的列,包含生成图像的提示。
输出列
  • image (str): 生成的图像。最初是一个 base64 字符串,以便在管道运行期间保持简单,但在管道结束时返回 distiset 后,可以通过调用 distiset.transform_columns_to_image(<IMAGE_COLUMN>) 将其转换为 Image 对象。
  • image_path (str): 图像保存的路径。仅当 save_artifacts 为 True 时可用。
  • model_name (str): 用于生成图像的模型的名称。
类别
  • image-generation

示例

从提示生成图像

from distilabel.steps.tasks import ImageGeneration
from distilabel.models.image_generation import InferenceEndpointsImageGeneration

igm = InferenceEndpointsImageGeneration(
    model_id="black-forest-labs/FLUX.1-schnell"
)

# save_artifacts defaults to False, so the image is kept as a base64 string; set it to True to save the image artifact (JPEG format by default).
image_gen = ImageGeneration(image_generation_model=igm)

image_gen.load()

result = next(
    image_gen.process(
        [{"prompt": "a white siamese cat"}]
    )
)
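
如上文说明所述,当此任务在完整 pipeline 中运行并返回 distiset 之后,可以将 base64 字符串列转换为 PIL.Image.Image;下面的 pipeline 变量仅为示意:

```python
# distiset = pipeline.run(...)
# distiset = distiset.transform_columns_to_image("image")  # "image" 为本任务的输出列
```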

生成图像并将其作为 artifacts 保存在 Hugging Face Hub 存储库中

from distilabel.steps.tasks import ImageGeneration
# Select the Image Generation model to use
from distilabel.models.image_generation import OpenAIImageGeneration

igm = OpenAIImageGeneration(
    model="dall-e-3",
    api_key="api.key",
    generation_kwargs={
        "size": "1024x1024",
        "quality": "standard",
        "style": "natural"
    }
)

# save_artifacts defaults to False; here it is set to True so the image is stored as an artifact (JPEG format by default) instead of only a base64 string.
image_gen = ImageGeneration(
    image_generation_model=igm,
    save_artifacts=True,
    image_format="JPEG"  # By default will use JPEG, the options available can be seen in PIL documentation.
)

image_gen.load()

result = next(
    image_gen.process(
        [{"prompt": "a white siamese cat"}]
    )
)
源代码位于 src/distilabel/steps/tasks/image_generation.py
class ImageGeneration(ImageTask):
    """Image generation with an image to text model given a prompt.

    `ImageGeneration` is a pre-defined task that allows generating images from a prompt.
    It works with any of the `image_generation` defined under `distilabel.models.image_generation`,
    the models implemented models that allow image generation.
    By default, the images are generated as a base64 string format, and after the dataset
    has been generated, the images can be automatically transformed to `PIL.Image.Image` using
    `Distiset.transform_columns_to_image`. Take a look at the `Image Generation with distilabel`
    example in the documentation for more information.
    Using the `save_artifacts` attribute, the images can be saved on the artifacts folder in the
    hugging face hub repository.

    Attributes:
        save_artifacts: Bool value to save the image artifacts on its folder.
            Otherwise, the base64 representation of the image will be saved as
            a string. Defaults to False.
        image_format: Any of the formats supported by PIL. Defaults to `JPEG`.

    Input columns:
        - prompt (str): A column named prompt with the prompts to generate the images.

    Output columns:
        - image (`str`): The generated image. Initially is a base64 string, for simplicity
            during the pipeline run, but this can be transformed to an Image object after
            distiset is returned at the end of a pipeline by calling
            `distiset.transform_columns_to_image(<IMAGE_COLUMN>)`.
        - image_path (`str`): The path where the image is saved. Only available if `save_artifacts`
            is True.
        - model_name (`str`): The name of the model used to generate the image.

    Categories:
        - image-generation

    Examples:
        Generate an image from a prompt:

        ```python
        from distilabel.steps.tasks import ImageGeneration
        from distilabel.models.image_generation import InferenceEndpointsImageGeneration

        igm = InferenceEndpointsImageGeneration(
            model_id="black-forest-labs/FLUX.1-schnell"
        )

        # save_artifacts=True by default in JPEG format, if set to False, the image will be saved as a string.
        image_gen = ImageGeneration(image_generation_model=igm)

        image_gen.load()

        result = next(
            image_gen.process(
                [{"prompt": "a white siamese cat"}]
            )
        )
        ```

        Generate an image and save them as artifacts in a Hugging Face Hub repository:

        ```python
        from distilabel.steps.tasks import ImageGeneration
        # Select the Image Generation model to use
        from distilabel.models.image_generation import OpenAIImageGeneration

        igm = OpenAIImageGeneration(
            model="dall-e-3",
            api_key="api.key",
            generation_kwargs={
                "size": "1024x1024",
                "quality": "standard",
                "style": "natural"
            }
        )

        # save_artifacts=True by default in JPEG format, if set to False, the image will be saved as a string.
        image_gen = ImageGeneration(
            image_generation_model=igm,
            save_artifacts=True,
            image_format="JPEG"  # By default will use JPEG, the options available can be seen in PIL documentation.
        )

        image_gen.load()

        result = next(
            image_gen.process(
                [{"prompt": "a white siamese cat"}]
            )
        )
        ```
    """

    save_artifacts: bool = False
    image_format: str = "JPEG"

    def load(self) -> None:
        from distilabel.models.image_generation.utils import image_from_str

        super().load()

        self._image_from_str = image_from_str

    @property
    def inputs(self) -> "StepColumns":
        return ["prompt"]

    @property
    def outputs(self) -> "StepColumns":
        return {
            "image": True,
            "image_path": False,
            "model_name": True,
        }

    def format_input(self, input: dict[str, any]) -> str:
        return input["prompt"]

    def format_output(
        self, output: dict[str, any], input: dict[str, any]
    ) -> dict[str, any]:
        image = None
        if img_str := output.get("images"):
            image = img_str[0]  # Grab only the first image

        return {"image": image, "model_name": self.llm.model_name}

    def save(self, **kwargs):
        if not self.save_artifacts:
            from distilabel.utils.serialization import _Serializable

            super(_Serializable).save(**kwargs)

    def process(self, inputs: StepInput) -> "StepOutput":
        formatted_inputs = self._format_inputs(inputs)

        outputs = self.llm.generate_outputs(
            inputs=formatted_inputs,
            num_generations=self.num_generations,
            **self.llm.get_generation_kwargs(),
        )

        task_outputs = []
        for input, input_outputs in zip(inputs, outputs):
            formatted_outputs = self._format_outputs(input_outputs, input)
            for formatted_output in formatted_outputs:
                if self.save_artifacts and (
                    image := formatted_output.get("image", None)
                ):
                    # use prompt as filename
                    prompt_hash = hashlib.md5(input["prompt"].encode()).hexdigest()
                    # Build PIL image to save it
                    image = self._image_from_str(image)

                    self.save_artifact(
                        name="images",
                        write_function=lambda path,
                        prompt_hash=prompt_hash,
                        img=image: img.save(
                            path / f"{prompt_hash}.{self.image_format.lower()}",
                            format=self.image_format,
                        ),
                        metadata={"type": "image"},
                    )
                    formatted_output["image_path"] = (
                        f"artifacts/{self.name}/images/{prompt_hash}.{self.image_format.lower()}"
                    )

                task_outputs.append(
                    {**input, **formatted_output, "model_name": self.llm.model_name}
                )
        yield task_outputs

BitextRetrievalGenerator

Bases: _EmbeddingDataGenerator

使用 LLM 生成双语文本检索数据,以便稍后训练嵌入模型。

BitextRetrievalGenerator 是一个 GeneratorTask,它使用 LLM 生成双语文本检索数据,以便稍后训练嵌入模型。此任务基于论文 "Improving Text Embeddings with Large Language Models",数据根据提供的属性生成,如果未提供,则随机抽样。

属性

名称 类型 描述
source_language str

要生成的数据的源语言,可以是 https://aclanthology.org/2020.acl-main.747.pdf 附录 A 中 XLM-R 列表中检索到的任何语言。

target_language str

要生成的数据的目标语言,可以是 https://aclanthology.org/2020.acl-main.747.pdf 附录 A 中 XLM-R 列表中检索到的任何语言。

unit Optional[Literal['sentence', 'phrase', 'passage']]

要生成的数据的单元,可以是 sentencephrasepassage。默认为 None,表示将随机抽样。

difficulty Optional[Literal['elementary school', 'high school', 'college']]

要生成的查询的难度,可以是 elementary schoolhigh schoolcollege。默认为 None,表示将随机抽样。

high_score Optional[Literal['4', '4.5', '5']]

要生成的查询的高分,可以是 44.55。默认为 None,表示将随机抽样。

low_score Optional[Literal['2.5', '3', '3.5']]

要生成的查询的低分,可以是 2.533.5。默认为 None,表示将随机抽样。

seed int

随机种子,用于在 format_input 方法中进行任何抽样时设置。

输出列
  • S1 (str): LLM 生成的第一个句子。
  • S2 (str): LLM 生成的第二个句子。
  • S3 (str): LLM 生成的第三个句子。
  • model_name (str): 用于生成双语文本检索数据的模型的名称。

示例

生成双语文本检索数据以训练嵌入模型

from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import BitextRetrievalGenerator

with Pipeline("my-pipeline") as pipeline:
    task = BitextRetrievalGenerator(
        source_language="English",
        target_language="Spanish",
        unit="sentence",
        difficulty="elementary school",
        high_score="4",
        low_score="2.5",
        llm=...,
    )

    ...

    task >> ...
源代码位于 src/distilabel/steps/tasks/improving_text_embeddings.py
class BitextRetrievalGenerator(_EmbeddingDataGenerator):
    """Generate bitext retrieval data with an `LLM` to later on train an embedding model.

    `BitextRetrievalGenerator` is a `GeneratorTask` that generates bitext retrieval data with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Attributes:
        source_language: The source language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        target_language: The target language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        unit: The unit of the data to be generated, which can be `sentence`, `phrase`, or `passage`.
            Defaults to `None`, meaning that it will be randomly sampled.
        difficulty: The difficulty of the query to be generated, which can be `elementary school`, `high school`, or `college`.
            Defaults to `None`, meaning that it will be randomly sampled.
        high_score: The high score of the query to be generated, which can be `4`, `4.5`, or `5`.
            Defaults to `None`, meaning that it will be randomly sampled.
        low_score: The low score of the query to be generated, which can be `2.5`, `3`, or `3.5`.
            Defaults to `None`, meaning that it will be randomly sampled.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.

    Output columns:
        - S1 (`str`): the first sentence generated by the `LLM`.
        - S2 (`str`): the second sentence generated by the `LLM`.
        - S3 (`str`): the third sentence generated by the `LLM`.
        - model_name (`str`): the name of the model used to generate the bitext retrieval
            data.

    Examples:
        Generate bitext retrieval data for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import BitextRetrievalGenerator

        with Pipeline("my-pipeline") as pipeline:
            task = BitextRetrievalGenerator(
                source_language="English",
                target_language="Spanish",
                unit="sentence",
                difficulty="elementary school",
                high_score="4",
                low_score="2.5",
                llm=...,
            )

            ...

            task >> ...
        ```
    """

    source_language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )
    target_language: str = Field(
        default=...,
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    unit: Optional[Literal["sentence", "phrase", "passage"]] = None
    difficulty: Optional[Literal["elementary school", "high school", "college"]] = None
    high_score: Optional[Literal["4", "4.5", "5"]] = None
    low_score: Optional[Literal["2.5", "3", "3.5"]] = None

    _template_name: str = PrivateAttr(default="bitext-retrieval")
    _can_be_used_with_offline_batch_generation = True

    @property
    def prompt(self) -> ChatType:
        """Contains the `prompt` to be used in the `process` method, rendering the `_template`; and
        formatted as an OpenAI formatted chat i.e. a `ChatType`, assuming that there's only one turn,
        being from the user with the content being the rendered `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    source_language=self.source_language,
                    target_language=self.target_language,
                    unit=self.unit or random.choice(["sentence", "phrase", "passage"]),
                    difficulty=self.difficulty
                    or random.choice(["elementary school", "high school", "college"]),
                    high_score=self.high_score or random.choice(["4", "4.5", "5"]),
                    low_score=self.low_score or random.choice(["2.5", "3", "3.5"]),
                ).strip(),
            }
        ]  # type: ignore

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return ["S1", "S2", "S3"]
prompt property

包含要在 process 方法中使用的 prompt,渲染 _template;并格式化为 OpenAI 格式的聊天,即 ChatType,假设只有一个回合,来自用户,内容是渲染后的 _template

keys property

包含将从 LLM 输出解析到 Python 字典中的 keys
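
LLM 的输出会被解析成包含上述键的字典;下面的内容仅为示意,按论文中的提示设定,S1 为源语言句子,S2/S3 为目标语言中质量较高/较低的翻译:

```python
# {
#     "S1": "A sentence in English.",
#     "S2": "Una traducción de alta calidad al español.",
#     "S3": "Una traducción menos precisa.",
# }
```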

GenerateLongTextMatchingData

Bases: _EmbeddingDataGeneration

使用 LLM 生成长文本匹配数据,以便稍后训练嵌入模型。

GenerateLongTextMatchingData 是一个 Task,它使用 LLM 生成长文本匹配数据,以便稍后训练嵌入模型。此任务基于论文 "Improving Text Embeddings with Large Language Models",数据根据提供的属性生成,如果未提供,则随机抽样。

注意

理想情况下,此任务应与 EmbeddingTaskGeneratorflatten_tasks=True 以及 category="text-matching-long" 一起使用;以便 LLM 生成一个任务列表,该列表被展平,以便每行包含一个用于 text-matching-long 类别的任务。

属性

名称 类型 描述
language str

要生成的数据的语言,可以是 https://aclanthology.org/2020.acl-main.747.pdf 附录 A 中 XLM-R 列表中检索到的任何语言。

seed int

随机种子,用于在 format_input 方法中进行任何抽样时设置。请注意,在此任务中,seed 无效,因为没有抽样参数。

输入列
  • task (str): 要在生成中使用的任务描述。
输出列
  • input (str): 由 LLM 生成的输入。
  • positive_document (str): 由 LLM 生成的正向文档。
  • model_name (str): 用于生成长文本匹配数据的模型的名称。
参考

示例

生成用于训练嵌入模型的合成长文本匹配数据

from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-matching-long",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateLongTextMatchingData(
        language="English",
        llm=...,  # LLM instance
    )

    task >> generate
源代码位于 src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateLongTextMatchingData(_EmbeddingDataGeneration):
    """Generate long text matching data with an `LLM` to later on train an embedding model.

    `GenerateLongTextMatchingData` is a `Task` that generates long text matching data with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Note:
        Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`
        with the `category="text-matching-long"`; so that the `LLM` generates a list of tasks that
        are flattened so that each row contains a single task for the text-matching-long category.

    Attributes:
        language: The language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.
            Note that in this task the `seed` has no effect since there are no sampling params.

    Input columns:
        - task (`str`): The task description to be used in the generation.

    Output columns:
        - input (`str`): the input generated by the `LLM`.
        - positive_document (`str`): the positive document generated by the `LLM`.
        - model_name (`str`): the name of the model used to generate the long text matching
            data.

    References:
        - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)

    Examples:
        Generate synthetic long text matching data for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateLongTextMatchingData

        with Pipeline("my-pipeline") as pipeline:
            task = EmbeddingTaskGenerator(
                category="text-matching-long",
                flatten_tasks=True,
                llm=...,  # LLM instance
            )

            generate = GenerateLongTextMatchingData(
                language="English",
                llm=...,  # LLM instance
            )

            task >> generate
        ```
    """

    language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    _template_name: str = PrivateAttr(default="long-text-matching")
    _can_be_used_with_offline_batch_generation = True

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """Method to format the input based on the `task` and the provided attributes, or just
        randomly sampling those if not provided. This method will render the `_template` with
        the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
        there's only one turn, being from the user with the content being the rendered `_template`.

        Args:
            input: The input dictionary containing the `task` to be used in the `_template`.

        Returns:
            A list with a single chat containing the user's message with the rendered `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    task=input["task"],
                    language=self.language,
                ).strip(),
            }
        ]

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return ["input", "positive_document"]
keys property

包含将从 LLM 输出解析到 Python 字典中的 keys
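
LLM 的输出会被解析成包含上述键的字典,并与 model_name 一起写入每行输出;下面的内容仅为示意:

```python
# {
#     "task": "Retrieve semantically similar long documents.",
#     "input": "A long document generated by the LLM ...",
#     "positive_document": "A related long document generated by the LLM ...",
#     "model_name": "..."
# }
```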

format_input(input)

基于 task 和提供的属性格式化输入的方法,或者如果未提供,则仅随机抽样这些属性。此方法将使用提供的参数渲染 _template,并返回 OpenAI 格式的聊天,即 ChatType,假设只有一个回合,来自用户,内容是渲染后的 _template

参数
  • input (Dict[str, Any],必需): 包含要在 _template 中使用的 task 的输入字典。

返回
  • ChatType: 一个列表,包含单条用户消息,其内容为渲染后的 _template。

源代码位于 src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """Method to format the input based on the `task` and the provided attributes, or just
    randomly sampling those if not provided. This method will render the `_template` with
    the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
    there's only one turn, being from the user with the content being the rendered `_template`.

    Args:
        input: The input dictionary containing the `task` to be used in the `_template`.

    Returns:
        A list with a single chat containing the user's message with the rendered `_template`.
    """
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                task=input["task"],
                language=self.language,
            ).strip(),
        }
    ]

GenerateShortTextMatchingData

Bases: _EmbeddingDataGeneration

使用 LLM 生成短文本匹配数据,以便稍后训练嵌入模型。

GenerateShortTextMatchingData 是一个 Task,它使用 LLM 生成短文本匹配数据,以便稍后训练嵌入模型。此任务基于论文 "Improving Text Embeddings with Large Language Models",数据根据提供的属性生成,如果未提供,则随机抽样。

注意

理想情况下,此任务应与 EmbeddingTaskGeneratorflatten_tasks=True 以及 category="text-matching-short" 一起使用;以便 LLM 生成一个任务列表,该列表被展平,以便每行包含一个用于 text-matching-short 类别的任务。

属性

名称 类型 描述
language str

要生成的数据的语言,可以是 https://aclanthology.org/2020.acl-main.747.pdf 附录 A 中 XLM-R 列表中检索到的任何语言。

seed int

随机种子,用于在 format_input 方法中进行任何抽样时设置。请注意,在此任务中,seed 无效,因为没有抽样参数。

输入列
  • task (str): 要在生成中使用的任务描述。
输出列
  • input (str): 由 LLM 生成的输入。
  • positive_document (str): 由 LLM 生成的正向文档。
  • model_name (str): 用于生成短文本匹配数据的模型的名称。
参考

示例

生成用于训练嵌入模型的合成短文本匹配数据

from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-matching-short",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateShortTextMatchingData(
        language="English",
        llm=...,  # LLM instance
    )

    task >> generate
源代码位于 src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateShortTextMatchingData(_EmbeddingDataGeneration):
    """Generate short text matching data with an `LLM` to later on train an embedding model.

    `GenerateShortTextMatchingData` is a `Task` that generates short text matching data with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Note:
        Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`
        with the `category="text-matching-short"`; so that the `LLM` generates a list of tasks that
        are flattened so that each row contains a single task for the text-matching-short category.

    Attributes:
        language: The language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.
            Note that in this task the `seed` has no effect since there are no sampling params.

    Input columns:
        - task (`str`): The task description to be used in the generation.

    Output columns:
        - input (`str`): the input generated by the `LLM`.
        - positive_document (`str`): the positive document generated by the `LLM`.
        - model_name (`str`): the name of the model used to generate the short text matching
            data.

    References:
        - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)

    Examples:
        Generate synthetic short text matching data for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateShortTextMatchingData

        with Pipeline("my-pipeline") as pipeline:
            task = EmbeddingTaskGenerator(
                category="text-matching-short",
                flatten_tasks=True,
                llm=...,  # LLM instance
            )

            generate = GenerateShortTextMatchingData(
                language="English",
                llm=...,  # LLM instance
            )

            task >> generate
        ```
    """

    language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    _template_name: str = PrivateAttr(default="short-text-matching")
    _can_be_used_with_offline_batch_generation = True

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """Method to format the input based on the `task` and the provided attributes, or just
        randomly sampling those if not provided. This method will render the `_template` with
                the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
                there's only one turn, being from the user with the content being the rendered `_template`.

                Args:
                    input: The input dictionary containing the `task` to be used in the `_template`.

                Returns:
                    A list with a single chat containing the user's message with the rendered `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    task=input["task"],
                    language=self.language,
                ).strip(),
            }
        ]

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return ["input", "positive_document"]
keys property

包含将从 LLM 输出解析到 Python 字典中的 keys

format_input(input)

基于 task 和提供的属性格式化输入的方法,或者如果未提供,则仅随机抽样这些属性。此方法将使用提供的参数渲染 _template,并返回 OpenAI 格式的聊天,即 ChatType,假设只有一个回合,来自用户,内容是渲染后的 _template

参数
  • input (Dict[str, Any],必需): 包含要在 _template 中使用的 task 的输入字典。

返回
  • ChatType: 一个列表,包含单条用户消息,其内容为渲染后的 _template。
源代码位于 src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """Method to format the input based on the `task` and the provided attributes, or just
    randomly sampling those if not provided. This method will render the `_template` with
    the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
    there's only one turn, being from the user with the content being the rendered `_template`.

    Args:
        input: The input dictionary containing the `task` to be used in the `_template`.

    Returns:
        A list with a single chat containing the user's message with the rendered `_template`.
    """
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                task=input["task"],
                language=self.language,
            ).strip(),
        }
    ]
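
下面是一个示意性片段(非官方示例,假设将 `llm=...` 替换为真实的 `LLM` 实例):它展示了 `format_input` 返回的 `ChatType` 结构,即只包含一条用户消息的列表。

```python
# 示意性草图:format_input 的返回结构(llm 需替换为真实的 LLM 实例)
from distilabel.steps.tasks import GenerateShortTextMatchingData

task = GenerateShortTextMatchingData(
    language="English",
    llm=...,  # LLM instance
)
task.load()  # 加载对应的 Jinja2 模板

chat = task.format_input({"task": "Given a claim, retrieve short sentences that support it."})
# chat 形如:
# [{"role": "user", "content": "<渲染后的 short-text-matching 模板>"}]
```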

GenerateTextClassificationData

Bases: _EmbeddingDataGeneration

使用 LLM 生成文本分类数据,以便稍后训练嵌入模型。

GenerateTextClassificationData 是一个 Task,它使用 LLM 生成文本分类数据,以便稍后训练嵌入模型。此任务基于论文 "Improving Text Embeddings with Large Language Models",数据根据提供的属性生成,如果未提供,则随机抽样。

注意

理想情况下,此任务应与 EmbeddingTaskGeneratorflatten_tasks=True 以及 category="text-classification" 一起使用;以便 LLM 生成一个任务列表,该列表被展平,以便每行包含一个用于 text-classification 类别的任务。

属性

名称 类型 描述
language str

要生成的数据的语言,可以是 https://aclanthology.org/2020.acl-main.747.pdf 附录 A 中 XLM-R 列表中检索到的任何语言。

difficulty Optional[Literal['high school', 'college', 'PhD']]

要生成的查询的难度,可以是 high schoolcollegePhD。默认为 None,表示将随机抽样。

clarity Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]

要生成的查询的清晰度,可以是 clearunderstandable with some effortambiguous。默认为 None,表示将随机抽样。

seed int

随机种子,用于在 format_input 方法中进行任何抽样时设置。

输入列
  • task (str): 要在生成中使用的任务描述。
输出列
  • input_text (str): 由 LLM 生成的输入文本。
  • label (str): 由 LLM 生成的标签。
  • misleading_label (str): 由 LLM 生成的误导性标签。
  • model_name (str): 用于生成文本分类数据的模型的名称。
参考
  • [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)

示例

生成用于训练嵌入模型的合成文本分类数据

from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-classification",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateTextClassificationData(
        language="English",
        difficulty="high school",
        clarity="clear",
        llm=...,  # LLM instance
    )

    task >> generate
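
作为补充,下面给出一个示意性片段(非官方示例,假设 `llm=...` 已替换为真实的 `LLM` 实例,且模型按模板要求输出 JSON),展示单独调用该任务时每一行会产出的列:

```python
# 示意性草图:单独处理一行输入并查看输出列(llm 需替换为真实实例)
generate = GenerateTextClassificationData(
    language="English",
    difficulty="high school",
    clarity="clear",
    llm=...,  # LLM instance
)
generate.load()

result = next(
    generate.process([{"task": "Classify customer reviews as positive or negative."}])
)
# 每一行包含从 LLM 输出解析出的键:
# [{"task": "...", "input_text": "...", "label": "...", "misleading_label": "...", "model_name": "..."}]
```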
源代码位于 src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateTextClassificationData(_EmbeddingDataGeneration):
    """Generate text classification data with an `LLM` to later on train an embedding model.

    `GenerateTextClassificationData` is a `Task` that generates text classification data with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Note:
        Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`
        with the `category="text-classification"`; so that the `LLM` generates a list of tasks that
        are flattened so that each row contains a single task for the text-classification category.

    Attributes:
        language: The language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        difficulty: The difficulty of the query to be generated, which can be `high school`, `college`, or `PhD`.
            Defaults to `None`, meaning that it will be randomly sampled.
        clarity: The clarity of the query to be generated, which can be `clear`, `understandable with some effort`,
            or `ambiguous`. Defaults to `None`, meaning that it will be randomly sampled.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.

    Input columns:
        - task (`str`): The task description to be used in the generation.

    Output columns:
        - input_text (`str`): the input text generated by the `LLM`.
        - label (`str`): the label generated by the `LLM`.
        - misleading_label (`str`): the misleading label generated by the `LLM`.
        - model_name (`str`): the name of the model used to generate the text classification
            data.

    References:
        - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)

    Examples:
        Generate synthetic text classification data for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextClassificationData

        with Pipeline("my-pipeline") as pipeline:
            task = EmbeddingTaskGenerator(
                category="text-classification",
                flatten_tasks=True,
                llm=...,  # LLM instance
            )

            generate = GenerateTextClassificationData(
                language="English",
                difficulty="high school",
                clarity="clear",
                llm=...,  # LLM instance
            )

            task >> generate
        ```
    """

    language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    difficulty: Optional[Literal["high school", "college", "PhD"]] = None
    clarity: Optional[
        Literal["clear", "understandable with some effort", "ambiguous"]
    ] = None

    _template_name: str = PrivateAttr(default="text-classification")
    _can_be_used_with_offline_batch_generation = True

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """Method to format the input based on the `task` and the provided attributes, or just
        randomly sampling those if not provided. This method will render the `_template` with
        the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
        there's only one turn, being from the user with the content being the rendered `_template`.

        Args:
            input: The input dictionary containing the `task` to be used in the `_template`.

        Returns:
            A list with a single chat containing the user's message with the rendered `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    task=input["task"],
                    language=self.language,
                    difficulty=self.difficulty
                    or random.choice(["high school", "college", "PhD"]),
                    clarity=self.clarity
                    or random.choice(
                        ["clear", "understandable with some effort", "ambiguous"]
                    ),
                ).strip(),
            }
        ]

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return ["input_text", "label", "misleading_label"]
keys property

包含将从 LLM 输出解析到 Python 字典中的 keys

format_input(input)

基于 task 和提供的属性格式化输入的方法,或者如果未提供,则仅随机抽样这些属性。此方法将使用提供的参数渲染 _template,并返回 OpenAI 格式的聊天,即 ChatType,假设只有一个回合,来自用户,内容是渲染后的 _template

参数

名称 类型 描述 默认值
input Dict[str, Any]

包含要在 _template 中使用的 task 的输入字典。

必需

返回

类型 描述
ChatType

包含用户消息的单个聊天的列表,其中用户消息包含渲染后的 _template

源代码位于 src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """Method to format the input based on the `task` and the provided attributes, or just
    randomly sampling those if not provided. This method will render the `_template` with
    the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
    there's only one turn, being from the user with the content being the rendered `_template`.

    Args:
        input: The input dictionary containing the `task` to be used in the `_template`.

    Returns:
        A list with a single chat containing the user's message with the rendered `_template`.
    """
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                task=input["task"],
                language=self.language,
                difficulty=self.difficulty
                or random.choice(["high school", "college", "PhD"]),
                clarity=self.clarity
                or random.choice(
                    ["clear", "understandable with some effort", "ambiguous"]
                ),
            ).strip(),
        }
    ]

GenerateTextRetrievalData

Bases: _EmbeddingDataGeneration

使用 LLM 生成文本检索数据,以便稍后训练嵌入模型。

GenerateTextRetrievalData 是一个 Task,它使用 LLM 生成文本检索数据,以便稍后训练嵌入模型。此任务基于论文 "Improving Text Embeddings with Large Language Models",数据根据提供的属性生成,如果未提供,则随机抽样。

注意

理想情况下,此任务应与 EmbeddingTaskGeneratorflatten_tasks=True 以及 category="text-retrieval" 一起使用;以便 LLM 生成一个任务列表,该列表被展平,以便每行包含一个用于 text-retrieval 类别的任务。

属性

名称 类型 描述
language str

要生成的数据的语言,可以是 https://aclanthology.org/2020.acl-main.747.pdf 附录 A 中 XLM-R 列表中检索到的任何语言。

query_type Optional[Literal['extremely long-tail', 'long-tail', 'common']]

要生成的查询的类型,可以是 extremely long-taillong-tailcommon。默认为 None,表示将随机抽样。

query_length Optional[Literal['less than 5 words', '5 to 15 words', 'at least 10 words']]

要生成的查询的长度,可以是 less than 5 words5 to 15 wordsat least 10 words。默认为 None,表示将随机抽样。

difficulty Optional[Literal['high school', 'college', 'PhD']]

要生成的查询的难度,可以是 high schoolcollegePhD。默认为 None,表示将随机抽样。

clarity Optional[Literal['clear', 'understandable with some effort', 'ambiguous']]

要生成的查询的清晰度,可以是 clearunderstandable with some effortambiguous。默认为 None,表示将随机抽样。

num_words Optional[Literal[50, 100, 200, 300, 400, 500]]

要生成的查询中的单词数,可以是 50100200300400500。默认为 None,表示将随机抽样。

seed int

随机种子,用于在 format_input 方法中进行任何抽样时设置。

输入列
  • task (str): 要在生成中使用的任务描述。
输出列
  • user_query (str): 由 LLM 生成的用户查询。
  • positive_document (str): 由 LLM 生成的正向文档。
  • hard_negative_document (str): 由 LLM 生成的困难负例文档。
  • model_name (str): 用于生成文本检索数据的模型的名称。
参考
  • [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)

示例

生成用于训练嵌入模型的合成文本检索数据

from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData

with Pipeline("my-pipeline") as pipeline:
    task = EmbeddingTaskGenerator(
        category="text-retrieval",
        flatten_tasks=True,
        llm=...,  # LLM instance
    )

    generate = GenerateTextRetrievalData(
        language="English",
        query_type="common",
        query_length="5 to 15 words",
        difficulty="high school",
        clarity="clear",
        num_words=100,
        llm=...,  # LLM instance
    )

    task >> generate
源代码位于 src/distilabel/steps/tasks/improving_text_embeddings.py
class GenerateTextRetrievalData(_EmbeddingDataGeneration):
    """Generate text retrieval data with an `LLM` to later on train an embedding model.

    `GenerateTextRetrievalData` is a `Task` that generates text retrieval data with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Note:
        Ideally this task should be used with `EmbeddingTaskGenerator` with `flatten_tasks=True`
        with the `category="text-retrieval"`; so that the `LLM` generates a list of tasks that
        are flattened so that each row contains a single task for the text-retrieval category.

    Attributes:
        language: The language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        query_type: The type of query to be generated, which can be `extremely long-tail`, `long-tail`,
            or `common`. Defaults to `None`, meaning that it will be randomly sampled.
        query_length: The length of the query to be generated, which can be `less than 5 words`, `5 to 15 words`,
            or `at least 10 words`. Defaults to `None`, meaning that it will be randomly sampled.
        difficulty: The difficulty of the query to be generated, which can be `high school`, `college`, or `PhD`.
            Defaults to `None`, meaning that it will be randomly sampled.
        clarity: The clarity of the query to be generated, which can be `clear`, `understandable with some effort`,
            or `ambiguous`. Defaults to `None`, meaning that it will be randomly sampled.
        num_words: The number of words in the query to be generated, which can be `50`, `100`, `200`, `300`, `400`, or `500`.
            Defaults to `None`, meaning that it will be randomly sampled.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.

    Input columns:
        - task (`str`): The task description to be used in the generation.

    Output columns:
        - user_query (`str`): the user query generated by the `LLM`.
        - positive_document (`str`): the positive document generated by the `LLM`.
        - hard_negative_document (`str`): the hard negative document generated by the `LLM`.
        - model_name (`str`): the name of the model used to generate the text retrieval data.

    References:
        - [Improving Text Embeddings with Large Language Models](https://arxiv.org/abs/2401.00368)

    Examples:
        Generate synthetic text retrieval data for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import EmbeddingTaskGenerator, GenerateTextRetrievalData

        with Pipeline("my-pipeline") as pipeline:
            task = EmbeddingTaskGenerator(
                category="text-retrieval",
                flatten_tasks=True,
                llm=...,  # LLM instance
            )

            generate = GenerateTextRetrievalData(
                language="English",
                query_type="common",
                query_length="5 to 15 words",
                difficulty="high school",
                clarity="clear",
                num_words=100,
                llm=...,  # LLM instance
            )

            task >> generate
        ```
    """

    language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    query_type: Optional[Literal["extremely long-tail", "long-tail", "common"]] = None
    query_length: Optional[
        Literal["less than 5 words", "5 to 15 words", "at least 10 words"]
    ] = None
    difficulty: Optional[Literal["high school", "college", "PhD"]] = None
    clarity: Optional[
        Literal["clear", "understandable with some effort", "ambiguous"]
    ] = None
    num_words: Optional[Literal[50, 100, 200, 300, 400, 500]] = None

    _template_name: str = PrivateAttr(default="text-retrieval")
    _can_be_used_with_offline_batch_generation = True

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """Method to format the input based on the `task` and the provided attributes, or just
        randomly sampling those if not provided. This method will render the `_template` with
        the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
        there's only one turn, being from the user with the content being the rendered `_template`.

        Args:
            input: The input dictionary containing the `task` to be used in the `_template`.

        Returns:
            A list with a single chat containing the user's message with the rendered `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    task=input["task"],
                    language=self.language,
                    query_type=self.query_type
                    or random.choice(["extremely long-tail", "long-tail", "common"]),
                    query_length=self.query_length
                    or random.choice(
                        ["less than 5 words", "5 to 15 words", "at least 10 words"]
                    ),
                    difficulty=self.difficulty
                    or random.choice(["high school", "college", "PhD"]),
                    clarity=self.clarity
                    or random.choice(
                        ["clear", "understandable with some effort", "ambiguous"]
                    ),
                    num_words=self.num_words
                    or random.choice([50, 100, 200, 300, 400, 500]),
                ).strip(),
            }
        ]

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return [
            "user_query",
            "positive_document",
            "hard_negative_document",
        ]
keys property

包含将从 LLM 输出解析到 Python 字典中的 keys

format_input(input)

基于 task 和提供的属性格式化输入的方法,或者如果未提供,则仅随机抽样这些属性。此方法将使用提供的参数渲染 _template,并返回 OpenAI 格式的聊天,即 ChatType,假设只有一个回合,来自用户,内容是渲染后的 _template

参数

名称 类型 描述 默认值
input Dict[str, Any]

包含要在 _template 中使用的 task 的输入字典。

必需

返回

类型 描述
ChatType

包含用户消息的单个聊天的列表,其中用户消息包含渲染后的 _template

源代码位于 src/distilabel/steps/tasks/improving_text_embeddings.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """Method to format the input based on the `task` and the provided attributes, or just
    randomly sampling those if not provided. This method will render the `_template` with
    the provided arguments and return an OpenAI formatted chat i.e. a `ChatType`, assuming that
    there's only one turn, being from the user with the content being the rendered `_template`.

    Args:
        input: The input dictionary containing the `task` to be used in the `_template`.

    Returns:
        A list with a single chat containing the user's message with the rendered `_template`.
    """
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                task=input["task"],
                language=self.language,
                query_type=self.query_type
                or random.choice(["extremely long-tail", "long-tail", "common"]),
                query_length=self.query_length
                or random.choice(
                    ["less than 5 words", "5 to 15 words", "at least 10 words"]
                ),
                difficulty=self.difficulty
                or random.choice(["high school", "college", "PhD"]),
                clarity=self.clarity
                or random.choice(
                    ["clear", "understandable with some effort", "ambiguous"]
                ),
                num_words=self.num_words
                or random.choice([50, 100, 200, 300, 400, 500]),
            ).strip(),
        }
    ]
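
当 query_type、difficulty、clarity、num_words 等属性保持 None 时,上述 `format_input` 会按 `self.attr or random.choice([...])` 的模式在每次调用时随机采样。下面是一个最小可运行的示意(仅演示该采样模式,不代表库内部对 `seed` 的具体处理方式):

```python
import random

# 示意:属性为 None 时退回到 random.choice 的随机采样
random.seed(42)  # 固定随机种子以便结果可复现(仅为示意)

difficulty = None
sampled_difficulty = difficulty or random.choice(["high school", "college", "PhD"])
sampled_num_words = None or random.choice([50, 100, 200, 300, 400, 500])

print(sampled_difficulty, sampled_num_words)  # 具体取值取决于随机种子
```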

MonolingualTripletGenerator

Bases: _EmbeddingDataGenerator

使用 LLM 生成单语三元组,以便稍后训练嵌入模型。

MonolingualTripletGenerator 是一个 GeneratorTask,它使用 LLM 生成单语三元组,以便稍后训练嵌入模型。此任务基于论文 "Improving Text Embeddings with Large Language Models",数据根据提供的属性生成,如果未提供,则随机抽样。

属性

名称 类型 描述
language str

要生成的数据的语言,可以是 https://aclanthology.org/2020.acl-main.747.pdf 附录 A 中 XLM-R 列表中检索到的任何语言。

unit Optional[Literal['sentence', 'phrase', 'passage']]

要生成的数据的单元,可以是 sentencephrasepassage。默认为 None,表示将随机抽样。

difficulty Optional[Literal['elementary school', 'high school', 'college']]

要生成的查询的难度,可以是 elementary schoolhigh schoolcollege。默认为 None,表示将随机抽样。

high_score Optional[Literal['4', '4.5', '5']]

要生成的查询的高分,可以是 44.55。默认为 None,表示将随机抽样。

low_score Optional[Literal['2.5', '3', '3.5']]

要生成的查询的低分,可以是 2.533.5。默认为 None,表示将随机抽样。

seed int

随机种子,用于在 format_input 方法中进行任何抽样时设置。

输出列
  • S1 (str): LLM 生成的第一个句子。
  • S2 (str): LLM 生成的第二个句子。
  • S3 (str): LLM 生成的第三个句子。
  • model_name (str): 用于生成单语三元组的模型的名称。

示例

生成用于训练嵌入模型的单语三元组

from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import MonolingualTripletGenerator

with Pipeline("my-pipeline") as pipeline:
    task = MonolingualTripletGenerator(
        language="English",
        unit="sentence",
        difficulty="elementary school",
        high_score="4",
        low_score="2.5",
        llm=...,
    )

    ...

    task >> ...
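
由于 `MonolingualTripletGenerator` 是 `GeneratorTask`,它不需要输入列,可以直接调用 `process()` 产出批次。下面是一个示意性片段(非官方示例,假设 `llm=...` 已替换为真实的 `LLM` 实例):

```python
# 示意性草图:单独运行生成器任务(llm 需替换为真实实例)
from distilabel.steps.tasks import MonolingualTripletGenerator

task = MonolingualTripletGenerator(
    language="English",
    unit="sentence",
    difficulty="elementary school",
    high_score="4",
    low_score="2.5",
    llm=...,  # LLM instance
)
task.load()

batch, last_batch = next(task.process())
# batch 中的每一行形如:
# {"S1": "...", "S2": "...", "S3": "...", "model_name": "..."}
```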
源代码位于 src/distilabel/steps/tasks/improving_text_embeddings.py
class MonolingualTripletGenerator(_EmbeddingDataGenerator):
    """Generate monolingual triplets with an `LLM` to later on train an embedding model.

    `MonolingualTripletGenerator` is a `GeneratorTask` that generates monolingual triplets with an
    `LLM` to later on train an embedding model. The task is based on the paper "Improving
    Text Embeddings with Large Language Models" and the data is generated based on the
    provided attributes, or randomly sampled if not provided.

    Attributes:
        language: The language of the data to be generated, which can be any of the languages
            retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf.
        unit: The unit of the data to be generated, which can be `sentence`, `phrase`, or `passage`.
            Defaults to `None`, meaning that it will be randomly sampled.
        difficulty: The difficulty of the query to be generated, which can be `elementary school`, `high school`, or `college`.
            Defaults to `None`, meaning that it will be randomly sampled.
        high_score: The high score of the query to be generated, which can be `4`, `4.5`, or `5`.
            Defaults to `None`, meaning that it will be randomly sampled.
        low_score: The low score of the query to be generated, which can be `2.5`, `3`, or `3.5`.
            Defaults to `None`, meaning that it will be randomly sampled.
        seed: The random seed to be set in case there's any sampling within the `format_input` method.

    Output columns:
        - S1 (`str`): the first sentence generated by the `LLM`.
        - S2 (`str`): the second sentence generated by the `LLM`.
        - S3 (`str`): the third sentence generated by the `LLM`.
        - model_name (`str`): the name of the model used to generate the monolingual triplets.

    Examples:
        Generate monolingual triplets for training embedding models:

        ```python
        from distilabel.pipeline import Pipeline
        from distilabel.steps.tasks import MonolingualTripletGenerator

        with Pipeline("my-pipeline") as pipeline:
            task = MonolingualTripletGenerator(
                language="English",
                unit="sentence",
                difficulty="elementary school",
                high_score="4",
                low_score="2.5",
                llm=...,
            )

            ...

            task >> ...
        ```
    """

    language: str = Field(
        default="English",
        description="The languages are retrieved from the list of XLM-R in the Appendix A of https://aclanthology.org/2020.acl-main.747.pdf",
    )

    unit: Optional[Literal["sentence", "phrase", "passage"]] = None
    difficulty: Optional[Literal["elementary school", "high school", "college"]] = None
    high_score: Optional[Literal["4", "4.5", "5"]] = None
    low_score: Optional[Literal["2.5", "3", "3.5"]] = None

    _template_name: str = PrivateAttr(default="monolingual-triplet")
    _can_be_used_with_offline_batch_generation = True

    @property
    def prompt(self) -> ChatType:
        """Contains the `prompt` to be used in the `process` method, rendering the `_template`; and
        formatted as an OpenAI formatted chat i.e. a `ChatType`, assuming that there's only one turn,
        being from the user with the content being the rendered `_template`.
        """
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    language=self.language,
                    unit=self.unit or random.choice(["sentence", "phrase", "passage"]),
                    difficulty=self.difficulty
                    or random.choice(["elementary school", "high school", "college"]),
                    high_score=self.high_score or random.choice(["4", "4.5", "5"]),
                    low_score=self.low_score or random.choice(["2.5", "3", "3.5"]),
                ).strip(),
            }
        ]  # type: ignore

    @property
    def keys(self) -> List[str]:
        """Contains the `keys` that will be parsed from the `LLM` output into a Python dict."""
        return ["S1", "S2", "S3"]
prompt property

包含要在 process 方法中使用的 prompt,渲染 _template;并格式化为 OpenAI 格式的聊天,即 ChatType,假设只有一个回合,来自用户,内容是渲染后的 _template

keys property

包含将从 LLM 输出解析到 Python 字典中的 keys

InstructionBacktranslation

基类:Task

使用指令回译进行自我对齐。

属性

名称 类型 描述
_template Optional[Template]

用于指令回译任务的 Jinja2 模板。

输入列
  • instruction (str): 用于评估文本输出的参考指令。
  • generation (str): 要根据给定指令评估的文本输出。
输出列
  • score (str): 基于给定指令的生成结果的分数。
  • reason (str): 提供分数的理由。
  • model_name (str): 用于对生成结果进行评分的模型名称。
类别
  • critique
参考
  • [Self-Alignment with Instruction Backtranslation](https://arxiv.org/abs/2308.06259)

示例

为给定的指令和生成结果生成分数和理由

from distilabel.steps.tasks import InstructionBacktranslation

instruction_backtranslation = InstructionBacktranslation(
        name="instruction_backtranslation",
        llm=llm,
        input_batch_size=10,
        output_mappings={"model_name": "scoring_model"},
    )
instruction_backtranslation.load()

result = next(
    instruction_backtranslation.process(
        [
            {
                "instruction": "How much is 2+2?",
                "generation": "4",
            }
        ]
    )
)
# result
# [
#     {
#         "instruction": "How much is 2+2?",
#         "generation": "4",
#         "score": 3,
#         "reason": "Reason for the generation.",
#         "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
#     }
# ]
引用
@misc{li2024selfalignmentinstructionbacktranslation,
    title={Self-Alignment with Instruction Backtranslation},
    author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Omer Levy and Luke Zettlemoyer and Jason Weston and Mike Lewis},
    year={2024},
    eprint={2308.06259},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2308.06259},
}
源代码位于 src/distilabel/steps/tasks/instruction_backtranslation.py
class InstructionBacktranslation(Task):
    """Self-Alignment with Instruction Backtranslation.

    Attributes:
        _template: the Jinja2 template to use for the Instruction Backtranslation task.

    Input columns:
        - instruction (`str`): The reference instruction to evaluate the text output.
        - generation (`str`): The text output to evaluate for the given instruction.

    Output columns:
        - score (`str`): The score for the generation based on the given instruction.
        - reason (`str`): The reason for the provided score.
        - model_name (`str`): The model name used to score the generation.

    Categories:
        - critique

    References:
        - [`Self-Alignment with Instruction Backtranslation`](https://arxiv.org/abs/2308.06259)

    Examples:
        Generate a score and reason for a given instruction and generation:

        ```python
        from distilabel.steps.tasks import InstructionBacktranslation

        instruction_backtranslation = InstructionBacktranslation(
                name="instruction_backtranslation",
                llm=llm,
                input_batch_size=10,
                output_mappings={"model_name": "scoring_model"},
            )
        instruction_backtranslation.load()

        result = next(
            instruction_backtranslation.process(
                [
                    {
                        "instruction": "How much is 2+2?",
                        "generation": "4",
                    }
                ]
            )
        )
        # result
        # [
        #     {
        #         "instruction": "How much is 2+2?",
        #         "generation": "4",
        #         "score": 3,
        #         "reason": "Reason for the generation.",
        #         "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        #     }
        # ]
        ```

    Citations:
        ```
        @misc{li2024selfalignmentinstructionbacktranslation,
            title={Self-Alignment with Instruction Backtranslation},
            author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Omer Levy and Luke Zettlemoyer and Jason Weston and Mike Lewis},
            year={2024},
            eprint={2308.06259},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2308.06259},
        }
        ```
    """

    _template: Optional["Template"] = PrivateAttr(default=...)
    _can_be_used_with_offline_batch_generation = True

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "instruction-backtranslation.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`, and the `generation` for it."""
        return ["instruction", "generation"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    instruction=input["instruction"], generation=input["generation"]
                ),
            },
        ]

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `score`, `reason` and the `model_name`."""
        return ["score", "reason", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `score` and `reason`. The
        `model_name` will be automatically included within the `process` method of `Task`.

        Args:
            output: a string representing the output of the LLM via the `process` method.
            input: the input to the task, as required by some tasks to format the output.

        Returns:
            A dictionary containing the `score` and the `reason` for the provided `score`.
        """
        pattern = r"(.+?)Score: (\d)"

        matches = None
        if output is not None:
            matches = re.findall(pattern, output, re.DOTALL)
        if matches is None:
            return {"score": None, "reason": None}

        return {
            "score": int(matches[0][1]),
            "reason": matches[0][0].strip(),
        }
inputs property

任务的输入是 instruction 和针对它的 generation

outputs property

任务的输出是 scorereasonmodel_name

load()

加载 Jinja2 模板。

源代码位于 src/distilabel/steps/tasks/instruction_backtranslation.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "instruction-backtranslation.jinja2"
    )

    self._template = Template(open(_path).read())
format_input(input)

输入被格式化为 ChatType,假设指令是用户在对话中的首次互动。

源代码位于 src/distilabel/steps/tasks/instruction_backtranslation.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                instruction=input["instruction"], generation=input["generation"]
            ),
        },
    ]
format_output(output, input)

输出被格式化为包含 scorereason 的字典。model_name 将自动包含在 Taskprocess 方法中。

参数

名称 类型 描述 默认值
output Union[str, None]

一个字符串,表示通过 process 方法获得的 LLM 的输出。

必需
input Dict[str, Any]

任务的输入,某些任务需要输入以格式化输出。

必需

返回

类型 描述
Dict[str, Any]

一个字典,包含 score 和为提供的 score 提供的 reason

源代码位于 src/distilabel/steps/tasks/instruction_backtranslation.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `score` and `reason`. The
    `model_name` will be automatically included within the `process` method of `Task`.

    Args:
        output: a string representing the output of the LLM via the `process` method.
        input: the input to the task, as required by some tasks to format the output.

    Returns:
        A dictionary containing the `score` and the `reason` for the provided `score`.
    """
    pattern = r"(.+?)Score: (\d)"

    matches = None
    if output is not None:
        matches = re.findall(pattern, output, re.DOTALL)
    if matches is None:
        return {"score": None, "reason": None}

    return {
        "score": int(matches[0][1]),
        "reason": matches[0][0].strip(),
    }
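
`format_output` 假设 LLM 的输出以 "Score: <数字>" 结尾,并用上面的正则把分数之前的内容解析为 reason、数字解析为 score。下面是一个最小可运行的示意:

```python
import re

# 示意:用与 format_output 相同的正则解析一段假设的 LLM 输出
output = "The response is correct and directly answers the question.\nScore: 4"
pattern = r"(.+?)Score: (\d)"

matches = re.findall(pattern, output, re.DOTALL)
result = {"score": int(matches[0][1]), "reason": matches[0][0].strip()}
print(result)
# {'score': 4, 'reason': 'The response is correct and directly answers the question.'}
```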

Magpie

Bases: Task, MagpieBase

使用指令微调的 LLM 生成对话。

Magpie 是一种简洁的方法,它允许在没有种子数据或特定系统提示的情况下生成用户指令,这得益于指令微调的 LLM 的自回归能力。由于它们是使用由用户消息和期望的助手输出组成的聊天模板进行微调的,因此指令微调的 LLM 学习到在预查询或预指令标记之后是指令。如果将这些预查询标记发送到 LLM 而没有任何用户消息,则 LLM 将继续生成标记,就像它是用户一样。这个技巧允许从指令微调的 LLM 中“提取”指令。在此指令生成后,它可以再次发送到 LLM 以生成本次的助手回复。此过程可以重复 N 次,从而构建多轮对话。此方法在论文“Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing”中描述。
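
下面用一个概念性示意(非 distilabel 的实际实现,模板字符串与 `llm.generate`、`llm.chat` 的调用均为假设的伪代码)来说明这一"预查询"技巧:

```python
# 概念性伪代码:Magpie 的预查询技巧示意(并非库内部实现)
# 只把聊天模板中"用户回合开始"之前的前缀发给指令微调模型,
# 模型会像用户一样继续补全,从而得到一条指令。
pre_query_prefix = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"  # 假设的 Llama 3 风格前缀
)

# instruction = llm.generate(pre_query_prefix)                       # 第一步:提取指令(伪代码)
# response = llm.chat([{"role": "user", "content": instruction}])    # 第二步:生成助手回复(伪代码)
# 将以上两步重复 N 次即可构建多轮对话。
```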

属性

名称 类型 描述
n_turns RuntimeParameter[PositiveInt]

生成的对话将具有的回合数。默认为 1

end_with_user RuntimeParameter[bool]

对话是否应以用户消息结束。默认为 False

include_system_prompt RuntimeParameter[bool]

是否包含在生成的对话中使用的系统提示。默认为 False

only_instruction RuntimeParameter[bool]

是否仅生成指令。如果此参数为 True,则将忽略 n_turns。默认为 False

system_prompt Optional[RuntimeParameter[Union[List[str], Dict[str, str], Dict[str, Tuple[str, float]], str]]]

可选的系统提示,或系统提示列表(将从中随机选择一个),或系统提示字典(将从中随机选择一个),或系统提示字典及其被选择的概率。随机系统提示将按每个输入/输出批次选择。此系统提示可用于指导指令 LLM 的生成,并引导其生成特定主题的指令。默认为 None

运行时参数
  • n_turns: 生成的对话将具有的回合数。默认为 1
  • end_with_user: 对话是否应以用户消息结束。默认为 False
  • include_system_prompt: 是否包含在生成的对话中使用的系统提示。默认为 False
  • only_instruction: 是否仅生成指令。如果此参数为 True,则将忽略 n_turns。默认为 False
  • system_prompt: 可选的系统提示,或系统提示列表(将从中随机选择一个),或系统提示字典(将从中随机选择一个),或系统提示字典及其被选择的概率。随机系统提示将按每个输入/输出批次选择。此系统提示可用于指导指令 LLM 的生成,并引导其生成特定主题的指令。如果提供的输入包含 system_prompt 列,则将忽略此运行时参数,而使用该列中的值。默认为 None。
输入列
  • system_prompt (str, optional): 可选的系统提示,可以提供该提示来指导指令 LLM 的生成,并引导其生成特定主题的指令。
输出列
  • conversation (ChatType): 生成的对话,它是包含角色和消息的聊天项列表。仅当 only_instruction=False 时。
  • instruction (str): 生成的指令,如果 only_instruction=Truen_turns==1
  • response (str): 生成的回复,如果 n_turns==1
  • system_prompt_key (str, optional): 用于生成对话或指令的系统提示的键。仅当 system_prompt 是字典时。
  • model_name (str): 用于生成 conversationinstruction 的模型名称。
类别
  • 文本生成
  • instruction
参考
  • [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)

示例

使用 Llama 3 8B Instruct 和 TransformersLLM 生成指令

from distilabel.models import TransformersLLM
from distilabel.steps.tasks import Magpie

magpie = Magpie(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 64,
        },
        device="mps",
    ),
    only_instruction=True,
)

magpie.load()

result = next(
    magpie.process(
        inputs=[
            {
                "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
            },
            {
                "system_prompt": "You're an expert florist AI assistant that helps user to erradicate pests in their crops."
            },
        ]
    )
)
# [
#     {'instruction': "That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?"},
#     {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}
# ]

使用 Llama 3 8B Instruct 和 TransformersLLM 生成对话

from distilabel.models import TransformersLLM
from distilabel.steps.tasks import Magpie

magpie = Magpie(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 256,
        },
        device="mps",
    ),
    n_turns=2,
)

magpie.load()

result = next(
    magpie.process(
        inputs=[
            {
                "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
            },
            {
                "system_prompt": "You're an expert florist AI assistant that helps user to erradicate pests in their crops."
            },
        ]
    )
)
# [
#     {
#         'conversation': [
#             {'role': 'system', 'content': "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."},
#             {
#                 'role': 'user',
#                 'content': 'I'm having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x→a f(x) or lim x→a [f(x)]. It is read as "the limit as x approaches a of f
# of x".'
#             },
#             {
#                 'role': 'assistant',
#                 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don't worry, I'm here to help! The notation lim x→a f(x) indeed means "the limit as x approaches a of f of
# x". What it's asking us to do is find the'
#             }
#         ]
#     },
#     {
#         'conversation': [
#             {'role': 'system', 'content': "You're an expert florist AI assistant that helps user to erradicate pests in their crops."},
#             {
#                 'role': 'user',
#                 'content': "As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it
# might be pests or diseases, but I'm not sure which."
#             },
#             {
#                 'role': 'assistant',
#                 'content': "I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1.
# **Aphids**: These small, soft-bodied insects can secrete a sticky substance called"
#             }
#         ]
#     }
# ]
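
不同配置会改变 `Magpie` 产出的列。下面是一个示意性片段(非官方示例,`llm=...` 需替换为真实实例),概括 `outputs` 属性所描述的行为:

```python
# 示意:不同配置下 Magpie 的输出列(llm 需替换为真实实例)
magpie = Magpie(llm=..., only_instruction=True)  # 输出列: instruction, model_name
magpie = Magpie(llm=..., n_turns=1)              # 输出列: instruction, response, model_name
magpie = Magpie(llm=..., n_turns=3)              # 输出列: conversation, model_name
# 若 system_prompt 传入字典,还会额外输出 system_prompt_key。
```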
源代码位于 src/distilabel/steps/tasks/magpie/base.py
class Magpie(Task, MagpieBase):
    """Generates conversations using an instruct fine-tuned LLM.

    Magpie is a neat method that allows generating user instructions with no seed data
    or specific system prompt thanks to the autoregressive capabilities of the instruct
    fine-tuned LLMs. As they were fine-tuned using a chat template composed by a user message
    and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query
    or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the
    LLM without any user message, then the LLM will continue generating tokens as if it was
    the user. This trick allows "extracting" instructions from the instruct fine-tuned LLM.
    After this instruct is generated, it can be sent again to the LLM to generate this time
    an assistant response. This process can be repeated N times allowing to build a multi-turn
    conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from
    Scratch by Prompting Aligned LLMs with Nothing'.

    Attributes:
        n_turns: the number of turns that the generated conversation will have.
            Defaults to `1`.
        end_with_user: whether the conversation should end with a user message.
            Defaults to `False`.
        include_system_prompt: whether to include the system prompt used in the generated
            conversation. Defaults to `False`.
        only_instruction: whether to generate only the instruction. If this argument is
            `True`, then `n_turns` will be ignored. Defaults to `False`.
        system_prompt: an optional system prompt, or a list of system prompts from which
            a random one will be chosen, or a dictionary of system prompts from which a
            random one will be choosen, or a dictionary of system prompts with their probability
            of being chosen. The random system prompt will be chosen per input/output batch.
            This system prompt can be used to guide the generation of the instruct LLM and
            steer it to generate instructions of a certain topic. Defaults to `None`.

    Runtime parameters:
        - `n_turns`: the number of turns that the generated conversation will have. Defaults
            to `1`.
        - `end_with_user`: whether the conversation should end with a user message.
            Defaults to `False`.
        - `include_system_prompt`: whether to include the system prompt used in the generated
            conversation. Defaults to `False`.
        - `only_instruction`: whether to generate only the instruction. If this argument is
            `True`, then `n_turns` will be ignored. Defaults to `False`.
        - `system_prompt`: an optional system prompt or list of system prompts that can
            be used to steer the LLM to generate content of certain topic, guide the style,
            etc. If it's a list of system prompts, then a random system prompt will be chosen
            per input/output batch. If the provided inputs contains a `system_prompt` column,
            then this runtime parameter will be ignored and the one from the column will
            be used. Defaults to `None`.
        - `system_prompt`: an optional system prompt, or a list of system prompts from which
            a random one will be chosen, or a dictionary of system prompts from which a
            random one will be choosen, or a dictionary of system prompts with their probability
            of being chosen. The random system prompt will be chosen per input/output batch.
            This system prompt can be used to guide the generation of the instruct LLM and
            steer it to generate instructions of a certain topic.

    Input columns:
        - system_prompt (`str`, optional): an optional system prompt that can be provided
            to guide the generation of the instruct LLM and steer it to generate instructions
            of certain topic.

    Output columns:
        - conversation (`ChatType`): the generated conversation which is a list of chat
            items with a role and a message. Only if `only_instruction=False`.
        - instruction (`str`): the generated instructions if `only_instruction=True` or `n_turns==1`.
        - response (`str`): the generated response if `n_turns==1`.
        - system_prompt_key (`str`, optional): the key of the system prompt used to generate
            the conversation or instruction. Only if `system_prompt` is a dictionary.
        - model_name (`str`): The model name used to generate the `conversation` or `instruction`.

    Categories:
        - text-generation
        - instruction

    References:
        - [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)

    Examples:
        Generating instructions with Llama 3 8B Instruct and TransformersLLM:

        ```python
        from distilabel.models import TransformersLLM
        from distilabel.steps.tasks import Magpie

        magpie = Magpie(
            llm=TransformersLLM(
                model="meta-llama/Meta-Llama-3-8B-Instruct",
                magpie_pre_query_template="llama3",
                generation_kwargs={
                    "temperature": 1.0,
                    "max_new_tokens": 64,
                },
                device="mps",
            ),
            only_instruction=True,
        )

        magpie.load()

        result = next(
            magpie.process(
                inputs=[
                    {
                        "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
                    },
                    {
                        "system_prompt": "You're an expert florist AI assistant that helps user to erradicate pests in their crops."
                    },
                ]
            )
        )
        # [
        #     {'instruction': "That's me! I'd love some help with solving calculus problems! What kind of calculation are you most effective at? Linear Algebra, derivatives, integrals, optimization?"},
        #     {'instruction': 'I was wondering if there are certain flowers and plants that can be used for pest control?'}
        # ]
        ```

        Generating conversations with Llama 3 8B Instruct and TransformersLLM:

        ```python
        from distilabel.models import TransformersLLM
        from distilabel.steps.tasks import Magpie

        magpie = Magpie(
            llm=TransformersLLM(
                model="meta-llama/Meta-Llama-3-8B-Instruct",
                magpie_pre_query_template="llama3",
                generation_kwargs={
                    "temperature": 1.0,
                    "max_new_tokens": 256,
                },
                device="mps",
            ),
            n_turns=2,
        )

        magpie.load()

        result = next(
            magpie.process(
                inputs=[
                    {
                        "system_prompt": "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."
                    },
                    {
                        "system_prompt": "You're an expert florist AI assistant that helps user to erradicate pests in their crops."
                    },
                ]
            )
        )
        # [
        #     {
        #         'conversation': [
        #             {'role': 'system', 'content': "You're a math expert AI assistant that helps students of secondary school to solve calculus problems."},
        #             {
        #                 'role': 'user',
        #                 'content': 'I\'m having trouble solving the limits of functions in calculus. Could you explain how to work with them? Limits of functions are denoted by lim x→a f(x) or lim x→a [f(x)]. It is read as "the limit as x approaches a of f
        # of x".'
        #             },
        #             {
        #                 'role': 'assistant',
        #                 'content': 'Limits are indeed a fundamental concept in calculus, and understanding them can be a bit tricky at first, but don\'t worry, I\'m here to help! The notation lim x→a f(x) indeed means "the limit as x approaches a of f of
        # x". What it\'s asking us to do is find the'
        #             }
        #         ]
        #     },
        #     {
        #         'conversation': [
        #             {'role': 'system', 'content': "You're an expert florist AI assistant that helps user to erradicate pests in their crops."},
        #             {
        #                 'role': 'user',
        #                 'content': "As a flower shop owner, I'm noticing some unusual worm-like creatures causing damage to my roses and other flowers. Can you help me identify what the problem is? Based on your expertise as a florist AI assistant, I think it
        # might be pests or diseases, but I'm not sure which."
        #             },
        #             {
        #                 'role': 'assistant',
        #                 'content': "I'd be delighted to help you investigate the issue! Since you've noticed worm-like creatures damaging your roses and other flowers, I'll take a closer look at the possibilities. Here are a few potential culprits: 1.
        # **Aphids**: These small, soft-bodied insects can secrete a sticky substance called"
        #             }
        #         ]
        #     }
        # ]
        ```
    """

    def model_post_init(self, __context: Any) -> None:
        """Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`."""
        super().model_post_init(__context)

        if not isinstance(self.llm, MagpieChatTemplateMixin):
            raise DistilabelUserError(
                f"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`."
                f"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.",
                page="components-gallery/tasks/magpie/",
            )

        self.llm.use_magpie_template = True

    @property
    def inputs(self) -> "StepColumns":
        return {"system_prompt": False}

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """Does nothing."""
        return []

    @property
    def outputs(self) -> "StepColumns":
        """Either a multi-turn conversation or the instruction generated."""
        outputs = []

        if self.only_instruction:
            outputs.append("instruction")
        elif self.n_turns == 1:
            outputs.extend(["instruction", "response"])
        else:
            outputs.append("conversation")

        if isinstance(self.system_prompt, dict):
            outputs.append("system_prompt_key")

        outputs.append("model_name")

        return outputs

    def format_output(
        self,
        output: Union[str, None],
        input: Union[Dict[str, Any], None] = None,
    ) -> Dict[str, Any]:
        """Does nothing."""
        return {}

    def process(self, inputs: StepInput) -> "StepOutput":
        """Generate a list of instructions or conversations of the specified number of turns.

        Args:
            inputs: a list of dictionaries that can contain a `system_prompt` key.

        Yields:
            The list of generated conversations.
        """
        yield self._generate_with_pre_query_template(inputs)
outputs property

生成的多轮对话或指令。

model_post_init(__context)

检查提供的 LLM 是否使用了 MagpieChatTemplateMixin

源代码位于 src/distilabel/steps/tasks/magpie/base.py
def model_post_init(self, __context: Any) -> None:
    """Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`."""
    super().model_post_init(__context)

    if not isinstance(self.llm, MagpieChatTemplateMixin):
        raise DistilabelUserError(
            f"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`."
            f"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.",
            page="components-gallery/tasks/magpie/",
        )

    self.llm.use_magpie_template = True
format_input(input)

不做任何操作。

源代码位于 src/distilabel/steps/tasks/magpie/base.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """Does nothing."""
    return []
format_output(output, input=None)

不做任何操作。

源代码位于 src/distilabel/steps/tasks/magpie/base.py
def format_output(
    self,
    output: Union[str, None],
    input: Union[Dict[str, Any], None] = None,
) -> Dict[str, Any]:
    """Does nothing."""
    return {}
process(inputs)

生成指定轮数的指令或对话列表。

参数

名称 类型 描述 默认值
inputs StepInput

可以包含 system_prompt 键的字典列表。

必需

产生

类型 描述
StepOutput

生成的对话列表。

源代码位于 src/distilabel/steps/tasks/magpie/base.py
def process(self, inputs: StepInput) -> "StepOutput":
    """Generate a list of instructions or conversations of the specified number of turns.

    Args:
        inputs: a list of dictionaries that can contain a `system_prompt` key.

    Yields:
        The list of generated conversations.
    """
    yield self._generate_with_pre_query_template(inputs)

MagpieGenerator

基类: GeneratorTask, MagpieBase

生成器任务,使用 Magpie 生成指令或对话。

Magpie 是一种简洁的方法,它允许在没有种子数据或特定系统提示的情况下生成用户指令,这归功于指令微调的 LLM 的自回归能力。由于它们是使用由用户消息和期望的助手输出组成的聊天模板进行微调的,因此指令微调的 LLM 了解到在预查询或预指令令牌之后会跟着一个指令。如果将这些预查询令牌发送到 LLM 而不发送任何用户消息,则 LLM 将继续生成令牌,就像它是用户一样。这个技巧允许从指令微调的 LLM 中“提取”指令。在生成此指令后,可以再次将其发送到 LLM 以生成助手响应。这个过程可以重复 N 次,从而构建多轮对话。这种方法在论文“Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing”中进行了描述。

属性

名称 类型 描述
n_turns RuntimeParameter[PositiveInt]

生成的对话将具有的回合数。默认为 1

end_with_user RuntimeParameter[bool]

对话是否应以用户消息结束。默认为 False

include_system_prompt RuntimeParameter[bool]

是否包含在生成的对话中使用的系统提示。默认为 False

only_instruction RuntimeParameter[bool]

是否仅生成指令。如果此参数为 True,则将忽略 n_turns。默认为 False

system_prompt Optional[RuntimeParameter[Union[List[str], Dict[str, str], Dict[str, Tuple[str, float]], str]]]

可选的系统提示,或系统提示列表(将从中随机选择一个),或系统提示字典(将从中随机选择一个),或系统提示字典及其被选择的概率。随机系统提示将按每个输入/输出批次选择。此系统提示可用于指导指令 LLM 的生成,并引导其生成特定主题的指令。默认为 None

num_rows RuntimeParameter[int]

要生成的行数。

运行时参数
  • n_turns: 生成的对话将具有的回合数。默认为 1
  • end_with_user: 对话是否应以用户消息结束。默认为 False
  • include_system_prompt: 是否包含在生成的对话中使用的系统提示。默认为 False
  • only_instruction: 是否仅生成指令。如果此参数为 True,则将忽略 n_turns。默认为 False
  • system_prompt: 可选的系统提示,或系统提示列表(将从中随机选择一个),或系统提示字典(将从中随机选择一个),或系统提示字典及其被选择的概率。随机系统提示将按每个输入/输出批次选择。此系统提示可用于指导指令 LLM 的生成,并引导其生成特定主题的指令。
  • num_rows: 要生成的行数。
输出列
  • conversation (ChatType): 生成的对话,这是一个聊天项列表,包含角色和消息。
  • instruction (str): 如果 only_instruction=True,则为生成的指令。
  • response (str): 生成的回复,如果 n_turns==1
  • system_prompt_key (str, optional): 用于生成对话或指令的系统提示的键。仅当 system_prompt 是字典时。
  • model_name (str): 用于生成 conversationinstruction 的模型名称。
类别
  • 文本生成
  • instruction
  • generator
参考
  • [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)

示例

使用 Llama 3 8B Instruct 和 TransformersLLM 生成指令

from distilabel.models import TransformersLLM
from distilabel.steps.tasks import MagpieGenerator

generator = MagpieGenerator(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 256,
        },
        device="mps",
    ),
    only_instruction=True,
    num_rows=5,
)

generator.load()

result = next(generator.process())
# (
#       [
#           {"instruction": "I've just bought a new phone and I're excited to start using it."},
#           {"instruction": "What are the most common types of companies that use digital signage?"}
#       ],
#       True
# )

使用 Llama 3 8B Instruct 和 TransformersLLM 生成对话

from distilabel.models import TransformersLLM
from distilabel.steps.tasks import MagpieGenerator

generator = MagpieGenerator(
    llm=TransformersLLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 1.0,
            "max_new_tokens": 64,
        },
        device="mps",
    ),
    n_turns=3,
    num_rows=5,
)

generator.load()

result = next(generator.process())
# (
#     [
#         {
#             'conversation': [
#                 {
#                     'role': 'system',
#                     'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
# insightful responses to help the user with their queries.'
#                 },
#                 {'role': 'user', 'content': "I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?"},
#                 {
#                     'role': 'assistant',
#                     'content': "Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,
# let's break down the basics. First, we need to identify your goals and target audience. What do"
#                 },
#                 {
#                     'role': 'user',
#                     'content': "Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main
# expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating"
#                 },
#                 {
#                     'role': 'assistant',
#                     'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or
# agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'
#                 }
#             ]
#         },
#         {
#             'conversation': [
#                 {
#                     'role': 'system',
#                     'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
# insightful responses to help the user with their queries.'
#                 },
#                 {'role': 'user', 'content': "I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!"},
#                 {
#                     'role': 'assistant',
#                     'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.
# **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'
#                 },
#                 {
#                     'role': 'user',
#                     'content': 'Let me stop you there. Let's explore this "purpose" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? For example, if I're primarily using my
# laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'
#                 },
#                 {
#                     'role': 'assistant',
#                     'content': "Understanding your purpose can greatly impact the type of device you'll need. **General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly
# option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* "
#                 }
#             ]
#         }
#     ],
#     True
# )

使用带有概率的系统提示生成

from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import MagpieGenerator

magpie = MagpieGenerator(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-8B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
        magpie_pre_query_template="llama3",
        generation_kwargs={
            "temperature": 0.8,
            "max_new_tokens": 256,
        },
    ),
    n_turns=2,
    system_prompt={
        "math": ("You're an expert AI assistant.", 0.8),
        "writing": ("You're an expert writing assistant.", 0.2),
    },
)

magpie.load()

result = next(magpie.process())
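
当 system_prompt 为字典时,输出行中会包含 system_prompt_key 列,标明该行使用了哪个系统提示。下面是一个极简示意(假设沿用上例中的 magpie 与 result;变量名仅为示例):

```python
# process() 产出 (rows, is_last_batch) 元组;由于 system_prompt 是字典,
# 每行除 "conversation" 外还会带有 "system_prompt_key" 列。
rows, is_last_batch = result
for row in rows:
    print(row["system_prompt_key"])   # 例如 "math" 或 "writing"
    print(row["conversation"][0])     # 对话中的第一条消息
```
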
引用

```
@misc{xu2024magpiealignmentdatasynthesis,
    title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
    author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
    year={2024},
    eprint={2406.08464},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2406.08464},
}
```
源代码位于 src/distilabel/steps/tasks/magpie/generator.py
class MagpieGenerator(GeneratorTask, MagpieBase):
    """Generator task the generates instructions or conversations using Magpie.

    Magpie is a neat method that allows generating user instructions with no seed data
    or specific system prompt thanks to the autoregressive capabilities of the instruct
    fine-tuned LLMs. As they were fine-tuned using a chat template composed by a user message
    and a desired assistant output, the instruct fine-tuned LLM learns that after the pre-query
    or pre-instruct tokens comes an instruction. If these pre-query tokens are sent to the
    LLM without any user message, then the LLM will continue generating tokens as it was
    the user. This trick allows "extracting" instructions from the instruct fine-tuned LLM.
    After this instruct is generated, it can be sent again to the LLM to generate this time
    an assistant response. This process can be repeated N times allowing to build a multi-turn
    conversation. This method was described in the paper 'Magpie: Alignment Data Synthesis from
    Scratch by Prompting Aligned LLMs with Nothing'.

    Attributes:
        n_turns: the number of turns that the generated conversation will have.
            Defaults to `1`.
        end_with_user: whether the conversation should end with a user message.
            Defaults to `False`.
        include_system_prompt: whether to include the system prompt used in the generated
            conversation. Defaults to `False`.
        only_instruction: whether to generate only the instruction. If this argument is
            `True`, then `n_turns` will be ignored. Defaults to `False`.
        system_prompt: an optional system prompt, or a list of system prompts from which
            a random one will be chosen, or a dictionary of system prompts from which a
            random one will be choosen, or a dictionary of system prompts with their probability
            of being chosen. The random system prompt will be chosen per input/output batch.
            This system prompt can be used to guide the generation of the instruct LLM and
            steer it to generate instructions of a certain topic. Defaults to `None`.
        num_rows: the number of rows to be generated.

    Runtime parameters:
        - `n_turns`: the number of turns that the generated conversation will have. Defaults
            to `1`.
        - `end_with_user`: whether the conversation should end with a user message.
            Defaults to `False`.
        - `include_system_prompt`: whether to include the system prompt used in the generated
            conversation. Defaults to `False`.
        - `only_instruction`: whether to generate only the instruction. If this argument is
            `True`, then `n_turns` will be ignored. Defaults to `False`.
        - `system_prompt`: an optional system prompt, or a list of system prompts from which
            a random one will be chosen, or a dictionary of system prompts from which a
            random one will be choosen, or a dictionary of system prompts with their probability
            of being chosen. The random system prompt will be chosen per input/output batch.
            This system prompt can be used to guide the generation of the instruct LLM and
            steer it to generate instructions of a certain topic.
        - `num_rows`: the number of rows to be generated.

    Output columns:
        - conversation (`ChatType`): the generated conversation which is a list of chat
            items with a role and a message.
        - instruction (`str`): the generated instructions if `only_instruction=True`.
        - response (`str`): the generated response if `n_turns==1`.
        - system_prompt_key (`str`, optional): the key of the system prompt used to generate
            the conversation or instruction. Only if `system_prompt` is a dictionary.
        - model_name (`str`): The model name used to generate the `conversation` or `instruction`.

    Categories:
        - text-generation
        - instruction
        - generator

    References:
        - [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)

    Examples:
        Generating instructions with Llama 3 8B Instruct and TransformersLLM:

        ```python
        from distilabel.models import TransformersLLM
        from distilabel.steps.tasks import MagpieGenerator

        generator = MagpieGenerator(
            llm=TransformersLLM(
                model="meta-llama/Meta-Llama-3-8B-Instruct",
                magpie_pre_query_template="llama3",
                generation_kwargs={
                    "temperature": 1.0,
                    "max_new_tokens": 256,
                },
                device="mps",
            ),
            only_instruction=True,
            num_rows=5,
        )

        generator.load()

        result = next(generator.process())
        # (
        #       [
        #           {"instruction": "I've just bought a new phone and I're excited to start using it."},
        #           {"instruction": "What are the most common types of companies that use digital signage?"}
        #       ],
        #       True
        # )
        ```

        Generating a conversation with Llama 3 8B Instruct and TransformersLLM:

        ```python
        from distilabel.models import TransformersLLM
        from distilabel.steps.tasks import MagpieGenerator

        generator = MagpieGenerator(
            llm=TransformersLLM(
                model="meta-llama/Meta-Llama-3-8B-Instruct",
                magpie_pre_query_template="llama3",
                generation_kwargs={
                    "temperature": 1.0,
                    "max_new_tokens": 64,
                },
                device="mps",
            ),
            n_turns=3,
            num_rows=5,
        )

        generator.load()

        result = next(generator.process())
        # (
        #     [
        #         {
        #             'conversation': [
        #                 {
        #                     'role': 'system',
        #                     'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
        # insightful responses to help the user with their queries.'
        #                 },
        #                 {'role': 'user', 'content': "I'm considering starting a social media campaign for my small business and I're not sure where to start. Can you help?"},
        #                 {
        #                     'role': 'assistant',
        #                     'content': "Exciting endeavor! Creating a social media campaign can be a great way to increase brand awareness, drive website traffic, and ultimately boost sales. I'd be happy to guide you through the process. To get started,
        # let's break down the basics. First, we need to identify your goals and target audience. What do"
        #                 },
        #                 {
        #                     'role': 'user',
        #                     'content': "Before I start a social media campaign, what kind of costs ammol should I expect to pay? There are several factors that contribute to the total cost of running a social media campaign. Let me outline some of the main
        # expenses you might encounter: 1. Time: As the business owner, you'll likely spend time creating"
        #                 },
        #                 {
        #                     'role': 'assistant',
        #                     'content': 'Time is indeed one of the biggest investments when it comes to running a social media campaign! Besides time, you may also incur costs associated with: 2. Content creation: You might need to hire freelancers or
        # agencies to create high-quality content (images, videos, captions) for your social media platforms. 3. Advertising'
        #                 }
        #             ]
        #         },
        #         {
        #             'conversation': [
        #                 {
        #                     'role': 'system',
        #                     'content': 'You are a helpful Al assistant. The user will engage in a multi−round conversation with you,asking initial questions and following up with additional related questions. Your goal is to provide thorough, relevant and
        # insightful responses to help the user with their queries.'
        #                 },
        #                 {'role': 'user', 'content': "I am thinking of buying a new laptop or computer. What are some important factors I should consider when making your decision? I'll make sure to let you know if any other favorites or needs come up!"},
        #                 {
        #                     'role': 'assistant',
        #                     'content': 'Exciting times ahead! When considering a new laptop or computer, there are several key factors to think about to ensure you find the right one for your needs. Here are some crucial ones to get you started: 1.
        # **Purpose**: How will you use your laptop or computer? For work, gaming, video editing,'
        #                 },
        #                 {
        #                     'role': 'user',
        #                     'content': 'Let me stop you there. Let\'s explore this "purpose" factor that you mentioned earlier. Can you elaborate more on what type of devices would be suitable for different purposes? For example, if I\'re primarily using my
        # laptop for general usage like browsing, email, and word processing, would a budget-friendly laptop be sufficient'
        #                 },
        #                 {
        #                     'role': 'assistant',
        #                     'content': "Understanding your purpose can greatly impact the type of device you'll need. **General Usage (Browsing, Email, Word Processing)**: For casual users who mainly use their laptop for daily tasks, a budget-friendly
        # option can be sufficient. Look for laptops with: * Intel Core i3 or i5 processor* "
        #                 }
        #             ]
        #         }
        #     ],
        #     True
        # )
        ```

        Generating with system prompts with probabilities:

        ```python
        from distilabel.models import InferenceEndpointsLLM
        from distilabel.steps.tasks import MagpieGenerator

        magpie = MagpieGenerator(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-8B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
                magpie_pre_query_template="llama3",
                generation_kwargs={
                    "temperature": 0.8,
                    "max_new_tokens": 256,
                },
            ),
            n_turns=2,
            system_prompt={
                "math": ("You're an expert AI assistant.", 0.8),
                "writing": ("You're an expert writing assistant.", 0.2),
            },
        )

        magpie.load()

        result = next(magpie.process())
        ```

    Citations:
        ```
        @misc{xu2024magpiealignmentdatasynthesis,
            title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
            author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
            year={2024},
            eprint={2406.08464},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2406.08464},
        }
        ```
    """

    # TODO: move this to `GeneratorTask`
    num_rows: RuntimeParameter[int] = Field(
        default=None, description="The number of rows to generate."
    )

    def model_post_init(self, __context: Any) -> None:
        """Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`."""
        super().model_post_init(__context)

        if not isinstance(self.llm, MagpieChatTemplateMixin):
            raise DistilabelUserError(
                f"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`."
                f"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.",
                page="components-gallery/tasks/magpiegenerator/",
            )

        self.llm.use_magpie_template = True

    @property
    def outputs(self) -> "StepColumns":
        """Either a multi-turn conversation or the instruction generated."""
        outputs = []

        if self.only_instruction:
            outputs.append("instruction")
        elif self.n_turns == 1:
            outputs.extend(["instruction", "response"])
        else:
            outputs.append("conversation")

        if isinstance(self.system_prompt, dict):
            outputs.append("system_prompt_key")

        outputs.append("model_name")

        return outputs

    def format_output(
        self,
        output: Union[str, None],
        input: Union[Dict[str, Any], None] = None,
    ) -> Dict[str, Any]:
        """Does nothing."""
        return {}

    def process(self, offset: int = 0) -> "GeneratorStepOutput":
        """Generates the desired number of instructions or conversations using Magpie.

        Args:
            offset: The offset to start the generation from. Defaults to `0`.

        Yields:
            The generated instructions or conversations.
        """
        generated = offset

        while generated <= self.num_rows:  # type: ignore
            rows_to_generate = (
                self.num_rows if self.num_rows < self.batch_size else self.batch_size  # type: ignore
            )
            conversations = self._generate_with_pre_query_template(
                inputs=[{} for _ in range(rows_to_generate)]  # type: ignore
            )
            generated += rows_to_generate  # type: ignore
            yield (conversations, generated == self.num_rows)

    @override
    def _sample_input(self) -> "ChatType":
        return self._generate_with_pre_query_template(inputs=[{}])
outputs property

生成的多轮对话或指令。

model_post_init(__context)

检查提供的 LLM 是否使用了 MagpieChatTemplateMixin

源代码位于 src/distilabel/steps/tasks/magpie/generator.py
def model_post_init(self, __context: Any) -> None:
    """Checks that the provided `LLM` uses the `MagpieChatTemplateMixin`."""
    super().model_post_init(__context)

    if not isinstance(self.llm, MagpieChatTemplateMixin):
        raise DistilabelUserError(
            f"`Magpie` task can only be used with an `LLM` that uses the `MagpieChatTemplateMixin`."
            f"`{self.llm.__class__.__name__}` doesn't use the aforementioned mixin.",
            page="components-gallery/tasks/magpiegenerator/",
        )

    self.llm.use_magpie_template = True
format_output(output, input=None)

不做任何操作。

源代码位于 src/distilabel/steps/tasks/magpie/generator.py
def format_output(
    self,
    output: Union[str, None],
    input: Union[Dict[str, Any], None] = None,
) -> Dict[str, Any]:
    """Does nothing."""
    return {}
process(offset=0)

使用 Magpie 生成所需数量的指令或对话。

参数

名称 类型 描述 默认值
offset int

开始生成的偏移量。默认为 0

0

产生

类型 描述
GeneratorStepOutput

生成的指令或对话。
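
下面是一个极简的使用示意(假设沿用上文示例中已 load 的 generator,且已设置 num_rows;变量名仅为示例,并非 distilabel 官方推荐用法):process() 是一个生成器,按批次产出 (rows, is_last_batch) 元组,可据此迭代收集全部生成结果。

```python
# 极简示意:逐批消费 process() 的产出,直到最后一个批次。
all_rows = []
for rows, is_last_batch in generator.process():
    all_rows.extend(rows)
    if is_last_batch:
        break

print(len(all_rows))  # 应接近 num_rows
```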

源代码位于 src/distilabel/steps/tasks/magpie/generator.py
def process(self, offset: int = 0) -> "GeneratorStepOutput":
    """Generates the desired number of instructions or conversations using Magpie.

    Args:
        offset: The offset to start the generation from. Defaults to `0`.

    Yields:
        The generated instructions or conversations.
    """
    generated = offset

    while generated <= self.num_rows:  # type: ignore
        rows_to_generate = (
            self.num_rows if self.num_rows < self.batch_size else self.batch_size  # type: ignore
        )
        conversations = self._generate_with_pre_query_template(
            inputs=[{} for _ in range(rows_to_generate)]  # type: ignore
        )
        generated += rows_to_generate  # type: ignore
        yield (conversations, generated == self.num_rows)

MathShepherdCompleter

基类:Task

Math Shepherd 完成器和自动标注器任务。

此任务负责:给定一条指令的若干候选解决方案,以及作为参考的黄金解决方案,为这些解决方案生成补全,并按照参考论文图 2 中的硬估计方法(公式 3)依据黄金解决方案对其进行标注。该任务的各个属性使其能够灵活地用于不同类型的数据集和 LLM,并允许通过不同字段修改其系统提示和用户提示。在修改它们之前,请先查看当前的默认值,以确保补全能够正确生成。
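
其中"硬估计"的思路是:对于解决方案中的某一步,从该步出发采样 N 个补全,只要其中任意一个补全最终到达黄金答案,该步就标注为正标签(默认 "+"),否则标注为负标签(默认 "-")。下面是一个极简示意(函数名与数据均为假设示例,并非 distilabel 的实际 API):

```python
# 硬估计(论文公式 3)的示意实现:任一补全命中黄金答案 => 正标签。
def hard_estimation_label(
    completions: list[list[str]],   # 每个补全是一个步骤列表,末尾形如 "The answer is: 18"
    golden_answer: str,
    tags: tuple[str, str] = ("+", "-"),
) -> str:
    positive, negative = tags
    for completion in completions:
        if completion and completion[-1] == golden_answer:
            return positive
    return negative

# 用法示例(假设的数据)
completions = [
    ["Step 2: ...", "The answer is: 20"],
    ["Step 2: ...", "The answer is: 18"],
]
print(hard_estimation_label(completions, "The answer is: 18"))  # 输出 "+"
```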

属性

名称 类型 描述
system_prompt Optional[str]

要在补全中使用的系统提示。默认的系统提示已经过检查,在使用 Llama 3.1 的 8B 和 70B 模型时生成了良好的补全,但可以对其进行修改以使其适应所选的模型和数据集。

extra_rules Optional[str]

此字段可用于插入与数据集类型相关的额外规则。例如,在原始论文中,他们使用了 GSM8K 和 MATH 数据集,此字段可用于插入 GSM8K 数据集的规则。

few_shots Optional[str]

少样本示例,用于帮助模型生成补全;请按照您的数据集所需的解决方案格式来编写它们。

N PositiveInt

为每个步骤生成的补全数量,对应于论文中的 N。他们在论文中使用了 8,但可以进行调整。

tags list[str]

要在补全中使用的标签列表,默认标签为 ["+", "-"],与论文中相同,其中第一个用作正标签,第二个用作负标签。可以更新此列表,但它必须是一个包含 2 个元素的列表,其中第一个是正标签,第二个是负标签。

输入列
  • instruction (str): 任务或指令。
  • solutions (List[str]): 任务的解决方案列表。
  • golden_solution (str): 任务的参考解决方案,将用于标注候选解决方案。
输出列
  • solutions (List[str]): 与输入相同的列,其中 "solutions" 列已被修改(为各步骤追加了标签)。
  • model_name (str): 用于生成修订的模型名称。
类别
  • 文本生成
  • labelling
参考
  • Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations(https://arxiv.org/abs/2312.08935)

示例

使用结构化输出(首选方式)通过 Math Shepherd 完成器标注您的步骤

from distilabel.steps.tasks import MathShepherdCompleter
from distilabel.models import InferenceEndpointsLLM

llm=InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    generation_kwargs={
        "temperature": 0.6,
        "max_new_tokens": 1024,
    },
)
task = MathShepherdCompleter(
    llm=llm,
    N=3,
    use_default_structured_output=True
)

task.load()

result = next(
    task.process(
        [
            {
                "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                "golden_solution": ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                "solutions": [
                    ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                    ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],
                ]
            },
        ]
    )
)
# [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
# 'golden_solution': ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"],
# 'solutions': [["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"], ["Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +", "Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +", "Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.", "The answer is: 18"]]}]]

使用 Math Shepherd 完成器标注您的步骤

from distilabel.steps.tasks import MathShepherdCompleter
from distilabel.models import InferenceEndpointsLLM

llm=InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    generation_kwargs={
        "temperature": 0.6,
        "max_new_tokens": 1024,
    },
)
task = MathShepherdCompleter(
    llm=llm,
    N=3
)

task.load()

result = next(
    task.process(
        [
            {
                "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                "golden_solution": ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                "solutions": [
                    ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                    ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],
                ]
            },
        ]
    )
)
# [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
# 'golden_solution': ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"],
# 'solutions': [["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"], ["Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +", "Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +", "Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.", "The answer is: 18"]]}]]

引用

```
@misc{wang2024mathshepherdverifyreinforcellms,
    title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},
    author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},
    year={2024},
    eprint={2312.08935},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2312.08935},
}
```
源代码位于 src/distilabel/steps/tasks/math_shepherd/completer.py
class MathShepherdCompleter(Task):
    """Math Shepherd Completer and auto-labeller task.

    This task is in charge of, given a list of solutions to an instruction, and a golden solution,
    as reference, generate completions for the solutions, and label them according to the golden
    solution using the hard estimation method from figure 2 in the reference paper, Eq. 3.
    The attributes make the task flexible to be used with different types of dataset and LLMs, and
    allow making use of different fields to modify the system and user prompts for it. Before modifying
    them, review the current defaults to ensure the completions are generated correctly.

    Attributes:
        system_prompt: The system prompt to be used in the completions. The default one has been
            checked and generates good completions using Llama 3.1 with 8B and 70B,
            but it can be modified to adapt it to the model and dataset selected.
        extra_rules: This field can be used to insert extra rules relevant to the type of dataset.
            For example, in the original paper they used GSM8K and MATH datasets, and this field
            can be used to insert the rules for the GSM8K dataset.
        few_shots: Few shots to help the model generating the completions, write them in the
            format of the type of solutions wanted for your dataset.
        N: Number of completions to generate for each step, correspond to N in the paper.
            They used 8 in the paper, but it can be adjusted.
        tags: List of tags to be used in the completions, the default ones are ["+", "-"] as in the
            paper, where the first is used as a positive label, and the second as a negative one.
            This can be updated, but it MUST be a list with 2 elements, where the first is the
            positive one, and the second the negative one.

    Input columns:
        - instruction (`str`): The task or instruction.
        - solutions (`List[str]`): List of solutions to the task.
        - golden_solution (`str`): The reference solution to the task, will be used
            to annotate the candidate solutions.

    Output columns:
        - solutions (`List[str]`): The same columns that were used as input, the "solutions" is modified.
        - model_name (`str`): The name of the model used to generate the revision.

    Categories:
        - text-generation
        - labelling

    References:
        - [`Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations`](https://arxiv.org/abs/2312.08935)

    Examples:
        Annotate your steps with the Math Shepherd Completer using the structured outputs (the preferred way):

        ```python
        from distilabel.steps.tasks import MathShepherdCompleter
        from distilabel.models import InferenceEndpointsLLM

        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            generation_kwargs={
                "temperature": 0.6,
                "max_new_tokens": 1024,
            },
        )
        task = MathShepherdCompleter(
            llm=llm,
            N=3,
            use_default_structured_output=True
        )

        task.load()

        result = next(
            task.process(
                [
                    {
                        "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                        "golden_solution": ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                        "solutions": [
                            ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                            ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],
                        ]
                    },
                ]
            )
        )
        # [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
        # 'golden_solution': ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.", "The answer is: 18"],
        # 'solutions': [["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.", "The answer is: 18"], ["Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +", "Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +", "Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.", "The answer is: 18"]]}]]
        ```

        Annotate your steps with the Math Shepherd Completer:

        ```python
        from distilabel.steps.tasks import MathShepherdCompleter
        from distilabel.models import InferenceEndpointsLLM

        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            generation_kwargs={
                "temperature": 0.6,
                "max_new_tokens": 1024,
            },
        )
        task = MathShepherdCompleter(
            llm=llm,
            N=3
        )

        task.load()

        result = next(
            task.process(
                [
                    {
                        "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                        "golden_solution": ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                        "solutions": [
                            ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.", "The answer is: 18"],
                            ['Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking.', 'Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day.', 'Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.', 'The answer is: 18'],
                        ]
                    },
                ]
            )
        )
        # [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
        # 'golden_solution': ["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.", "The answer is: 18"],
        # 'solutions': [["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.", "The answer is: 18"], ["Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +", "Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +", "Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.", "The answer is: 18"]]}]]
        ```

    Citations:

        ```
        @misc{wang2024mathshepherdverifyreinforcellms,
            title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},
            author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},
            year={2024},
            eprint={2312.08935},
            archivePrefix={arXiv},
            primaryClass={cs.AI},
            url={https://arxiv.org/abs/2312.08935},
        }
        ```
    """

    system_prompt: Optional[str] = SYSTEM_PROMPT
    extra_rules: Optional[str] = RULES_GSM8K
    few_shots: Optional[str] = FEW_SHOTS_GSM8K
    N: PositiveInt = 1
    tags: list[str] = ["+", "-"]

    def load(self) -> None:
        super().load()

        if self.system_prompt is not None:
            self.system_prompt = Template(self.system_prompt).render(
                extra_rules=self.extra_rules or "",
                few_shots=self.few_shots or "",
                structured_prompt=SYSTEM_PROMPT_STRUCTURED
                if self.use_default_structured_output
                else "",
            )
        if self.use_default_structured_output:
            self._template = Template(TEMPLATE_STRUCTURED)
        else:
            self._template = Template(TEMPLATE)

    @property
    def inputs(self) -> "StepColumns":
        return ["instruction", "solutions", "golden_solution"]

    @property
    def outputs(self) -> "StepColumns":
        return ["model_name"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        messages = [
            {
                "role": "user",
                "content": self._template.render(
                    instruction=input["instruction"], N=self.N
                ),
            }
        ]
        if self.system_prompt:
            messages.insert(0, {"role": "system", "content": self.system_prompt})
        return messages  # type: ignore

    def _parse_output(self, output: Union[str, None]) -> list[list[str]]:
        if output is None:
            return [[""]] * self.N

        if self.N > 1:
            output_transformed = (  # type: ignore
                self._format_structured_output(output)
                if self.use_default_structured_output
                else output.split("---")
            )
            examples = [split_solution_steps(o) for o in output_transformed]
            # In case there aren't the expected number of completions, we fill it with "", or short the list.
            # This shoulnd't happen if the LLM works as expected, but it's a safety measure as it can be
            # difficult to debug if the completions don't match the solutions.
            if len(examples) < self.N:
                examples.extend([""] * (self.N - len(examples)))  # type: ignore
            elif len(examples) > self.N:
                examples = examples[: self.N]
        else:
            output_transformed = (
                self._format_structured_output(output)[0]
                if self.use_default_structured_output
                else output
            )
            examples = [split_solution_steps(output_transformed)]
        return examples

    def _format_structured_output(self, output: str) -> list[str]:
        default_output = [""] * self.N if self.N else [""]
        if parsed_output := parse_json_response(output):
            solutions = parsed_output["solutions"]
            extracted_solutions = [solution["solution"] for solution in solutions]
            if len(output) != self.N:
                extracted_solutions = default_output
            return extracted_solutions
        return default_output

    def format_output(
        self,
        output: Union[str, None],
        input: Union[Dict[str, Any], None] = None,
    ) -> Dict[str, Any]:
        """Does nothing."""
        return {}

    def process(self, inputs: StepInput) -> "StepOutput":
        """Does the processing of generation completions for the solutions, and annotate
        each step with the logic found in Figure 2 of the paper, with the hard estimation (Eq. (3)).

        Args:
            inputs: Inputs to the step

        Yields:
            Annotated inputs with the completions.
        """

        # A list with all the inputs to be passed to the LLM. Needs another structure to
        # find them afterwards
        prepared_inputs = []
        # Data structure with the indices of the elements.
        # (i, j, k) where i is the input, j is the solution, and k is the completion
        input_positions = []
        golden_answers = []
        for i, input in enumerate(inputs):
            instruction = input["instruction"]
            golden_solution = input["golden_solution"]  # This is a single solution
            golden_answers.append(golden_solution[-1])
            # This contains a list of solutions
            solutions = input["solutions"]
            for j, solution in enumerate(solutions):
                # For each solution, that has K steps, we have to generate N completions
                # for the first K-2 steps (-2 because the last 2 steps are the last step, and
                # the answer itself, which can be directly compared against golden answer)
                prepared_completions = self._prepare_completions(instruction, solution)
                prepared_inputs.extend(prepared_completions)
                input_positions.extend(
                    [(i, j, k) for k in range(len(prepared_completions))]
                )

        # Send the elements in batches to the LLM to speed up the process
        final_outputs = []
        # Added here to simplify testing in case we don't have anything to process
        # TODO: Ensure the statistics has the same shape as all the outputs, raw_outputs, and raw_inputs
        statistics = []
        total_raw_outputs = []
        total_raw_inputs = []
        for inner_batch in batched(prepared_inputs, self.input_batch_size):  # type: ignore
            outputs = self.llm.generate_outputs(
                inputs=inner_batch,
                num_generations=1,
                **self.llm.get_generation_kwargs(),  # type: ignore
            )

            formatted_outputs = []
            stats = []
            raw_outputs = []
            raw_inputs = []
            for i, output in enumerate(outputs):
                generation = output["generations"][0]
                raw_inputs.append(inner_batch[i])
                raw_outputs.append(generation or "")
                formatted_outputs.append(self._parse_output(generation))
                stats.append(output["statistics"])

            final_outputs.extend(formatted_outputs)
            statistics.extend(stats)
            total_raw_outputs.extend(raw_outputs)
            total_raw_inputs.extend(raw_inputs)

        yield self._auto_label(  # type: ignore
            inputs,
            final_outputs,
            input_positions,
            golden_answers,
            statistics,
            total_raw_outputs,
            total_raw_inputs,
        )

    def _prepare_completions(
        self, instruction: str, steps: list[str]
    ) -> List["ChatType"]:
        """Helper method to create, given a solution (a list of steps), and a instruction, the
        texts to be completed by the LLM.

        Args:
            instruction: Instruction of the problem.
            steps: List of steps that are part of the solution.

        Returns:
            List of ChatType, where each ChatType is the prompt corresponding to one of the steps
            to be completed.
        """
        prepared_inputs = []
        # Use the number of completions that correspond to a given instruction/steps pair
        # to find afterwards the input that corresponds to a given completion (to do the labelling)
        num_completions = len(steps[:-2])
        for i in range(1, num_completions + 1):
            to_complete = instruction + " " + "\n".join(steps[:i])
            prepared_inputs.append(self.format_input({"instruction": to_complete}))

        return prepared_inputs

    def _auto_label(
        self,
        inputs: StepInput,
        final_outputs: list[Completions],
        input_positions: list[tuple[int, int, int]],
        golden_answers: list[str],
        statistics: list["LLMStatistics"],
        raw_outputs: list[str],
        raw_inputs: list[str],
    ) -> StepInput:
        """Labels the steps inplace (in the inputs), and returns the inputs.

        Args:
            inputs: The original inputs
            final_outputs: List of generations from the LLM.
                It's organized as a list where the elements sent to the LLM are
                grouped together, then each element contains the completions, and
                each completion is a list of steps.
            input_positions: A list with tuples generated in the process method
                that contains (i, j, k) where i is the index of the input, j is the
                index of the solution, and k is the index of the completion.
            golden_answers: List of golden answers for each input.
            statistics: List of statistics from the LLM.
            raw_outputs: List of raw outputs from the LLM.
            raw_inputs: List of raw inputs to the LLM.

        Returns:
            Inputs annotated.
        """
        for i, (instruction_i, solution_i, step_i) in enumerate(input_positions):
            input = inputs[instruction_i]
            solutions = input["solutions"]
            n_completions = final_outputs[i]
            label = f" {self.tags[1]}"
            for completion in n_completions:
                if len(completion) == 0:
                    # This can be a failed generation
                    label = ""  # Everyting stays the same
                    self._logger.info("Completer failed due to empty completion")
                    continue
                if completion[-1] == golden_answers[instruction_i]:
                    label = f" { self.tags[0]}"
                    # If we found one, it's enough as we are doing Hard Estimation
                    continue
            # In case we had no solutions from the previous step, otherwise we would have
            # an IndexError
            if not solutions[solution_i]:
                continue
            solutions[solution_i][step_i] += label
            inputs[instruction_i]["solutions"] = solutions

        for i, input in enumerate(inputs):
            solutions = input["solutions"]
            new_solutions = []
            for solution in solutions:
                if not solution or (len(solution) == 1):
                    # The generation may fail to generate the expected
                    # completions, or just added an extra empty completion,
                    # we skip it.
                    # Other possible error is having a list of solutions
                    # with a single item, so when we call .pop, we are left
                    # with an empty list, so we skip it too.
                    new_solutions.append(solution)
                    continue

                answer = solution.pop()
                label = (
                    f" {self.tags[0]}"
                    if answer == golden_answers[i]
                    else f" {self.tags[1]}"
                )
                solution[-1] += " " + answer + label
                new_solutions.append(solution)

            # Only add the solutions if the data was properly parsed
            input["solutions"] = new_solutions if new_solutions else input["solutions"]
            input = self._add_metadata(
                input, statistics[i], raw_outputs[i], raw_inputs[i]
            )

        return inputs

    def _add_metadata(
        self,
        input: dict[str, Any],
        statistics: list["LLMStatistics"],
        raw_output: Union[str, None],
        raw_input: Union[list[dict[str, Any]], None],
    ) -> dict[str, Any]:
        """Adds the `distilabel_metadata` to the input.

        This method comes for free in the general Tasks, but as we have reimplemented the `process`,
        we have to repeat it here.

        Args:
            input: The input to add the metadata to.
            statistics: The statistics from the LLM.
            raw_output: The raw output from the LLM.
            raw_input: The raw input to the LLM.

        Returns:
            The input with the metadata added if applies.
        """
        input["model_name"] = self.llm.model_name

        if DISTILABEL_METADATA_KEY not in input:
            input[DISTILABEL_METADATA_KEY] = {}
        # If the solutions are splitted afterwards, the statistics should be splitted
        # to avoid counting extra tokens
        input[DISTILABEL_METADATA_KEY][f"statistics_{self.name}"] = statistics

        # Let some defaults in case something failed and we had None, otherwise when reading
        # the parquet files using pyarrow, the following error will appear:
        # ArrowInvalid: Schema
        if self.add_raw_input:
            input[DISTILABEL_METADATA_KEY][f"raw_input_{self.name}"] = raw_input or [
                {"content": "", "role": ""}
            ]
        if self.add_raw_output:
            input[DISTILABEL_METADATA_KEY][f"raw_output_{self.name}"] = raw_output or ""
        return input

    @override
    def get_structured_output(self) -> dict[str, Any]:
        """Creates the json schema to be passed to the LLM, to enforce generating
        a dictionary with the output which can be directly parsed as a python dictionary.

        The schema corresponds to the following:

        ```python
        from pydantic import BaseModel, Field

        class Solution(BaseModel):
            solution: str = Field(..., description="Step by step solution leading to the final answer")

        class MathShepherdCompleter(BaseModel):
            solutions: list[Solution] = Field(..., description="List of solutions")

        MathShepherdCompleter.model_json_schema()
        ```

        Returns:
            JSON Schema of the response to enforce.
        """
        return {
            "$defs": {
                "Solution": {
                    "properties": {
                        "solution": {
                            "description": "Step by step solution leading to the final answer",
                            "title": "Solution",
                            "type": "string",
                        }
                    },
                    "required": ["solution"],
                    "title": "Solution",
                    "type": "object",
                }
            },
            "properties": {
                "solutions": {
                    "description": "List of solutions",
                    "items": {"$ref": "#/$defs/Solution"},
                    "title": "Solutions",
                    "type": "array",
                }
            },
            "required": ["solutions"],
            "title": "MathShepherdGenerator",
            "type": "object",
        }
format_output(output, input=None)

不做任何操作。

源代码位于 src/distilabel/steps/tasks/math_shepherd/completer.py
def format_output(
    self,
    output: Union[str, None],
    input: Union[Dict[str, Any], None] = None,
) -> Dict[str, Any]:
    """Does nothing."""
    return {}
process(inputs)

为各解决方案生成补全,并按照论文图 2 中的逻辑(硬估计,公式 (3))标注每个步骤。

参数

名称 类型 描述 默认值
inputs StepInput

步骤的输入

必需

产生

类型 描述
StepOutput

带有补全的已标注输入。

源代码位于 src/distilabel/steps/tasks/math_shepherd/completer.py
def process(self, inputs: StepInput) -> "StepOutput":
    """Does the processing of generation completions for the solutions, and annotate
    each step with the logic found in Figure 2 of the paper, with the hard estimation (Eq. (3)).

    Args:
        inputs: Inputs to the step

    Yields:
        Annotated inputs with the completions.
    """

    # A list with all the inputs to be passed to the LLM. Needs another structure to
    # find them afterwards
    prepared_inputs = []
    # Data structure with the indices of the elements.
    # (i, j, k) where i is the input, j is the solution, and k is the completion
    input_positions = []
    golden_answers = []
    for i, input in enumerate(inputs):
        instruction = input["instruction"]
        golden_solution = input["golden_solution"]  # This is a single solution
        golden_answers.append(golden_solution[-1])
        # This contains a list of solutions
        solutions = input["solutions"]
        for j, solution in enumerate(solutions):
            # For each solution, that has K steps, we have to generate N completions
            # for the first K-2 steps (-2 because the last 2 steps are the last step, and
            # the answer itself, which can be directly compared against golden answer)
            prepared_completions = self._prepare_completions(instruction, solution)
            prepared_inputs.extend(prepared_completions)
            input_positions.extend(
                [(i, j, k) for k in range(len(prepared_completions))]
            )

    # Send the elements in batches to the LLM to speed up the process
    final_outputs = []
    # Added here to simplify testing in case we don't have anything to process
    # TODO: Ensure the statistics has the same shape as all the outputs, raw_outputs, and raw_inputs
    statistics = []
    total_raw_outputs = []
    total_raw_inputs = []
    for inner_batch in batched(prepared_inputs, self.input_batch_size):  # type: ignore
        outputs = self.llm.generate_outputs(
            inputs=inner_batch,
            num_generations=1,
            **self.llm.get_generation_kwargs(),  # type: ignore
        )

        formatted_outputs = []
        stats = []
        raw_outputs = []
        raw_inputs = []
        for i, output in enumerate(outputs):
            generation = output["generations"][0]
            raw_inputs.append(inner_batch[i])
            raw_outputs.append(generation or "")
            formatted_outputs.append(self._parse_output(generation))
            stats.append(output["statistics"])

        final_outputs.extend(formatted_outputs)
        statistics.extend(stats)
        total_raw_outputs.extend(raw_outputs)
        total_raw_inputs.extend(raw_inputs)

    yield self._auto_label(  # type: ignore
        inputs,
        final_outputs,
        input_positions,
        golden_answers,
        statistics,
        total_raw_outputs,
        total_raw_inputs,
    )
_prepare_completions(instruction, steps)

辅助方法:给定一个解决方案(步骤列表)和一条指令,创建要由 LLM 补全的文本。

参数

名称 类型 描述 默认值
instruction str

问题的指令。

必需
steps list[str]

作为解决方案一部分的步骤列表。

必需

返回

类型 描述
List[ChatType]

ChatType 列表,其中每个 ChatType 都是对应于要补全的步骤之一的提示。

源代码位于 src/distilabel/steps/tasks/math_shepherd/completer.py
def _prepare_completions(
    self, instruction: str, steps: list[str]
) -> List["ChatType"]:
    """Helper method to create, given a solution (a list of steps), and a instruction, the
    texts to be completed by the LLM.

    Args:
        instruction: Instruction of the problem.
        steps: List of steps that are part of the solution.

    Returns:
        List of ChatType, where each ChatType is the prompt corresponding to one of the steps
        to be completed.
    """
    prepared_inputs = []
    # Use the number of completions that correspond to a given instruction/steps pair
    # to find afterwards the input that corresponds to a given completion (to do the labelling)
    num_completions = len(steps[:-2])
    for i in range(1, num_completions + 1):
        to_complete = instruction + " " + "\n".join(steps[:i])
        prepared_inputs.append(self.format_input({"instruction": to_complete}))

    return prepared_inputs
_auto_label(inputs, final_outputs, input_positions, golden_answers, statistics, raw_outputs, raw_inputs)

就地(在输入中)标注步骤,并返回输入。

参数

名称 类型 描述 默认值
inputs StepInput

原始输入

必需
final_outputs list[Completions]

来自 LLM 的生成列表。它被组织为一个列表,其中发送到 LLM 的元素被分组在一起,然后每个元素包含补全,并且每个补全都是步骤列表。

必需
input_positions list[tuple[int, int, int]]

一个列表,其中包含在 process 方法中生成的元组 (i, j, k),其中 i 是输入的索引,j 是解决方案的索引,k 是补全的索引。

必需
golden_answers list[str]

每个输入的黄金答案列表。

必需
statistics list[LLMStatistics]

来自 LLM 的统计信息列表。

必需
raw_outputs list[str]

来自 LLM 的原始输出列表。

必需
raw_inputs list[str]

到 LLM 的原始输入列表。

必需

返回

类型 描述
StepInput

已标注的输入。

源代码位于 src/distilabel/steps/tasks/math_shepherd/completer.py
def _auto_label(
    self,
    inputs: StepInput,
    final_outputs: list[Completions],
    input_positions: list[tuple[int, int, int]],
    golden_answers: list[str],
    statistics: list["LLMStatistics"],
    raw_outputs: list[str],
    raw_inputs: list[str],
) -> StepInput:
    """Labels the steps inplace (in the inputs), and returns the inputs.

    Args:
        inputs: The original inputs
        final_outputs: List of generations from the LLM.
            It's organized as a list where the elements sent to the LLM are
            grouped together, then each element contains the completions, and
            each completion is a list of steps.
        input_positions: A list with tuples generated in the process method
            that contains (i, j, k) where i is the index of the input, j is the
            index of the solution, and k is the index of the completion.
        golden_answers: List of golden answers for each input.
        statistics: List of statistics from the LLM.
        raw_outputs: List of raw outputs from the LLM.
        raw_inputs: List of raw inputs to the LLM.

    Returns:
        Inputs annotated.
    """
    for i, (instruction_i, solution_i, step_i) in enumerate(input_positions):
        input = inputs[instruction_i]
        solutions = input["solutions"]
        n_completions = final_outputs[i]
        label = f" {self.tags[1]}"
        for completion in n_completions:
            if len(completion) == 0:
                # This can be a failed generation
                label = ""  # Everyting stays the same
                self._logger.info("Completer failed due to empty completion")
                continue
            if completion[-1] == golden_answers[instruction_i]:
                label = f" { self.tags[0]}"
                # If we found one, it's enough as we are doing Hard Estimation
                continue
        # In case we had no solutions from the previous step, otherwise we would have
        # an IndexError
        if not solutions[solution_i]:
            continue
        solutions[solution_i][step_i] += label
        inputs[instruction_i]["solutions"] = solutions

    for i, input in enumerate(inputs):
        solutions = input["solutions"]
        new_solutions = []
        for solution in solutions:
            if not solution or (len(solution) == 1):
                # The generation may fail to generate the expected
                # completions, or just added an extra empty completion,
                # we skip it.
                # Other possible error is having a list of solutions
                # with a single item, so when we call .pop, we are left
                # with an empty list, so we skip it too.
                new_solutions.append(solution)
                continue

            answer = solution.pop()
            label = (
                f" {self.tags[0]}"
                if answer == golden_answers[i]
                else f" {self.tags[1]}"
            )
            solution[-1] += " " + answer + label
            new_solutions.append(solution)

        # Only add the solutions if the data was properly parsed
        input["solutions"] = new_solutions if new_solutions else input["solutions"]
        input = self._add_metadata(
            input, statistics[i], raw_outputs[i], raw_inputs[i]
        )

    return inputs
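
The labelling rule above ("Hard Estimation") can be sketched in isolation: a step receives the positive tag as soon as any of its completions ends in the golden answer, and the negative tag otherwise; empty completions leave the step unlabelled. The default tags ("+" and "-") are an assumption borrowed from the defaults mentioned for FormatPRM below:

```python
from typing import List, Sequence

def hard_estimation_label(
    completions: List[List[str]],
    golden_answer: str,
    tags: Sequence[str] = ("+", "-"),  # assumed defaults: positive tag first, negative second
) -> str:
    """Standalone sketch of the labelling rule used in `_auto_label`."""
    label = f" {tags[1]}"          # assume the step is wrong by default
    for completion in completions:
        if len(completion) == 0:   # failed/empty generation: leave the step untouched
            label = ""
            continue
        if completion[-1] == golden_answer:
            label = f" {tags[0]}"  # one completion reaching the golden answer is enough
    return label

completions = [
    ["Step 2: She makes 9 * 2 = $18.", "The answer is: 18"],  # reaches the golden answer
    ["Step 2: She makes 9 * 3 = $27.", "The answer is: 27"],  # does not
]
print(hard_estimation_label(completions, "The answer is: 18"))  # " +"
```
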
_add_metadata(input, statistics, raw_output, raw_input)

Adds the distilabel_metadata to the input.

This method comes for free in the general Tasks, but since the process method has been reimplemented here, it has to be repeated.

Parameters
  • input (dict[str, Any]), required: The input to add the metadata to.
  • statistics (list[LLMStatistics]), required: The statistics from the LLM.
  • raw_output (Union[str, None]), required: The raw output from the LLM.
  • raw_input (Union[list[dict[str, Any]], None]), required: The raw input to the LLM.

Returns
  • dict[str, Any]: The input with the metadata added, if applicable.

Source code in src/distilabel/steps/tasks/math_shepherd/completer.py
def _add_metadata(
    self,
    input: dict[str, Any],
    statistics: list["LLMStatistics"],
    raw_output: Union[str, None],
    raw_input: Union[list[dict[str, Any]], None],
) -> dict[str, Any]:
    """Adds the `distilabel_metadata` to the input.

    This method comes for free in the general Tasks, but as we have reimplemented the `process`,
    we have to repeat it here.

    Args:
        input: The input to add the metadata to.
        statistics: The statistics from the LLM.
        raw_output: The raw output from the LLM.
        raw_input: The raw input to the LLM.

    Returns:
        The input with the metadata added if applies.
    """
    input["model_name"] = self.llm.model_name

    if DISTILABEL_METADATA_KEY not in input:
        input[DISTILABEL_METADATA_KEY] = {}
    # If the solutions are splitted afterwards, the statistics should be splitted
    # to avoid counting extra tokens
    input[DISTILABEL_METADATA_KEY][f"statistics_{self.name}"] = statistics

    # Let some defaults in case something failed and we had None, otherwise when reading
    # the parquet files using pyarrow, the following error will appear:
    # ArrowInvalid: Schema
    if self.add_raw_input:
        input[DISTILABEL_METADATA_KEY][f"raw_input_{self.name}"] = raw_input or [
            {"content": "", "role": ""}
        ]
    if self.add_raw_output:
        input[DISTILABEL_METADATA_KEY][f"raw_output_{self.name}"] = raw_output or ""
    return input
get_structured_output()

Creates the JSON schema to be passed to the LLM, to enforce generating a dictionary whose output can be directly parsed as a Python dictionary.

The schema corresponds to the following:

from pydantic import BaseModel, Field

class Solution(BaseModel):
    solution: str = Field(..., description="Step by step solution leading to the final answer")

class MathShepherdCompleter(BaseModel):
    solutions: list[Solution] = Field(..., description="List of solutions")

MathShepherdCompleter.model_json_schema()

Returns
  • dict[str, Any]: The JSON schema of the response to enforce.

Source code in src/distilabel/steps/tasks/math_shepherd/completer.py
@override
def get_structured_output(self) -> dict[str, Any]:
    """Creates the json schema to be passed to the LLM, to enforce generating
    a dictionary with the output which can be directly parsed as a python dictionary.

    The schema corresponds to the following:

    ```python
    from pydantic import BaseModel, Field

    class Solution(BaseModel):
        solution: str = Field(..., description="Step by step solution leading to the final answer")

    class MathShepherdCompleter(BaseModel):
        solutions: list[Solution] = Field(..., description="List of solutions")

    MathShepherdCompleter.model_json_schema()
    ```

    Returns:
        JSON Schema of the response to enforce.
    """
    return {
        "$defs": {
            "Solution": {
                "properties": {
                    "solution": {
                        "description": "Step by step solution leading to the final answer",
                        "title": "Solution",
                        "type": "string",
                    }
                },
                "required": ["solution"],
                "title": "Solution",
                "type": "object",
            }
        },
        "properties": {
            "solutions": {
                "description": "List of solutions",
                "items": {"$ref": "#/$defs/Solution"},
                "title": "Solutions",
                "type": "array",
            }
        },
        "required": ["solutions"],
        "title": "MathShepherdGenerator",
        "type": "object",
    }

MathShepherdGenerator

Bases: Task

Math Shepherd solution generator.

This task is in charge of generating completions for a given instruction, in the format expected by the Math Shepherd Completer task. The attributes make the task flexible enough to be used with different types of datasets and LLMs, but we provide examples for the GSM8K and MATH datasets as presented in the original paper. Before modifying them, review the current defaults to ensure the completions are generated correctly. If no golden solution is provided, this task can be used to generate it for a given problem, as well as possible solutions to be then labeled by the Math Shepherd Completer. Only one of solutions or golden_solution will be generated, depending on the value of M.

Attributes

Name Type Description
system_prompt Optional[str]

The system prompt to be used in the completions. The default one has been checked and generates good completions using Llama 3.1 with the 8B and 70B models, but it can be modified to adapt it to the chosen model and dataset. Note that the system prompt includes 2 variables in the Jinja2 template: {{extra_rules}} and {{few_shot}}. These variables are used to include extra rules (for example, to steer the model towards a specific type of response) and few-shot examples. They can be modified to adapt the system prompt to the dataset and model used without changing the full system prompt.

extra_rules Optional[str]

This field can be used to insert extra rules relevant to the type of dataset. For example, the original paper used the GSM8K and MATH datasets, and this field can be used to insert the rules for the GSM8K dataset.

few_shots Optional[str]

Few-shot examples to help the model generate the completions; write them in the format of the type of solutions wanted for your dataset.

M Optional[PositiveInt]

Number of completions to generate for each step. Defaults to 1, which will generate the "golden solution". In this case select a stronger model, as it will be used as the source of truth during labelling. If M is set to a number greater than 1, the task will generate a list of completions to be labeled by the Math Shepherd Completer task.

Input columns
  • instruction (str): The task or instruction.
Output columns
  • golden_solution (str): The step-by-step solution to the instruction. It will be generated if M is equal to 1.
  • solutions (List[List[str]]): A list of possible solutions to the instruction. It will be generated if M is greater than 1.
  • model_name (str): The name of the model used to generate the revision.
Categories
  • text-generation
References
  • Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (https://arxiv.org/abs/2312.08935)

Examples

Generate the solution for a given instruction (prefer a stronger model here):

from distilabel.steps.tasks import MathShepherdGenerator
from distilabel.models import InferenceEndpointsLLM

llm=InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    generation_kwargs={
        "temperature": 0.6,
        "max_new_tokens": 1024,
    },
)
task = MathShepherdGenerator(
    name="golden_solution_generator",
    llm=llm,
)

task.load()

result = next(
    task.process(
        [
            {
                "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
            },
        ]
    )
)
# [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
# 'golden_solution': '["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"]'}]]

Generate M completions for a given instruction (using structured output generation):

from distilabel.steps.tasks import MathShepherdGenerator
from distilabel.models import InferenceEndpointsLLM

llm=InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    generation_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 2048,
    },
)
task = MathShepherdGenerator(
    name="solution_generator",
    llm=llm,
    M=2,
    use_default_structured_output=True,
)

task.load()

result = next(
    task.process(
        [
            {
                "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
            },
        ]
    )
)
# [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
# 'solutions': [["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\u2019s market.", "The answer is: 18"], ["Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +", "Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +", "Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.", "The answer is: 18"]]}]]
Source code in src/distilabel/steps/tasks/math_shepherd/generator.py
class MathShepherdGenerator(Task):
    """Math Shepherd solution generator.

    This task is in charge of generating completions for a given instruction, in the format expected
    by the Math Shepherd Completer task. The attributes make the task flexible to be used with different
    types of dataset and LLMs, but we provide examples for the GSM8K and MATH datasets as presented
    in the original paper. Before modifying them, review the current defaults to ensure the completions
    are generated correctly. This task can be used to generate the golden solutions for a given problem if
    not provided, as well as possible solutions to be then labeled by the Math Shepherd Completer.
    Only one of `solutions` or `golden_solution` will be generated, depending on the value of M.

    Attributes:
        system_prompt: The system prompt to be used in the completions. The default one has been
            checked and generates good completions using Llama 3.1 with 8B and 70B,
            but it can be modified to adapt it to the model and dataset selected.
            Take into account that the system prompt includes 2 variables in the Jinja2 template,
            {{extra_rules}} and {{few_shot}}. These variables are used to include extra rules, for example
            to steer the model towards a specific type of responses, and few shots to add examples.
            They can be modified to adapt the system prompt to the dataset and model used without needing
            to change the full system prompt.
        extra_rules: This field can be used to insert extra rules relevant to the type of dataset.
            For example, in the original paper they used GSM8K and MATH datasets, and this field
            can be used to insert the rules for the GSM8K dataset.
        few_shots: Few shots to help the model generating the completions, write them in the
            format of the type of solutions wanted for your dataset.
        M: Number of completions to generate for each step. By default is set to 1, which will
            generate the "golden_solution". In this case select a stronger model, as it will be used
            as the source of truth during labelling. If M is set to a number greater than 1, the task
            will generate a list of completions to be labeled by the Math Shepherd Completer task.

    Input columns:
        - instruction (`str`): The task or instruction.

    Output columns:
        - golden_solution (`str`): The step by step solution to the instruction.
            It will be generated if M is equal to 1.
        - solutions (`List[List[str]]`): A list of possible solutions to the instruction.
            It will be generated if M is greater than 1.
        - model_name (`str`): The name of the model used to generate the revision.

    Categories:
        - text-generation

    References:
        - [`Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations`](https://arxiv.org/abs/2312.08935)

    Examples:
        Generate the solution for a given instruction (prefer a stronger model here):

        ```python
        from distilabel.steps.tasks import MathShepherdGenerator
        from distilabel.models import InferenceEndpointsLLM

        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            generation_kwargs={
                "temperature": 0.6,
                "max_new_tokens": 1024,
            },
        )
        task = MathShepherdGenerator(
            name="golden_solution_generator",
            llm=llm,
        )

        task.load()

        result = next(
            task.process(
                [
                    {
                        "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                    },
                ]
            )
        )
        # [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
        # 'golden_solution': '["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.", "The answer is: 18"]'}]]
        ```

        Generate M completions for a given instruction (using structured output generation):

        ```python
        from distilabel.steps.tasks import MathShepherdGenerator
        from distilabel.models import InferenceEndpointsLLM

        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            generation_kwargs={
                "temperature": 0.7,
                "max_new_tokens": 2048,
            },
        )
        task = MathShepherdGenerator(
            name="solution_generator",
            llm=llm,
            M=2,
            use_default_structured_output=True,
        )

        task.load()

        result = next(
            task.process(
                [
                    {
                        "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                    },
                ]
            )
        )
        # [[{'instruction': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
        # 'solutions': [["Step 1: Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. -", "Step 2: She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer\\u2019s market.", "The answer is: 18"], ["Step 1: Janets ducks lay 16 eggs per day, and she uses 3 + 4 = <<3+4=7>>7 for eating and baking. +", "Step 2: So she sells 16 - 7 = <<16-7=9>>9 duck eggs every day. +", "Step 3: Those 9 eggs are worth 9 * $2 = $<<9*2=18>>18.", "The answer is: 18"]]}]]
        ```
    """

    system_prompt: Optional[str] = SYSTEM_PROMPT
    extra_rules: Optional[str] = RULES_GSM8K
    few_shots: Optional[str] = FEW_SHOTS_GSM8K
    M: Optional[PositiveInt] = None

    def load(self) -> None:
        super().load()
        if self.system_prompt is not None:
            self.system_prompt = Template(self.system_prompt).render(
                extra_rules=self.extra_rules or "",
                few_shots=self.few_shots or "",
                structured_prompt=SYSTEM_PROMPT_STRUCTURED
                if self.use_default_structured_output
                else "",
            )
        if self.use_default_structured_output:
            self._template = Template(TEMPLATE_STRUCTURED)
        else:
            self._template = Template(TEMPLATE)

    @property
    def inputs(self) -> "StepColumns":
        return ["instruction"]

    @property
    def outputs(self) -> "StepColumns":
        if self.M:
            return ["solutions", "model_name"]
        return ["golden_solution", "model_name"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        messages = [
            {
                "role": "user",
                "content": self._template.render(
                    instruction=input["instruction"],
                    M=self.M,
                ),
            }
        ]
        if self.system_prompt:
            messages.insert(0, {"role": "system", "content": self.system_prompt})
        return messages

    def format_output(
        self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
    ) -> Dict[str, Any]:
        output_name = "solutions" if self.M else "golden_solution"

        if output is None:
            input.update(**{output_name: None})
            return input

        if self.M:
            output_parsed = (
                self._format_structured_output(output)
                if self.use_default_structured_output
                else output.split("---")
            )
            solutions = [split_solution_steps(o) for o in output_parsed]
        else:
            output_parsed = (
                self._format_structured_output(output)[0]
                if self.use_default_structured_output
                else output
            )
            solutions = split_solution_steps(output_parsed)

        input.update(**{output_name: solutions})
        return input

    @override
    def get_structured_output(self) -> dict[str, Any]:
        """Creates the json schema to be passed to the LLM, to enforce generating
        a dictionary with the output which can be directly parsed as a python dictionary.

        The schema corresponds to the following:

        ```python
        from pydantic import BaseModel, Field

        class Solution(BaseModel):
            solution: str = Field(..., description="Step by step solution leading to the final answer")

        class MathShepherdGenerator(BaseModel):
            solutions: list[Solution] = Field(..., description="List of solutions")

        MathShepherdGenerator.model_json_schema()
        ```

        Returns:
            JSON Schema of the response to enforce.
        """
        return {
            "$defs": {
                "Solution": {
                    "properties": {
                        "solution": {
                            "description": "Step by step solution leading to the final answer",
                            "title": "Solution",
                            "type": "string",
                        }
                    },
                    "required": ["solution"],
                    "title": "Solution",
                    "type": "object",
                }
            },
            "properties": {
                "solutions": {
                    "description": "List of solutions",
                    "items": {"$ref": "#/$defs/Solution"},
                    "title": "Solutions",
                    "type": "array",
                }
            },
            "required": ["solutions"],
            "title": "MathShepherdGenerator",
            "type": "object",
        }

    def _format_structured_output(self, output: str) -> list[str]:
        default_output = [""] * self.M if self.M else [""]
        if parsed_output := parse_json_response(output):
            solutions = parsed_output["solutions"]
            extracted_solutions = [o["solution"] for o in solutions]
            if len(extracted_solutions) != self.M:
                extracted_solutions = default_output
            return extracted_solutions
        return default_output
get_structured_output()

Creates the JSON schema to be passed to the LLM, to enforce generating a dictionary whose output can be directly parsed as a Python dictionary.

The schema corresponds to the following:

from pydantic import BaseModel, Field

class Solution(BaseModel):
    solution: str = Field(..., description="Step by step solution leading to the final answer")

class MathShepherdGenerator(BaseModel):
    solutions: list[Solution] = Field(..., description="List of solutions")

MathShepherdGenerator.model_json_schema()

Returns
  • dict[str, Any]: The JSON schema of the response to enforce.

Source code in src/distilabel/steps/tasks/math_shepherd/generator.py
@override
def get_structured_output(self) -> dict[str, Any]:
    """Creates the json schema to be passed to the LLM, to enforce generating
    a dictionary with the output which can be directly parsed as a python dictionary.

    The schema corresponds to the following:

    ```python
    from pydantic import BaseModel, Field

    class Solution(BaseModel):
        solution: str = Field(..., description="Step by step solution leading to the final answer")

    class MathShepherdGenerator(BaseModel):
        solutions: list[Solution] = Field(..., description="List of solutions")

    MathShepherdGenerator.model_json_schema()
    ```

    Returns:
        JSON Schema of the response to enforce.
    """
    return {
        "$defs": {
            "Solution": {
                "properties": {
                    "solution": {
                        "description": "Step by step solution leading to the final answer",
                        "title": "Solution",
                        "type": "string",
                    }
                },
                "required": ["solution"],
                "title": "Solution",
                "type": "object",
            }
        },
        "properties": {
            "solutions": {
                "description": "List of solutions",
                "items": {"$ref": "#/$defs/Solution"},
                "title": "Solutions",
                "type": "array",
            }
        },
        "required": ["solutions"],
        "title": "MathShepherdGenerator",
        "type": "object",
    }
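
As a rough, hedged illustration of how this schema constrains the LLM output, the equivalent Pydantic models shown in the docstring can be used to parse a made-up raw response; the outer model is renamed here only to avoid confusion with the task class:

```python
from pydantic import BaseModel, Field

class Solution(BaseModel):
    solution: str = Field(..., description="Step by step solution leading to the final answer")

class MathShepherdSolutions(BaseModel):
    # Renamed from `MathShepherdGenerator` in the docstring to avoid clashing with the task class.
    solutions: list[Solution] = Field(..., description="List of solutions")

# Hypothetical raw LLM output that follows the enforced schema.
raw_output = '{"solutions": [{"solution": "Step 1: 16 - 3 - 4 = 9. The answer is: 9"}]}'

parsed = MathShepherdSolutions.model_validate_json(raw_output)
print([s.solution for s in parsed.solutions])
# ['Step 1: 16 - 3 - 4 = 9. The answer is: 9']
```
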

FormatPRM

Bases: Step

Helper step to transform the data into the format expected by the PRM model.

This step can be used to format the data in one of two formats. Following the format presented in peiyi9979/Math-Shepherd, this step creates the columns input and label, where the input is the instruction with the solution (and the tag replaced by a token), and the label is the instruction with the solution, both separated by a newline. Following TRL's format for training, it generates the columns prompt, completions, and labels; the labels correspond to the original tags replaced by boolean values, where True represents correct steps.

Attributes

Name Type Description
format Literal['math-shepherd', 'trl']

The format to use for the PRM model. "math-shepherd" corresponds to the original paper, while "trl" is a format prepared to train the model using TRL.

step_token str

String that serves as a unique token denoting the position for predicting the step score.

tags list[str]

List of tags that represent the correct and incorrect steps. This only needs to be provided if it differs from the default in MathShepherdCompleter.

Input columns
  • instruction (str): The task or instruction.
  • solutions (list[str]): The list of steps with a solution to the task.
Output columns
  • input (str): The instruction with the solutions, where the label tags are replaced by a token.
  • label (str): The instruction with the solutions.
  • prompt (str): The instruction with the solutions, where the label tags are replaced by a token.
  • completions (List[str]): The solution represented as a list of steps.
  • labels (List[bool]): The labels, as a list of booleans, where True represents a good response.
Categories
  • text-manipulation
  • columns
References
  • Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (https://arxiv.org/abs/2312.08935)
  • peiyi9979/Math-Shepherd (https://hugging-face.cn/datasets/peiyi9979/Math-Shepherd?row=0)

Examples

Prepare your data to train a PRM model with the Math-Shepherd format:

from distilabel.steps.tasks import FormatPRM
from distilabel.steps import ExpandColumns

expand_columns = ExpandColumns(columns=["solutions"])
expand_columns.load()

# Define our PRM formatter
formatter = FormatPRM()
formatter.load()

# Expand the solutions column as it comes from the MathShepherdCompleter
result = next(
    expand_columns.process(
        [
            {
                "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                "solutions": [["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"]]
            },
        ]
    )
)
result = next(formatter.process(result))

Prepare your data to train a PRM model with the TRL format:

from distilabel.steps.tasks import FormatPRM
from distilabel.steps import ExpandColumns

expand_columns = ExpandColumns(columns=["solutions"])
expand_columns.load()

# Define our PRM formatter
formatter = FormatPRM(format="trl")
formatter.load()

# Expand the solutions column as it comes from the MathShepherdCompleter
result = next(
    expand_columns.process(
        [
            {
                "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                "solutions": [["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"]]
            },
        ]
    )
)

result = next(formatter.process(result))
# {
#     "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
#     "solutions": [
#         "Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +",
#         "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +",
#         "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"
#     ],
#     "prompt": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
#     "completions": [
#         "Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required.",
#         "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber.",
#         "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3"
#     ],
#     "labels": [
#         true,
#         true,
#         true
#     ]
# }

Citations

```
@misc{wang2024mathshepherdverifyreinforcellms,
    title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},
    author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},
    year={2024},
    eprint={2312.08935},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2312.08935},
}
```
Source code in src/distilabel/steps/tasks/math_shepherd/utils.py
class FormatPRM(Step):
    """Helper step to transform the data into the format expected by the PRM model.

    This step can be used to format the data in one of 2 formats:
    Following the format presented
    in [peiyi9979/Math-Shepherd](https://hugging-face.cn/datasets/peiyi9979/Math-Shepherd?row=0),
    in which case this step creates the columns input and label, where the input is the instruction
    with the solution (and the tag replaced by a token), and the label is the instruction
    with the solution, both separated by a newline.
    Following TRL's format for training, which generates the columns prompt, completions, and labels.
    The labels correspond to the original tags replaced by boolean values, where True represents
    correct steps.

    Attributes:
        format: The format to use for the PRM model.
            "math-shepherd" corresponds to the original paper, while "trl" is a format
            prepared to train the model using TRL.
        step_token: String that serves as a unique token denoting the position
            for predicting the step score.
        tags: List of tags that represent the correct and incorrect steps.
            This only needs to be informed if it's different than the default in
            `MathShepherdCompleter`.

    Input columns:
        - instruction (`str`): The task or instruction.
        - solutions (`list[str]`): List of steps with a solution to the task.

    Output columns:
        - input (`str`): The instruction with the solutions, where the label tags
            are replaced by a token.
        - label (`str`): The instruction with the solutions.
        - prompt (`str`): The instruction with the solutions, where the label tags
            are replaced by a token.
        - completions (`List[str]`): The solution represented as a list of steps.
        - labels (`List[bool]`): The labels, as a list of booleans, where True represents
            a good response.

    Categories:
        - text-manipulation
        - columns

    References:
        - [`Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations`](https://arxiv.org/abs/2312.08935)
        - [peiyi9979/Math-Shepherd](https://hugging-face.cn/datasets/peiyi9979/Math-Shepherd?row=0)

    Examples:
        Prepare your data to train a PRM model with the Math-Shepherd format:

        ```python
        from distilabel.steps.tasks import FormatPRM
        from distilabel.steps import ExpandColumns

        expand_columns = ExpandColumns(columns=["solutions"])
        expand_columns.load()

        # Define our PRM formatter
        formatter = FormatPRM()
        formatter.load()

        # Expand the solutions column as it comes from the MathShepherdCompleter
        result = next(
            expand_columns.process(
                [
                    {
                        "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                        "solutions": [["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"]]
                    },
                ]
            )
        )
        result = next(formatter.process(result))
        ```

        Prepare your data to train a PRM model with the TRL format:

        ```python
        from distilabel.steps.tasks import FormatPRM
        from distilabel.steps import ExpandColumns

        expand_columns = ExpandColumns(columns=["solutions"])
        expand_columns.load()

        # Define our PRM formatter
        formatter = FormatPRM(format="trl")
        formatter.load()

        # Expand the solutions column as it comes from the MathShepherdCompleter
        result = next(
            expand_columns.process(
                [
                    {
                        "instruction": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
                        "solutions": [["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it\'s half that much, we can multiply 2 by 0.5 (which is the same as dividing by 2): 2 * 0.5 = <<2*0.5=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"], ["Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +", "Step 2: Calculate the amount of white fiber needed: Since it\'s half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +", "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"]]
                    },
                ]
            )
        )

        result = next(formatter.process(result))
        # {
        #     "instruction": "Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
        #     "solutions": [
        #         "Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required. +",
        #         "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber. +",
        #         "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3 +"
        #     ],
        #     "prompt": "Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
        #     "completions": [
        #         "Step 1: Determine the amount of blue fiber needed: 2 bolts of blue fiber are required.",
        #         "Step 2: Calculate the amount of white fiber needed: Since it's half that much, we can divide 2 by 2: 2 / 2 = <<2/2=1>>1 bolt of white fiber.",
        #         "Step 3: Add the amount of blue and white fiber: 2 (blue) + 1 (white) = <<2+1=3>>3 bolts of fiber in total. The answer is: 3"
        #     ],
        #     "labels": [
        #         true,
        #         true,
        #         true
        #     ]
        # }
        ```

    Citations:

        ```
        @misc{wang2024mathshepherdverifyreinforcellms,
            title={Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations},
            author={Peiyi Wang and Lei Li and Zhihong Shao and R. X. Xu and Damai Dai and Yifei Li and Deli Chen and Y. Wu and Zhifang Sui},
            year={2024},
            eprint={2312.08935},
            archivePrefix={arXiv},
            primaryClass={cs.AI},
            url={https://arxiv.org/abs/2312.08935},
        }
        ```
    """

    format: Literal["math-shepherd", "trl"] = "math-shepherd"
    step_token: str = "ки"
    tags: list[str] = ["+", "-"]

    def model_post_init(self, __context: Any) -> None:
        super().model_post_init(__context)
        if self.format == "math-shepherd":
            self._formatter = self._format_math_shepherd
        else:
            self._formatter = self._format_trl

    @property
    def inputs(self) -> "StepColumns":
        return ["instruction", "solutions"]

    @property
    def outputs(self) -> "StepColumns":
        if self.format == "math-shepherd":
            return ["input", "label"]
        return ["prompt", "completions", "labels"]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """The process prepares the data for the `APIGenGenerator` task.

        If a single example is provided, it is copied to avoid raising an error.

        Args:
            inputs: A list of dictionaries with the input data.

        Yields:
            A list of dictionaries with the output data.
        """
        for input in inputs:
            self._formatter(input)

        yield inputs  # type: ignore

    def _format_math_shepherd(
        self, input: dict[str, str]
    ) -> dict[str, Union[str, list[str]]]:
        instruction = input["instruction"]
        replaced = []
        # At this stage, the "solutions" column can only contain a single solution,
        # and the last item of each solution is the tag.
        solution = input["solutions"]
        for step in solution:
            # Check there's a string, because the step that generated
            # the solutions could have failed, and we would have an empty list.
            replaced.append(step[:-1] + self.step_token if len(step) > 1 else step)

        input["input"] = instruction + " " + "\n".join(replaced)
        input["label"] = instruction + " " + "\n".join(solution)

        return input  # type: ignore

    def _format_trl(
        self, input: dict[str, str]
    ) -> dict[str, Union[str, list[str], list[bool]]]:
        input["prompt"] = input["instruction"]
        completions: list[str] = []
        labels: list[bool] = []
        for step in input["solutions"]:
            token = step[-1]
            completions.append(step[:-1].strip())
            labels.append(True if token == self.tags[0] else False)

        input["completions"] = completions  # type: ignore
        input["labels"] = labels  # type: ignore

        return input  # type: ignore
process(inputs)

The process prepares the data for the APIGenGenerator task.

If a single example is provided, it is copied to avoid raising an error.

Parameters
  • inputs (StepInput), required: A list of dictionaries with the input data.

Yields
  • StepOutput: A list of dictionaries with the output data.

Source code in src/distilabel/steps/tasks/math_shepherd/utils.py
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """The process prepares the data for the `APIGenGenerator` task.

    If a single example is provided, it is copied to avoid raising an error.

    Args:
        inputs: A list of dictionaries with the input data.

    Yields:
        A list of dictionaries with the output data.
    """
    for input in inputs:
        self._formatter(input)

    yield inputs  # type: ignore

PairRM

Bases: Step

Rank the candidates based on the input using the LLM model.

Attributes

Name Type Description
model str

The model to use for the ranking. Defaults to "llm-blender/PairRM".

instructions Optional[str]

The instructions to use for the model. Defaults to None.

Input columns
  • inputs (List[Dict[str, Any]]): The input text or conversation to rank the candidates for.
  • candidates (List[Dict[str, Any]]): The candidates to rank.
Output columns
  • ranks (List[int]): The ranks of the candidates based on the input.
  • ranked_candidates (List[Dict[str, Any]]): The candidates ranked based on the input.
  • model_name (str): The name of the model used to rank the candidate responses. Defaults to "llm-blender/PairRM".
References
  • LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion (https://arxiv.org/abs/2306.02561)
  • Pair Ranking Model (https://hugging-face.cn/llm-blender/PairRM)
Categories
  • preference
Note

This step differs from other tasks since there is currently a single implementation of this model, and a specific LLM is used.

Examples

Rank LLM candidates:

from distilabel.steps.tasks import PairRM

# Consider this as a placeholder for your actual LLM.
pair_rm = PairRM()

pair_rm.load()

result = next(
    pair_rm.process(
        [
            {"input": "Hello, how are you?", "candidates": ["fine", "good", "bad"]},
        ]
    )
)
# result
# [
#     {
#         'input': 'Hello, how are you?',
#         'candidates': ['fine', 'good', 'bad'],
#         'ranks': [2, 1, 3],
#         'ranked_candidates': ['good', 'fine', 'bad'],
#         'model_name': 'llm-blender/PairRM',
#     }
# ]
Citations
@misc{jiang2023llmblenderensemblinglargelanguage,
    title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},
    author={Dongfu Jiang and Xiang Ren and Bill Yuchen Lin},
    year={2023},
    eprint={2306.02561},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2306.02561},
}
Source code in src/distilabel/steps/tasks/pair_rm.py
class PairRM(Step):
    """Rank the candidates based on the input using the `LLM` model.

    Attributes:
        model: The model to use for the ranking. Defaults to `"llm-blender/PairRM"`.
        instructions: The instructions to use for the model. Defaults to `None`.

    Input columns:
        - inputs (`List[Dict[str, Any]]`): The input text or conversation to rank the candidates for.
        - candidates (`List[Dict[str, Any]]`): The candidates to rank.

    Output columns:
        - ranks (`List[int]`): The ranks of the candidates based on the input.
        - ranked_candidates (`List[Dict[str, Any]]`): The candidates ranked based on the input.
        - model_name (`str`): The model name used to rank the candidate responses. Defaults to `"llm-blender/PairRM"`.

    References:
        - [LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion](https://arxiv.org/abs/2306.02561).
        - [Pair Ranking Model](https://hugging-face.cn/llm-blender/PairRM).

    Categories:
        - preference

    Note:
        This step differs to other tasks as there is a single implementation of this model
        currently, and we will use a specific `LLM`.

    Examples:
        Rank LLM candidates:

        ```python
        from distilabel.steps.tasks import PairRM

        # Consider this as a placeholder for your actual LLM.
        pair_rm = PairRM()

        pair_rm.load()

        result = next(
            pair_rm.process(
                [
                    {"input": "Hello, how are you?", "candidates": ["fine", "good", "bad"]},
                ]
            )
        )
        # result
        # [
        #     {
        #         'input': 'Hello, how are you?',
        #         'candidates': ['fine', 'good', 'bad'],
        #         'ranks': [2, 1, 3],
        #         'ranked_candidates': ['good', 'fine', 'bad'],
        #         'model_name': 'llm-blender/PairRM',
        #     }
        # ]
        ```

    Citations:
        ```
        @misc{jiang2023llmblenderensemblinglargelanguage,
            title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},
            author={Dongfu Jiang and Xiang Ren and Bill Yuchen Lin},
            year={2023},
            eprint={2306.02561},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2306.02561},
        }
        ```
    """

    model: str = "llm-blender/PairRM"
    instructions: Optional[str] = None

    def load(self) -> None:
        """Loads the PairRM model provided via `model` with `llm_blender.Blender`, which is the
        custom library for running the inference for the PairRM models."""
        try:
            import llm_blender
        except ImportError as e:
            raise ImportError(
                "The `llm_blender` package is required to use the `PairRM` class."
                "Please install it with `pip install git+https://github.com/yuchenlin/LLM-Blender.git`."
            ) from e

        self._blender = llm_blender.Blender()
        self._blender.loadranker(self.model)

    @property
    def inputs(self) -> "StepColumns":
        """The input columns correspond to the two required arguments from `Blender.rank`:
        `inputs` and `candidates`."""
        return ["input", "candidates"]

    @property
    def outputs(self) -> "StepColumns":
        """The outputs will include the `ranks` and the `ranked_candidates`."""
        return ["ranks", "ranked_candidates", "model_name"]

    def format_input(self, input: Dict[str, Any]) -> Dict[str, Any]:
        """The input is expected to be a dictionary with the keys `input` and `candidates`,
        where the `input` corresponds to the instruction of a model and `candidates` are a
        list of responses to be ranked.
        """
        return {"input": input["input"], "candidates": input["candidates"]}

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        """Generates the ranks for the candidates based on the input.

        The ranks are the positions of the candidates, where lower is better,
        and the ranked candidates correspond to the candidates sorted according to the
        ranks obtained.

        Args:
            inputs: A list of Python dictionaries with the inputs of the task.

        Yields:
            An iterator with the inputs containing the `ranks`, `ranked_candidates`, and `model_name`.
        """
        input_texts = []
        candidates = []
        for input in inputs:
            formatted_input = self.format_input(input)
            input_texts.append(formatted_input["input"])
            candidates.append(formatted_input["candidates"])

        instructions = (
            [self.instructions] * len(input_texts) if self.instructions else None
        )

        ranks = self._blender.rank(
            input_texts,
            candidates,
            instructions=instructions,
            return_scores=False,
            batch_size=self.input_batch_size,
        )
        # Sort the candidates based on the ranks
        ranked_candidates = np.take_along_axis(
            np.array(candidates), ranks - 1, axis=1
        ).tolist()
        ranks = ranks.tolist()
        for input, rank, ranked_candidate in zip(inputs, ranks, ranked_candidates):
            input["ranks"] = rank
            input["ranked_candidates"] = ranked_candidate
            input["model_name"] = self.model

        yield inputs
inputs property

The input columns correspond to the two required arguments of Blender.rank: inputs and candidates.

outputs property

The outputs will include the ranks and the ranked_candidates.

load()

Loads the PairRM model provided via model with llm_blender.Blender, which is the custom library for running inference with the PairRM models.

Source code in src/distilabel/steps/tasks/pair_rm.py
def load(self) -> None:
    """Loads the PairRM model provided via `model` with `llm_blender.Blender`, which is the
    custom library for running the inference for the PairRM models."""
    try:
        import llm_blender
    except ImportError as e:
        raise ImportError(
            "The `llm_blender` package is required to use the `PairRM` class."
            "Please install it with `pip install git+https://github.com/yuchenlin/LLM-Blender.git`."
        ) from e

    self._blender = llm_blender.Blender()
    self._blender.loadranker(self.model)
format_input(input)

The input is expected to be a dictionary with the keys input and candidates, where input corresponds to the instruction for a model and candidates is a list of responses to be ranked.

Source code in src/distilabel/steps/tasks/pair_rm.py
def format_input(self, input: Dict[str, Any]) -> Dict[str, Any]:
    """The input is expected to be a dictionary with the keys `input` and `candidates`,
    where the `input` corresponds to the instruction of a model and `candidates` are a
    list of responses to be ranked.
    """
    return {"input": input["input"], "candidates": input["candidates"]}
process(inputs)

Generates the ranks for the candidates based on the input.

The ranks are the positions of the candidates, where lower is better, and the ranked candidates correspond to the candidates sorted according to the ranks obtained.

Parameters
  • inputs (StepInput), required: A list of Python dictionaries with the inputs of the task.

Yields
  • StepOutput: An iterator with the inputs containing the ranks, ranked_candidates, and model_name.

Source code in src/distilabel/steps/tasks/pair_rm.py
def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
    """Generates the ranks for the candidates based on the input.

    The ranks are the positions of the candidates, where lower is better,
    and the ranked candidates correspond to the candidates sorted according to the
    ranks obtained.

    Args:
        inputs: A list of Python dictionaries with the inputs of the task.

    Yields:
        An iterator with the inputs containing the `ranks`, `ranked_candidates`, and `model_name`.
    """
    input_texts = []
    candidates = []
    for input in inputs:
        formatted_input = self.format_input(input)
        input_texts.append(formatted_input["input"])
        candidates.append(formatted_input["candidates"])

    instructions = (
        [self.instructions] * len(input_texts) if self.instructions else None
    )

    ranks = self._blender.rank(
        input_texts,
        candidates,
        instructions=instructions,
        return_scores=False,
        batch_size=self.input_batch_size,
    )
    # Sort the candidates based on the ranks
    ranked_candidates = np.take_along_axis(
        np.array(candidates), ranks - 1, axis=1
    ).tolist()
    ranks = ranks.tolist()
    for input, rank, ranked_candidate in zip(inputs, ranks, ranked_candidates):
        input["ranks"] = rank
        input["ranked_candidates"] = ranked_candidate
        input["model_name"] = self.model

    yield inputs
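
下面给出一个端到端的最小用法示意，把上面的 inputs、process 与 outputs 串起来。需要先安装 llm_blender；其中的模型名 "llm-blender/PairRM" 与示例数据仅作说明（并非库保证的默认配置），请以实际环境为准。

from distilabel.steps.tasks import PairRM

# 最小示意（假设已安装 llm_blender）
ranker = PairRM(model="llm-blender/PairRM")
ranker.load()

result = next(
    ranker.process(
        [
            {
                "input": "What is the capital of Spain?",
                "candidates": ["Madrid is the capital.", "Barcelona.", "I don't know."],
            }
        ]
    )
)
# 每一行都会在原有列之外追加 `ranks`、`ranked_candidates` 和 `model_name`。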

PrometheusEval

基类:Task

使用 Prometheus 2.0 评判和排名 LLM 生成的质量。

PrometheusEval 是为 Prometheus 2.0 创建的任务，涵盖绝对评估和相对评估两种模式。绝对评估（mode="absolute"）用于评估 LLM 针对给定指令的单个生成结果；相对评估（mode="relative"）用于比较 LLM 针对给定指令的两个生成结果。两种评估都可以通过 reference 属性选择是否提供参考答案进行比较，并且都基于评分标准对生成结果进行评判，默认提供以下方面的标准：helpfulness、harmlessness、honesty、factual-validity 和 reasoning。这些标准可以通过 rubrics 覆盖，所选标准则通过 rubric 属性设置。

注意

PrometheusEval 任务更适合且旨在与 Kaist AI 发布的任何 Prometheus 2.0 模型一起使用,即:https://hugging-face.cn/prometheus-eval/prometheus-7b-v2.0 和 https://hugging-face.cn/prometheus-eval/prometheus-8x7b-v2.0。如果使用其他模型,则无法保证评判评估的格式和质量,尽管某些其他模型也可能能够正确遵循格式并生成有见地的评判。

属性

名称 类型 描述
mode Literal['absolute', 'relative']

要使用的评估模式,absoluterelative。它定义了任务将评估一个还是两个生成结果。

rubric str

要在提示中使用的评分标准,以基于不同方面运行评判。可以是 rubrics 属性中任何现有的键,默认情况下,这意味着它可以是:helpfulnessharmlessnesshonestyfactual-validityreasoning。这些仅在使用默认 rubrics 时才有效,否则,应使用提供的 rubrics

rubrics Optional[Dict[str, str]]

包含用于评判的不同标准的字典,其中键是标准名称,值是标准描述。默认标准如下:helpfulnessharmlessnesshonestyfactual-validityreasoning

reference bool

一个布尔标志,指示是否将提供参考答案/补全,以便模型评判基于与其的比较。这意味着除了其余输入之外,还需要在输入数据中提供列 reference

_template Union[Template, None]

用于格式化 LLM 输入的 Jinja2 模板。

输入列
  • instruction (str): 用作参考的指令。
  • generation (str, optional): 来自给定 instruction 的生成文本。如果 mode=absolute,则此列是必需的。
  • generations (List[str], optional): 来自给定 instruction 的生成文本。它应仅包含 2 个生成结果。如果 mode=relative,则此列是必需的。
  • reference (str, optional): 用于 LLM 与之比较的 instruction 的参考答案/黄金答案。
输出列
  • feedback (str): 反馈,解释了以下结果,由 LLM 使用预定义的评分标准评判,并与提供的 reference 进行比较(如果提供)。
  • result (Union[int, Literal["A", "B"]]): 如果 mode=absolute,则结果包含 generation 的评分,采用 1-5 的李克特量表;否则,如果 mode=relative,则结果包含 “A” 或 “B”,“获胜”者是 generations 索引 0 中的生成结果(如果 result='A')或索引 1 中的生成结果(如果 result='B')。
  • model_name (str): 用于生成 feedbackresult 的模型名称。
类别
  • critique
  • preference
参考
  • Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (https://arxiv.org/abs/2405.01535)
  • prometheus-eval: Evaluate your LLM's response with Prometheus 💯 (https://github.com/prometheus-eval/prometheus-eval)

示例

使用 Prometheus 2.0 评判和评估 LLM 生成质量

from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]"content" }}\n{{ messages[1]"content" }}[/INST]",
    ),
    mode="absolute",
    rubric="factual-validity"
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {"instruction": "make something", "generation": "something done"},
        ]
    )
)
# result
# [
#     {
#         'instruction': 'make something',
#         'generation': 'something done',
#         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
#         'feedback': 'the feedback',
#         'result': 5,
#     }
# ]

相对评估的评判

from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]"content" }}\n{{ messages[1]"content" }}[/INST]",
    ),
    mode="relative",
    rubric="honesty"
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {"instruction": "make something", "generations": ["something done", "other thing"]},
        ]
    )
)
# result
# [
#     {
#         'instruction': 'make something',
#         'generations': ['something done', 'other thing'],
#         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
#         'feedback': 'the feedback',
#         'result': 'A',
#     }
# ]

使用自定义标准进行评判

from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]"content" }}\n{{ messages[1]"content" }}[/INST]",
    ),
    mode="absolute",
    rubric="custom",
    rubrics={
        "custom": "[A]\nScore 1: A\nScore 2: B\nScore 3: C\nScore 4: D\nScore 5: E"
    }
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {"instruction": "make something", "generation": "something done"},
        ]
    )
)
# result
# [
#     {
#         'instruction': 'make something',
#         'generation': 'something done',
#         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
#         'feedback': 'the feedback',
#         'result': 5,
#     }
# ]

使用参考答案进行评判

from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]"content" }}\n{{ messages[1]"content" }}[/INST]",
    ),
    mode="absolute",
    rubric="helpfulness",
    reference=True,
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {
                "instruction": "make something",
                "generation": "something done",
                "reference": "this is a reference answer",
            },
        ]
    )
)
# result
# [
#     {
#         'instruction': 'make something',
#         'generation': 'something done',
#         'reference': 'this is a reference answer',
#         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
#         'feedback': 'the feedback',
#         'result': 5,
#     }
# ]
引用
@misc{kim2024prometheus2opensource,
    title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
    author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},
    year={2024},
    eprint={2405.01535},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2405.01535},
}
源代码位于 src/distilabel/steps/tasks/prometheus_eval.py
class PrometheusEval(Task):
    """Critique and rank the quality of generations from an `LLM` using Prometheus 2.0.

    `PrometheusEval` is a task created for Prometheus 2.0, covering both the absolute and relative
    evaluations. The absolute evaluation i.e. `mode="absolute"` is used to evaluate a single generation from
    an LLM for a given instruction. The relative evaluation i.e. `mode="relative"` is used to evaluate two generations from an LLM
    for a given instruction.
    Both evaluations provide the possibility of using a reference answer to compare with or without
    the `reference` attribute, and both are based on a score rubric that critiques the generation/s
    based on the following default aspects: `helpfulness`, `harmlessness`, `honesty`, `factual-validity`,
    and `reasoning`, that can be overridden via `rubrics`, and the selected rubric is set via the attribute
    `rubric`.

    Note:
        The `PrometheusEval` task is better suited and intended to be used with any of the Prometheus 2.0
        models released by Kaist AI, being: https://hugging-face.cn/prometheus-eval/prometheus-7b-v2.0,
        and https://hugging-face.cn/prometheus-eval/prometheus-8x7b-v2.0. The critique assessment formatting
        and quality is not guaranteed if using another model, even though some other models may be able to
        correctly follow the formatting and generate insightful critiques too.

    Attributes:
        mode: the evaluation mode to use, either `absolute` or `relative`. It defines whether the task
            will evaluate one or two generations.
        rubric: the score rubric to use within the prompt to run the critique based on different aspects.
            Can be any existing key in the `rubrics` attribute, which by default means that it can be:
            `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, or `reasoning`. Those will only
            work if using the default `rubrics`, otherwise, the provided `rubrics` should be used.
        rubrics: a dictionary containing the different rubrics to use for the critique, where the keys are
            the rubric names and the values are the rubric descriptions. The default rubrics are the following:
            `helpfulness`, `harmlessness`, `honesty`, `factual-validity`, and `reasoning`.
        reference: a boolean flag to indicate whether a reference answer / completion will be provided, so
            that the model critique is based on the comparison with it. It implies that the column `reference`
            needs to be provided within the input data in addition to the rest of the inputs.
        _template: a Jinja2 template used to format the input for the LLM.

    Input columns:
        - instruction (`str`): The instruction to use as reference.
        - generation (`str`, optional): The generated text from the given `instruction`. This column is required
            if `mode=absolute`.
        - generations (`List[str]`, optional): The generated texts from the given `instruction`. It should
            contain 2 generations only. This column is required if `mode=relative`.
        - reference (`str`, optional): The reference / golden answer for the `instruction`, to be used by the LLM
            for comparison against.

    Output columns:
        - feedback (`str`): The feedback explaining the result below, as critiqued by the LLM using the
            pre-defined score rubric, compared against `reference` if provided.
        - result (`Union[int, Literal["A", "B"]]`): If `mode=absolute`, then the result contains the score for the
            `generation` in a likert-scale from 1-5, otherwise, if `mode=relative`, then the result contains either
            "A" or "B", the "winning" one being the generation in the index 0 of `generations` if `result='A'` or the
            index 1 if `result='B'`.
        - model_name (`str`): The model name used to generate the `feedback` and `result`.

    Categories:
        - critique
        - preference

    References:
        - [Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models](https://arxiv.org/abs/2405.01535)
        - [prometheus-eval: Evaluate your LLM's response with Prometheus 💯](https://github.com/prometheus-eval/prometheus-eval)

    Examples:
        Critique and evaluate LLM generation quality using Prometheus 2_0:

        ```python
        from distilabel.steps.tasks import PrometheusEval
        from distilabel.models import vLLM

        # Consider this as a placeholder for your actual LLM.
        prometheus = PrometheusEval(
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]",
            ),
            mode="absolute",
            rubric="factual-validity"
        )

        prometheus.load()

        result = next(
            prometheus.process(
                [
                    {"instruction": "make something", "generation": "something done"},
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'make something',
        #         'generation': 'something done',
        #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
        #         'feedback': 'the feedback',
        #         'result': 6,
        #     }
        # ]
        ```

        Critique for relative evaluation:

        ```python
        from distilabel.steps.tasks import PrometheusEval
        from distilabel.models import vLLM

        # Consider this as a placeholder for your actual LLM.
        prometheus = PrometheusEval(
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]",
            ),
            mode="relative",
            rubric="honesty"
        )

        prometheus.load()

        result = next(
            prometheus.process(
                [
                    {"instruction": "make something", "generations": ["something done", "other thing"]},
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'make something',
        #         'generations': ['something done', 'other thing'],
        #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
        #         'feedback': 'the feedback',
        #         'result': 'something done',
        #     }
        # ]
        ```

        Critique with a custom rubric:

        ```python
        from distilabel.steps.tasks import PrometheusEval
        from distilabel.models import vLLM

        # Consider this as a placeholder for your actual LLM.
        prometheus = PrometheusEval(
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]",
            ),
            mode="absolute",
            rubric="custom",
            rubrics={
                "custom": "[A]\\nScore 1: A\\nScore 2: B\\nScore 3: C\\nScore 4: D\\nScore 5: E"
            }
        )

        prometheus.load()

        result = next(
            prometheus.process(
                [
                    {"instruction": "make something", "generation": "something done"},
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'make something',
        #         'generation': 'something done',
        #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
        #         'feedback': 'the feedback',
        #         'result': 6,
        #     }
        # ]
        ```

        Critique using a reference answer:

        ```python
        from distilabel.steps.tasks import PrometheusEval
        from distilabel.models import vLLM

        # Consider this as a placeholder for your actual LLM.
        prometheus = PrometheusEval(
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]\"content\" }}\\n{{ messages[1]\"content\" }}[/INST]",
            ),
            mode="absolute",
            rubric="helpfulness",
            reference=True,
        )

        prometheus.load()

        result = next(
            prometheus.process(
                [
                    {
                        "instruction": "make something",
                        "generation": "something done",
                        "reference": "this is a reference answer",
                    },
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'make something',
        #         'generation': 'something done',
        #         'reference': 'this is a reference answer',
        #         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
        #         'feedback': 'the feedback',
        #         'result': 6,
        #     }
        # ]
        ```

    Citations:
        ```
        @misc{kim2024prometheus2opensource,
            title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
            author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},
            year={2024},
            eprint={2405.01535},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2405.01535},
        }
        ```
    """

    mode: Literal["absolute", "relative"]
    rubric: str
    rubrics: Optional[Dict[str, str]] = Field(default=_DEFAULT_RUBRICS)
    reference: bool = False

    _template: Union[Template, None] = PrivateAttr(...)

    @model_validator(mode="after")
    def validate_rubric_and_rubrics(self) -> Self:
        if not isinstance(self.rubrics, dict) or len(self.rubrics) < 1:
            raise DistilabelUserError(
                "Provided `rubrics` must be a Python dictionary with string keys and string values.",
                page="components-gallery/tasks/prometheuseval/",
            )

        def rubric_matches_pattern(rubric: str) -> bool:
            """Checks if the provided rubric matches the pattern of the default rubrics."""
            pattern = r"^\[.*?\]\n(?:Score [1-4]: .*?\n){4}(?:Score 5: .*?)"
            return bool(re.match(pattern, rubric, re.MULTILINE))

        if not all(rubric_matches_pattern(value) for value in self.rubrics.values()):
            raise DistilabelUserError(
                "Provided rubrics should match the format of the default rubrics, which"
                " is as follows: `[<scoring criteria>]\nScore 1: <description>\nScore 2: <description>\n"
                "Score 3: <description>\nScore 4: <description>\nScore 5: <description>`; replacing"
                " `<scoring criteria>` and `<description>` with the actual criteria and description"
                " for each or the scores, respectively.",
                page="components-gallery/tasks/prometheuseval/",
            )

        if self.rubric not in self.rubrics:
            raise DistilabelUserError(
                f"Provided rubric '{self.rubric}' is not among the available rubrics: {', '.join(self.rubrics.keys())}.",
                page="components-gallery/tasks/prometheuseval/",
            )

        return self

    def load(self) -> None:
        """Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation
        depending on the `mode` value, and either with or without reference, depending on the
        value of `reference`."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "prometheus"
            / (
                f"{self.mode}_without_reference.jinja2"
                if self.reference is False
                else f"{self.mode}_with_reference.jinja2"
            )
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The default inputs for the task are the `instruction` and the `generation`
        if `reference=False`, otherwise, the inputs are `instruction`, `generation`, and
        `reference`."""
        if self.mode == "absolute":
            if self.reference:
                return ["instruction", "generation", "reference"]
            return ["instruction", "generation"]
        else:
            if self.reference:
                return ["instruction", "generations", "reference"]
            return ["instruction", "generations"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` where the prompt is formatted according
        to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction
        from the user, including a pre-defined system prompt."""
        template_kwargs = {
            "instruction": input["instruction"],
            "rubric": self.rubrics[self.rubric],
        }
        if self.reference:
            template_kwargs["reference"] = input["reference"]

        if self.mode == "absolute":
            if not isinstance(input["generation"], str):
                raise DistilabelUserError(
                    f"Provided `generation` is of type {type(input['generation'])} but a string"
                    " should be provided instead.",
                    page="components-gallery/tasks/prometheuseval/",
                )

            template_kwargs["generation"] = input["generation"]
            system_message = (
                "You are a fair judge assistant tasked with providing clear, objective feedback based"
                " on specific criteria, ensuring each assessment reflects the absolute standards set"
                " for performance."
            )
        else:  # self.mode == "relative"
            if (
                not isinstance(input["generations"], list)
                or not all(
                    isinstance(generation, str) for generation in input["generations"]
                )
                or len(input["generations"]) != 2
            ):
                raise DistilabelUserError(
                    f"Provided `generations` is of type {type(input['generations'])} but a list of strings with length 2 should be provided instead.",
                    page="components-gallery/tasks/prometheuseval/",
                )

            template_kwargs["generations"] = input["generations"]
            system_message = (
                "You are a fair judge assistant assigned to deliver insightful feedback that compares"
                " individual performances, highlighting how each stands relative to others within the"
                " same cohort."
            )

        return [
            {
                "role": "system",
                "content": system_message,
            },
            {
                "role": "user",
                "content": self._template.render(**template_kwargs),  # type: ignore
            },
        ]

    @property
    def outputs(self) -> List[str]:
        """The output for the task are the `feedback` and the `result` generated by Prometheus,
        as well as the `model_name` which is automatically included based on the `LLM` used.
        """
        return ["feedback", "result", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dict with the keys `feedback` and `result` captured
        using a regex from the Prometheus output.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Optionally provided in case it's useful to build the output.

        Returns:
            A dict with the keys `feedback` and `result` generated by the LLM.
        """
        if output is None:
            return {"feedback": None, "result": None}

        parts = output.split("[RESULT]")
        if len(parts) != 2:
            return {"feedback": None, "result": None}

        feedback, result = parts[0].strip(), parts[1].strip()
        if feedback.startswith("Feedback:"):
            feedback = feedback[len("Feedback:") :].strip()
        if self.mode == "absolute":
            if not result.isdigit() or result not in ["1", "2", "3", "4", "5"]:
                return {"feedback": None, "result": None}
            return {"feedback": feedback, "result": int(result)}
        else:  # self.mode == "relative"
            if result not in ["A", "B"]:
                return {"feedback": None, "result": None}
            return {"feedback": feedback, "result": result}
inputs property

当 mode="absolute" 时，任务的输入是 instruction 和 generation；当 mode="relative" 时，输入是 instruction 和 generations。如果 reference=True，则还需要额外提供 reference 列。

outputs property

任务的输出是由 Prometheus 生成的 feedbackresult,以及根据使用的 LLM 自动包含的 model_name

load()

加载 Prometheus 2.0 的 Jinja2 模板,用于绝对或相对评估,具体取决于 mode 值,以及是否使用参考,具体取决于 reference 的值。

源代码位于 src/distilabel/steps/tasks/prometheus_eval.py
def load(self) -> None:
    """Loads the Jinja2 template for Prometheus 2.0 either absolute or relative evaluation
    depending on the `mode` value, and either with or without reference, depending on the
    value of `reference`."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "prometheus"
        / (
            f"{self.mode}_without_reference.jinja2"
            if self.reference is False
            else f"{self.mode}_with_reference.jinja2"
        )
    )

    self._template = Template(open(_path).read())
format_input(input)

输入格式化为 ChatType,其中提示根据为 Prometheus 2.0 选择的 Jinja2 模板进行格式化,假设这是用户的第一次交互,包括预定义的系统提示。

源代码位于 src/distilabel/steps/tasks/prometheus_eval.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` where the prompt is formatted according
    to the selected Jinja2 template for Prometheus 2.0, assuming that's the first interaction
    from the user, including a pre-defined system prompt."""
    template_kwargs = {
        "instruction": input["instruction"],
        "rubric": self.rubrics[self.rubric],
    }
    if self.reference:
        template_kwargs["reference"] = input["reference"]

    if self.mode == "absolute":
        if not isinstance(input["generation"], str):
            raise DistilabelUserError(
                f"Provided `generation` is of type {type(input['generation'])} but a string"
                " should be provided instead.",
                page="components-gallery/tasks/prometheuseval/",
            )

        template_kwargs["generation"] = input["generation"]
        system_message = (
            "You are a fair judge assistant tasked with providing clear, objective feedback based"
            " on specific criteria, ensuring each assessment reflects the absolute standards set"
            " for performance."
        )
    else:  # self.mode == "relative"
        if (
            not isinstance(input["generations"], list)
            or not all(
                isinstance(generation, str) for generation in input["generations"]
            )
            or len(input["generations"]) != 2
        ):
            raise DistilabelUserError(
                f"Provided `generations` is of type {type(input['generations'])} but a list of strings with length 2 should be provided instead.",
                page="components-gallery/tasks/prometheuseval/",
            )

        template_kwargs["generations"] = input["generations"]
        system_message = (
            "You are a fair judge assistant assigned to deliver insightful feedback that compares"
            " individual performances, highlighting how each stands relative to others within the"
            " same cohort."
        )

    return [
        {
            "role": "system",
            "content": system_message,
        },
        {
            "role": "user",
            "content": self._template.render(**template_kwargs),  # type: ignore
        },
    ]
format_output(output, input)

输出格式化为字典,其中键 feedbackresult 是使用正则表达式从 Prometheus 输出中捕获的。

参数
  • output (Union[str, None]，必需): LLM 的原始输出。
  • input (Dict[str, Any]，必需): 任务的输入。可选提供，以防对构建输出有用。

返回
  • Dict[str, Any]: 包含由 LLM 生成的键 feedback 和 result 的字典。

源代码位于 src/distilabel/steps/tasks/prometheus_eval.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dict with the keys `feedback` and `result` captured
    using a regex from the Prometheus output.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Optionally provided in case it's useful to build the output.

    Returns:
        A dict with the keys `feedback` and `result` generated by the LLM.
    """
    if output is None:
        return {"feedback": None, "result": None}

    parts = output.split("[RESULT]")
    if len(parts) != 2:
        return {"feedback": None, "result": None}

    feedback, result = parts[0].strip(), parts[1].strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:") :].strip()
    if self.mode == "absolute":
        if not result.isdigit() or result not in ["1", "2", "3", "4", "5"]:
            return {"feedback": None, "result": None}
        return {"feedback": feedback, "result": int(result)}
    else:  # self.mode == "relative"
        if result not in ["A", "B"]:
            return {"feedback": None, "result": None}
        return {"feedback": feedback, "result": result}

QualityScorer

基类:Task

使用 LLM 根据响应质量对其进行评分。

QualityScorer 是一个预定义的任务,它将 instruction 定义为输入,将 score 定义为输出。此任务用于评估指令和响应的质量。它是论文“What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning”中质量评分任务的实现。该任务遵循与 Complexity Scorer 相同的方案,但指令-响应对在质量方面进行评分,从而获得每个指令的质量分数。

属性

名称 类型 描述
_template Union[Template, None]

用于格式化 LLM 输入的 Jinja2 模板。

输入列
  • instruction (str): 用于生成 responses 的指令。
  • responses (List[str]): 要评分的响应。每个响应都与指令形成一对。
输出列
  • scores (List[float]): 每个指令的分数。
  • model_name (str): 用于生成分数的模型名称。
类别
  • 评分器
  • quality
  • response
参考
  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning (https://arxiv.org/abs/2312.15685)

示例

评估您的指令的质量

from distilabel.steps.tasks import QualityScorer
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
scorer = QualityScorer(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

scorer.load()

result = next(
    scorer.process(
        [
            {
                "instruction": "instruction",
                "responses": ["good response", "weird response", "bad response"]
            }
        ]
    )
)
# result
[
    {
        'instruction': 'instruction',
        'model_name': 'test',
        'scores': [5, 3, 1],
    }
]

使用默认模式生成结构化输出

from distilabel.steps.tasks import QualityScorer
from distilabel.models import InferenceEndpointsLLM

scorer = QualityScorer(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    use_default_structured_output=True
)

scorer.load()

result = next(
    scorer.process(
        [
            {
                "instruction": "instruction",
                "responses": ["good response", "weird response", "bad response"]
            }
        ]
    )
)

# result
[{'instruction': 'instruction',
'responses': ['good response', 'weird response', 'bad response'],
'scores': [1, 2, 3],
'distilabel_metadata': {'raw_output_quality_scorer_0': '{  "scores": [1, 2, 3] }'},
'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
引用
@misc{liu2024makesgooddataalignment,
    title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
    author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
    year={2024},
    eprint={2312.15685},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2312.15685},
}
源代码位于 src/distilabel/steps/tasks/quality_scorer.py
class QualityScorer(Task):
    """Score responses based on their quality using an `LLM`.

    `QualityScorer` is a pre-defined task that defines the `instruction` as the input
    and `score` as the output. This task is used to rate the quality of instructions and responses.
    It's an implementation of the quality score task from the paper 'What Makes Good Data
    for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning'.
    The task follows the same scheme as the Complexity Scorer, but the instruction-response pairs
    are scored in terms of quality, obtaining a quality score for each instruction.

    Attributes:
        _template: a Jinja2 template used to format the input for the LLM.

    Input columns:
        - instruction (`str`): The instruction that was used to generate the `responses`.
        - responses (`List[str]`): The responses to be scored. Each response forms a pair with the instruction.

    Output columns:
        - scores (`List[float]`): The score for each instruction.
        - model_name (`str`): The model name used to generate the scores.

    Categories:
        - scorer
        - quality
        - response

    References:
        - [`What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning`](https://arxiv.org/abs/2312.15685)

    Examples:
        Evaluate the quality of your instructions:

        ```python
        from distilabel.steps.tasks import QualityScorer
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        scorer = QualityScorer(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            )
        )

        scorer.load()

        result = next(
            scorer.process(
                [
                    {
                        "instruction": "instruction",
                        "responses": ["good response", "weird response", "bad response"]
                    }
                ]
            )
        )
        # result
        [
            {
                'instructions': 'instruction',
                'model_name': 'test',
                'scores': [5, 3, 1],
            }
        ]
        ```

        Generate structured output with default schema:

        ```python
        from distilabel.steps.tasks import QualityScorer
        from distilabel.models import InferenceEndpointsLLM

        scorer = QualityScorer(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            use_default_structured_output=True
        )

        scorer.load()

        result = next(
            scorer.process(
                [
                    {
                        "instruction": "instruction",
                        "responses": ["good response", "weird response", "bad response"]
                    }
                ]
            )
        )

        # result
        [{'instruction': 'instruction',
        'responses': ['good response', 'weird response', 'bad response'],
        'scores': [1, 2, 3],
        'distilabel_metadata': {'raw_output_quality_scorer_0': '{  "scores": [1, 2, 3] }'},
        'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
        ```

    Citations:
        ```
        @misc{liu2024makesgooddataalignment,
            title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},
            author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
            year={2024},
            eprint={2312.15685},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2312.15685},
        }
        ```
    """

    _template: Union[Template, None] = PrivateAttr(...)
    _can_be_used_with_offline_batch_generation = True

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "quality-scorer.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task are `instruction` and `responses`."""
        return ["instruction", "responses"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    instruction=input["instruction"], responses=input["responses"]
                ),
            }
        ]

    @property
    def outputs(self):
        """The output for the task is a list of `scores` containing the quality score for each
        response in `responses`."""
        return ["scores", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the score of each instruction-response pair.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict with the key `scores` containing the scores for each instruction-response pair.
        """
        if output is None:
            return {"scores": [None] * len(input["responses"])}

        if self.use_default_structured_output:
            return self._format_structured_output(output, input)

        scores = []
        score_lines = output.split("\n")

        for i, line in enumerate(score_lines):
            match = _PARSE_SCORE_LINE_REGEX.match(line)
            score = float(match.group(1)) if match else None
            scores.append(score)
            if i == len(input["responses"]) - 1:
                break
        return {"scores": scores}

    @override
    def get_structured_output(self) -> Dict[str, Any]:
        """Creates the json schema to be passed to the LLM, to enforce generating
        a dictionary with the output which can be directly parsed as a python dictionary.

        The schema corresponds to the following:

        ```python
        from pydantic import BaseModel
        from typing import List

        class SchemaQualityScorer(BaseModel):
            scores: List[int]
        ```

        Returns:
            JSON Schema of the response to enforce.
        """
        return {
            "properties": {
                "scores": {
                    "items": {"type": "integer"},
                    "title": "Scores",
                    "type": "array",
                }
            },
            "required": ["scores"],
            "title": "SchemaQualityScorer",
            "type": "object",
        }

    def _format_structured_output(
        self, output: str, input: Dict[str, Any]
    ) -> Dict[str, str]:
        """Parses the structured response, which should correspond to a dictionary
        with the scores, and a list with them.

        Args:
            output: The output from the `LLM`.

        Returns:
            Formatted output.
        """
        try:
            return orjson.loads(output)
        except orjson.JSONDecodeError:
            return {"scores": [None] * len(input["responses"])}

    @override
    def _sample_input(self) -> ChatType:
        return self.format_input(
            {
                "instruction": f"<PLACEHOLDER_{'instruction'.upper()}>",
                "responses": [
                    f"<PLACEHOLDER_{f'RESPONSE_{i}'.upper()}>" for i in range(2)
                ],
            }
        )
inputs property

任务的输入是 instructionresponses

outputs property

任务的输出是一个 scores 列表,其中包含 responses 中每个响应的质量分数。

load()

加载 Jinja2 模板。

源代码位于 src/distilabel/steps/tasks/quality_scorer.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "quality-scorer.jinja2"
    )

    self._template = Template(open(_path).read())
format_input(input)

输入被格式化为 ChatType,假设指令是用户在对话中的首次互动。

源代码位于 src/distilabel/steps/tasks/quality_scorer.py
def format_input(self, input: Dict[str, Any]) -> ChatType:  # type: ignore
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                instruction=input["instruction"], responses=input["responses"]
            ),
        }
    ]
format_output(output, input)

输出被格式化为一个列表,其中包含每个指令-响应对的分数。

参数
  • output (Union[str, None]，必需): LLM 的原始输出。
  • input (Dict[str, Any]，必需): 任务的输入。用于获取响应数量。

返回
  • Dict[str, Any]: 一个字典，键为 scores，其中包含每个指令-响应对的分数。

源代码位于 src/distilabel/steps/tasks/quality_scorer.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a list with the score of each instruction-response pair.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict with the key `scores` containing the scores for each instruction-response pair.
    """
    if output is None:
        return {"scores": [None] * len(input["responses"])}

    if self.use_default_structured_output:
        return self._format_structured_output(output, input)

    scores = []
    score_lines = output.split("\n")

    for i, line in enumerate(score_lines):
        match = _PARSE_SCORE_LINE_REGEX.match(line)
        score = float(match.group(1)) if match else None
        scores.append(score)
        if i == len(input["responses"]) - 1:
            break
    return {"scores": scores}
get_structured_output()

创建要传递给 LLM 的 json 模式,以强制生成一个字典,该字典的输出可以直接解析为 python 字典。

该模式对应于以下内容

from pydantic import BaseModel
from typing import List

class SchemaQualityScorer(BaseModel):
    scores: List[int]

返回
  • Dict[str, Any]: 强制执行的响应的 JSON 模式。

源代码位于 src/distilabel/steps/tasks/quality_scorer.py
@override
def get_structured_output(self) -> Dict[str, Any]:
    """Creates the json schema to be passed to the LLM, to enforce generating
    a dictionary with the output which can be directly parsed as a python dictionary.

    The schema corresponds to the following:

    ```python
    from pydantic import BaseModel
    from typing import List

    class SchemaQualityScorer(BaseModel):
        scores: List[int]
    ```

    Returns:
        JSON Schema of the response to enforce.
    """
    return {
        "properties": {
            "scores": {
                "items": {"type": "integer"},
                "title": "Scores",
                "type": "array",
            }
        },
        "required": ["scores"],
        "title": "SchemaQualityScorer",
        "type": "object",
    }
_format_structured_output(output, input)

解析结构化响应，该响应应为一个字典，其中键 scores 对应由各个分数组成的列表。

参数
  • output (str，必需): 来自 LLM 的输出。

返回
  • Dict[str, str]: 格式化后的输出。

源代码位于 src/distilabel/steps/tasks/quality_scorer.py
def _format_structured_output(
    self, output: str, input: Dict[str, Any]
) -> Dict[str, str]:
    """Parses the structured response, which should correspond to a dictionary
    with the scores, and a list with them.

    Args:
        output: The output from the `LLM`.

    Returns:
        Formatted output.
    """
    try:
        return orjson.loads(output)
    except orjson.JSONDecodeError:
        return {"scores": [None] * len(input["responses"])}

SelfInstruct

基类:Task

使用 LLM 基于给定输入生成指令。

SelfInstruct 是一个预定义的任务,它在给定一定数量的指令、查询生成标准、应用程序描述和输入的情况下,生成与给定输入相关并遵循查询生成标准和应用程序描述中声明的内容的指令。它基于论文 "Self-Instruct: Aligning Language Models with Self-Generated Instructions" 中的 SelfInstruct 框架。

属性

名称 类型 描述
num_instructions int

要生成的指令数量。默认为 5。

criteria_for_query_generation str

查询生成的标准。默认为论文中定义的标准。

application_description str

想要使用这些指令构建的 AI 应用程序的描述。默认为 AI 助手

输入列
  • input (str): 用于生成指令的输入。在论文中也称为种子 (seed)。
输出列
  • instructions (List[str]): 生成的指令。
  • model_name (str): 用于生成指令的模型名称。
类别
  • 文本生成
参考
  • Self-Instruct: Aligning Language Models with Self-Generated Instructions (https://arxiv.org/abs/2212.10560)

示例

基于给定输入生成指令

from distilabel.steps.tasks import SelfInstruct
from distilabel.models import InferenceEndpointsLLM

self_instruct = SelfInstruct(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    num_instructions=5,  # This is the default value
)

self_instruct.load()

result = next(self_instruct.process([{"input": "instruction"}]))
# result
# [
#     {
#         'input': 'instruction',
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#         'instructions': ["instruction 1", "instruction 2", "instruction 3", "instruction 4", "instruction 5"],
#     }
# ]
引用
@misc{wang2023selfinstructaligninglanguagemodels,
    title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},
    author={Yizhong Wang and Yeganeh Kordi and Swaroop Mishra and Alisa Liu and Noah A. Smith and Daniel Khashabi and Hannaneh Hajishirzi},
    year={2023},
    eprint={2212.10560},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2212.10560},
}
源代码位于 src/distilabel/steps/tasks/self_instruct.py
class SelfInstruct(Task):
    """Generate instructions based on a given input using an `LLM`.

    `SelfInstruct` is a pre-defined task that, given a number of instructions, a
    certain criteria for query generations, an application description, and an input,
    generates a number of instructions related to the given input and following what
    is stated in the criteria for query generation and the application description.
    It is based on the SelfInstruct framework from the paper "Self-Instruct: Aligning
    Language Models with Self-Generated Instructions".

    Attributes:
        num_instructions: The number of instructions to be generated. Defaults to 5.
        criteria_for_query_generation: The criteria for the query generation. Defaults
            to the criteria defined within the paper.
        application_description: The description of the AI application that one want
            to build with these instructions. Defaults to `AI assistant`.

    Input columns:
        - input (`str`): The input to generate the instructions. It's also called seed in
            the paper.

    Output columns:
        - instructions (`List[str]`): The generated instructions.
        - model_name (`str`): The model name used to generate the instructions.

    Categories:
        - text-generation

    Reference:
        - [`Self-Instruct: Aligning Language Models with Self-Generated Instructions`](https://arxiv.org/abs/2212.10560)

    Examples:
        Generate instructions based on a given input:

        ```python
        from distilabel.steps.tasks import SelfInstruct
        from distilabel.models import InferenceEndpointsLLM

        self_instruct = SelfInstruct(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            num_instructions=5,  # This is the default value
        )

        self_instruct.load()

        result = next(self_instruct.process([{"input": "instruction"}]))
        # result
        # [
        #     {
        #         'input': 'instruction',
        #         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
        #         'instructions': ["instruction 1", "instruction 2", "instruction 3", "instruction 4", "instruction 5"],
        #     }
        # ]
        ```

    Citations:
        ```
        @misc{wang2023selfinstructaligninglanguagemodels,
            title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},
            author={Yizhong Wang and Yeganeh Kordi and Swaroop Mishra and Alisa Liu and Noah A. Smith and Daniel Khashabi and Hannaneh Hajishirzi},
            year={2023},
            eprint={2212.10560},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2212.10560},
        }
        ```
    """

    num_instructions: int = 5
    criteria_for_query_generation: str = (
        "Incorporate a diverse range of verbs, avoiding repetition.\n"
        "Ensure queries are compatible with AI model's text generation functions and are limited to 1-2 sentences.\n"
        "Design queries to be self-contained and standalone.\n"
        'Blend interrogative (e.g., "What is the significance of x?") and imperative (e.g., "Detail the process of x.") styles.'
    )
    application_description: str = "AI assistant"

    _template: Union[Template, None] = PrivateAttr(...)
    _can_be_used_with_offline_batch_generation = True

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "self-instruct.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `input` i.e. seed text."""
        return ["input"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "user",
                "content": self._template.render(
                    input=input["input"],
                    application_description=self.application_description,
                    criteria_for_query_generation=self.criteria_for_query_generation,
                    num_instructions=self.num_instructions,
                ),
            }
        ]

    @property
    def outputs(self):
        """The output for the task is a list of `instructions` containing the generated instructions."""
        return ["instructions", "model_name"]

    def format_output(
        self,
        output: Union[str, None],
        input: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        """The output is formatted as a list with the generated instructions.

        Args:
            output: the raw output of the LLM.
            input: the input to the task. Used for obtaining the number of responses.

        Returns:
            A dict containing the generated instructions.
        """
        if output is None:
            return {"instructions": []}
        return {"instructions": [line for line in output.split("\n") if line != ""]}
inputs property

该任务的输入是 input,即种子文本。

outputs property

该任务的输出是包含生成指令的 instructions 列表。

load()

加载 Jinja2 模板。

源代码位于 src/distilabel/steps/tasks/self_instruct.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "self-instruct.jinja2"
    )

    self._template = Template(open(_path).read())
format_input(input)

输入被格式化为 ChatType,假设指令是用户在对话中的首次互动。

源代码位于 src/distilabel/steps/tasks/self_instruct.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "user",
            "content": self._template.render(
                input=input["input"],
                application_description=self.application_description,
                criteria_for_query_generation=self.criteria_for_query_generation,
                num_instructions=self.num_instructions,
            ),
        }
    ]
format_output(output, input=None)

输出格式化为包含生成指令的列表。

参数
  • output (Union[str, None]，必需): LLM 的原始输出。
  • input (Optional[Dict[str, Any]]，默认 None): 任务的输入。用于获取响应数量。

返回
  • Dict[str, Any]: 包含生成指令的字典。

源代码位于 src/distilabel/steps/tasks/self_instruct.py
def format_output(
    self,
    output: Union[str, None],
    input: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """The output is formatted as a list with the generated instructions.

    Args:
        output: the raw output of the LLM.
        input: the input to the task. Used for obtaining the number of responses.

    Returns:
        A dict containing the generated instructions.
    """
    if output is None:
        return {"instructions": []}
    return {"instructions": [line for line in output.split("\n") if line != ""]}

GenerateSentencePair

基类:Task

给定锚句子,生成正面和负面(可选)句子。

GenerateSentencePair 是一个预定义的任务,它在给定锚句子的情况下,生成与锚句子相关的正面句子,以及可选的与锚句子无关或与之相似的负面句子。可选地,您可以提供上下文来引导 LLM 朝向更具体的行为。此任务对于生成训练数据集以训练嵌入模型非常有用。

属性

名称 类型 描述
triplet bool

一个标志,指示任务是否应生成三元组句子(锚句子、正面句子、负面句子)。默认为 False

action GenerationAction

执行的操作以生成正面句子。

context str

用于生成的上下文。可以帮助引导 LLM 朝向更具体的上下文。默认情况下不使用。

hard_negative bool

一个标志，指示负面句子是否应为硬负例。硬负例与正面句子具有更高的语义相似度，从而使模型更难将二者区分开。

输入列
  • anchor (str): 用于生成正面和负面句子的锚句子。
输出列
  • positive (str): 与 anchor 相关的正面句子。
  • negative (str): 如果 triplet=True,则为与 anchor 无关的负面句子;或者,如果 hard_negative=True,则为与正面句子更相似的负面句子,以增加模型区分的难度。
  • model_name (str): 用于生成句子的模型名称。
类别
  • embedding

示例

释义

from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="paraphrase",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])

生成语义相似的句子

from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="semantically-similar",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}])

生成查询

from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="query",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}])

生成答案

from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="answer",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])

生成带上下文的查询(适用于每个操作

from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="query",
    context="Argilla is an open-source data curation platform for LLMs.",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])

生成硬负例(适用于每个操作

from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="query",
    context="Argilla is an open-source data curation platform for LLMs.",
    hard_negative=True,
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])

Generating structured data with the default schema (applies to every action)

from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="query",
    context="Argilla is an open-source data curation platform for LLMs.",
    hard_negative=True,
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
    use_default_structured_output=True
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])
Source code in src/distilabel/steps/tasks/sentence_transformers.py
class GenerateSentencePair(Task):
    """Generate a positive and negative (optionally) sentences given an anchor sentence.

    `GenerateSentencePair` is a pre-defined task that given an anchor sentence generates
    a positive sentence related to the anchor and optionally a negative sentence unrelated
    to the anchor or similar to it. Optionally, you can give a context to guide the LLM
    towards more specific behavior. This task is useful to generate training datasets for
    training embeddings models.

    Attributes:
        triplet: a flag to indicate if the task should generate a triplet of sentences
            (anchor, positive, negative). Defaults to `False`.
        action: the action to perform to generate the positive sentence.
        context: the context to use for the generation. Can be helpful to guide the LLM
            towards more specific context. Not used by default.
        hard_negative: A flag to indicate if the negative should be a hard-negative or not.
            Hard negatives make it hard for the model to distinguish against the positive,
            with a higher degree of semantic similarity.

    Input columns:
        - anchor (`str`): The anchor sentence to generate the positive and negative sentences.

    Output columns:
        - positive (`str`): The positive sentence related to the `anchor`.
        - negative (`str`): The negative sentence unrelated to the `anchor` if `triplet=True`,
            or more similar to the positive to make it more challenging for a model to distinguish
            in case `hard_negative=True`.
        - model_name (`str`): The name of the model that was used to generate the sentences.

    Categories:
        - embedding

    Examples:
        Paraphrasing:

        ```python
        from distilabel.steps.tasks import GenerateSentencePair
        from distilabel.models import InferenceEndpointsLLM

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="paraphrase",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
        ```

        Generating semantically similar sentences:

        ```python
        from distilabel.models import InferenceEndpointsLLM
        from distilabel.steps.tasks import GenerateSentencePair

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="semantically-similar",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}])
        ```

        Generating queries:

        ```python
        from distilabel.steps.tasks import GenerateSentencePair
        from distilabel.models import InferenceEndpointsLLM

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="query",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}])
        ```

        Generating answers:

        ```python
        from distilabel.steps.tasks import GenerateSentencePair
        from distilabel.models import InferenceEndpointsLLM

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="answer",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
        ```

        Generating queries with context (**applies to every action**):

        ```python
        from distilabel.steps.tasks import GenerateSentencePair
        from distilabel.models import InferenceEndpointsLLM

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="query",
            context="Argilla is an open-source data curation platform for LLMs.",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])
        ```

        Generating Hard-negatives (**applies to every action**):

        ```python
        from distilabel.steps.tasks import GenerateSentencePair
        from distilabel.models import InferenceEndpointsLLM

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="query",
            context="Argilla is an open-source data curation platform for LLMs.",
            hard_negative=True,
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])
        ```

        Generating structured data with default schema (**applies to every action**):

        ```python
        from distilabel.steps.tasks import GenerateSentencePair
        from distilabel.models import InferenceEndpointsLLM

        generate_sentence_pair = GenerateSentencePair(
            triplet=True, # `False` to generate only positive
            action="query",
            context="Argilla is an open-source data curation platform for LLMs.",
            hard_negative=True,
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            input_batch_size=10,
            use_default_structured_output=True
        )

        generate_sentence_pair.load()

        result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])
        ```
    """

    triplet: bool = False
    action: GenerationAction
    hard_negative: bool = False
    context: str = ""

    def load(self) -> None:
        """Loads the Jinja2 template."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "generate-sentence-pair.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The inputs for the task is the `anchor` sentence."""
        return ["anchor"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The inputs are formatted as a `ChatType`, with a system prompt describing the
        task of generating a positive and negative sentences for the anchor sentence. The
        anchor is provided as the first user interaction in the conversation.

        Args:
            input: The input containing the `anchor` sentence.

        Returns:
            A list of dictionaries containing the system and user interactions.
        """
        action_sentence = GENERATION_ACTION_SENTENCES[self.action]

        format_system_prompt = {
            "action_sentence": action_sentence,
            "context": CONTEXT_INTRO if self.context else "",
        }
        if self.triplet:
            format_system_prompt["negative_style"] = NEGATIVE_STYLE[
                "hard-negative" if self.hard_negative else "negative"
            ]

        system_prompt = (
            POSITIVE_NEGATIVE_SYSTEM_PROMPT if self.triplet else POSITIVE_SYSTEM_PROMPT
        ).format(**format_system_prompt)

        return [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": self._template.render(
                    anchor=input["anchor"],
                    context=self.context if self.context else None,
                ),
            },
        ]

    @property
    def outputs(self) -> List[str]:
        """The outputs for the task are the `positive` and `negative` sentences, as well
        as the `model_name` used to generate the sentences."""
        columns = ["positive", "negative"] if self.triplet else ["positive"]
        columns += ["model_name"]
        return columns

    def format_output(
        self, output: Union[str, None], input: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        """Formats the output of the LLM, to extract the `positive` and `negative` sentences
        generated. If the output is `None` or the regex doesn't match, then the outputs
        will be set to `None` as well.

        Args:
            output: The output of the LLM.
            input: The input used to generate the output.

        Returns:
            The formatted output containing the `positive` and `negative` sentences.
        """
        if output is None:
            return {"positive": None, "negative": None}

        if self.use_default_structured_output:
            return self._format_structured_output(output)

        match = POSITIVE_NEGATIVE_PAIR_REGEX.search(output)
        if match is None:
            formatted_output = {"positive": None}
            if self.triplet:
                formatted_output["negative"] = None
            return formatted_output

        groups = match.groups()
        if self.triplet:
            return {
                "positive": groups[0].strip(),
                "negative": (
                    groups[1].strip()
                    if len(groups) > 1 and groups[1] is not None
                    else None
                ),
            }

        return {"positive": groups[0].strip()}

    @override
    def get_structured_output(self) -> Dict[str, Any]:
        """Creates the json schema to be passed to the LLM, to enforce generating
        a dictionary with the output which can be directly parsed as a python dictionary.

        Returns:
            JSON Schema of the response to enforce.
        """
        if self.triplet:
            return {
                "properties": {
                    "positive": {"title": "Positive", "type": "string"},
                    "negative": {"title": "Negative", "type": "string"},
                },
                "required": ["positive", "negative"],
                "title": "Schema",
                "type": "object",
            }
        return {
            "properties": {"positive": {"title": "Positive", "type": "string"}},
            "required": ["positive"],
            "title": "Schema",
            "type": "object",
        }

    def _format_structured_output(self, output: str) -> Dict[str, str]:
        """Parses the structured response, which should correspond to a dictionary
        with either `positive`, or `positive` and `negative` keys.

        Args:
            output: The output from the `LLM`.

        Returns:
            Formatted output.
        """
        try:
            return orjson.loads(output)
        except orjson.JSONDecodeError:
            if self.triplet:
                return {"positive": None, "negative": None}
            return {"positive": None}
inputs property

The input for the task is the `anchor` sentence.

outputs property

The outputs for the task are the `positive` and `negative` sentences, as well as the `model_name` used to generate them.

load()

Loads the Jinja2 template.

Source code in src/distilabel/steps/tasks/sentence_transformers.py
def load(self) -> None:
    """Loads the Jinja2 template."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "generate-sentence-pair.jinja2"
    )

    self._template = Template(open(_path).read())
format_input(input)

The inputs are formatted as a `ChatType`, with a system prompt describing the task of generating positive and negative sentences for the anchor sentence. The anchor is provided as the first user interaction in the conversation.

Parameters

  • input (Dict[str, Any]): The input containing the `anchor` sentence. Required.

Returns

  • ChatType: A list of dictionaries containing the system and user interactions.
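As a rough illustration (not part of the library reference), the following minimal sketch shows the shape of the messages built by format_input; it assumes valid Hugging Face Inference Endpoints credentials, just like the examples above.

from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

task = GenerateSentencePair(
    triplet=True,
    action="paraphrase",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
)
task.load()

messages = task.format_input({"anchor": "How does 3D printing work?"})
# `messages` is a ChatType: a system prompt describing the generation task, followed
# by the user turn rendered from the Jinja2 template, roughly:
# [
#     {"role": "system", "content": "..."},
#     {"role": "user", "content": "... How does 3D printing work? ..."},
# ]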

Source code in src/distilabel/steps/tasks/sentence_transformers.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The inputs are formatted as a `ChatType`, with a system prompt describing the
    task of generating a positive and negative sentences for the anchor sentence. The
    anchor is provided as the first user interaction in the conversation.

    Args:
        input: The input containing the `anchor` sentence.

    Returns:
        A list of dictionaries containing the system and user interactions.
    """
    action_sentence = GENERATION_ACTION_SENTENCES[self.action]

    format_system_prompt = {
        "action_sentence": action_sentence,
        "context": CONTEXT_INTRO if self.context else "",
    }
    if self.triplet:
        format_system_prompt["negative_style"] = NEGATIVE_STYLE[
            "hard-negative" if self.hard_negative else "negative"
        ]

    system_prompt = (
        POSITIVE_NEGATIVE_SYSTEM_PROMPT if self.triplet else POSITIVE_SYSTEM_PROMPT
    ).format(**format_system_prompt)

    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": self._template.render(
                anchor=input["anchor"],
                context=self.context if self.context else None,
            ),
        },
    ]
format_output(output, input=None)

Formats the output of the LLM to extract the generated `positive` and `negative` sentences. If the output is `None` or the regex does not match, the outputs will be set to `None` as well.

Parameters

  • output (Union[str, None]): The output of the LLM. Required.
  • input (Optional[Dict[str, Any]]): The input used to generate the output. Defaults to None.

Returns

  • Dict[str, Any]: The formatted output containing the `positive` and `negative` sentences.
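A minimal, self-contained sketch of the structured parsing path (the raw output string below is hypothetical): when use_default_structured_output=True the raw output is expected to be JSON and is loaded directly; otherwise an internal regex is used and, if it does not match, the columns are set to None.

import json

# Hypothetical raw LLM output when `use_default_structured_output=True`.
raw_output = '{"positive": "A related sentence.", "negative": "An unrelated sentence."}'

# Equivalent to the `orjson.loads` call in the source below.
parsed = json.loads(raw_output)
print(parsed)  # {'positive': 'A related sentence.', 'negative': 'An unrelated sentence.'}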

Source code in src/distilabel/steps/tasks/sentence_transformers.py
def format_output(
    self, output: Union[str, None], input: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Formats the output of the LLM, to extract the `positive` and `negative` sentences
    generated. If the output is `None` or the regex doesn't match, then the outputs
    will be set to `None` as well.

    Args:
        output: The output of the LLM.
        input: The input used to generate the output.

    Returns:
        The formatted output containing the `positive` and `negative` sentences.
    """
    if output is None:
        return {"positive": None, "negative": None}

    if self.use_default_structured_output:
        return self._format_structured_output(output)

    match = POSITIVE_NEGATIVE_PAIR_REGEX.search(output)
    if match is None:
        formatted_output = {"positive": None}
        if self.triplet:
            formatted_output["negative"] = None
        return formatted_output

    groups = match.groups()
    if self.triplet:
        return {
            "positive": groups[0].strip(),
            "negative": (
                groups[1].strip()
                if len(groups) > 1 and groups[1] is not None
                else None
            ),
        }

    return {"positive": groups[0].strip()}
get_structured_output()

Creates the JSON schema to be passed to the LLM, to enforce generating output that can be directly parsed as a Python dictionary.

Returns

  • Dict[str, Any]: The JSON schema of the response to enforce.

Source code in src/distilabel/steps/tasks/sentence_transformers.py
@override
def get_structured_output(self) -> Dict[str, Any]:
    """Creates the json schema to be passed to the LLM, to enforce generating
    a dictionary with the output which can be directly parsed as a python dictionary.

    Returns:
        JSON Schema of the response to enforce.
    """
    if self.triplet:
        return {
            "properties": {
                "positive": {"title": "Positive", "type": "string"},
                "negative": {"title": "Negative", "type": "string"},
            },
            "required": ["positive", "negative"],
            "title": "Schema",
            "type": "object",
        }
    return {
        "properties": {"positive": {"title": "Positive", "type": "string"}},
        "required": ["positive"],
        "title": "Schema",
        "type": "object",
    }
_format_structured_output(output)

Parses the structured response, which should correspond to a dictionary with either a `positive` key, or `positive` and `negative` keys.

Parameters

  • output (str): The output from the LLM. Required.

Returns

  • Dict[str, str]: The formatted output.

Source code in src/distilabel/steps/tasks/sentence_transformers.py
def _format_structured_output(self, output: str) -> Dict[str, str]:
    """Parses the structured response, which should correspond to a dictionary
    with either `positive`, or `positive` and `negative` keys.

    Args:
        output: The output from the `LLM`.

    Returns:
        Formatted output.
    """
    try:
        return orjson.loads(output)
    except orjson.JSONDecodeError:
        if self.triplet:
            return {"positive": None, "negative": None}
        return {"positive": None}

StructuredGeneration

Bases: Task

Generate structured content for a given `instruction` using an `LLM`.

`StructuredGeneration` is a pre-defined task that defines the `instruction` and the `structured_output` as inputs, and `generation` as the output. It is used to generate structured content based on the input instruction, following the schema provided within the `structured_output` column of each `instruction`. The `model_name` is also returned as part of the output to enrich it.

Attributes
  • use_system_prompt (bool): Whether to use the system prompt in the generation. Defaults to `False` (see the source below); when set to `True`, the `system_prompt` column will be used if it is defined within the input batch, otherwise it will be ignored.

Input columns
  • instruction (str): The instruction to generate structured content from.
  • structured_output (Dict[str, Any]): The specification to generate structured content from. It should be a Python dictionary with the keys format and schema, where format is one of json or regex, and schema is either a JSON schema or a regex pattern, respectively (see the sketch after this list).
Output columns
  • generation (str): The generated text, matching the provided schema if possible.
  • model_name (str): The name of the model used to generate the text.
Categories
  • outlines
  • structured-generation
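As referenced in the input columns above, here is a minimal sketch (not from the library reference) of building the structured_output column from a Pydantic model; it yields the same kind of dictionary used in the JSON-schema example below, and the Character model is hypothetical.

from pydantic import BaseModel

class Character(BaseModel):
    name: str
    description: str
    role: str
    weapon: str

row = {
    "instruction": "Create an RPG character",
    "structured_output": {
        "format": "json",
        # `model_json_schema()` produces the JSON schema dictionary expected under `schema`.
        "schema": Character.model_json_schema(),
    },
}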

Examples

Generating structured output from a JSON schema

from distilabel.steps.tasks import StructuredGeneration
from distilabel.models import InferenceEndpointsLLM

structured_gen = StructuredGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
)

structured_gen.load()

result = next(
    structured_gen.process(
        [
            {
                "instruction": "Create an RPG character",
                "structured_output": {
                    "format": "json",
                    "schema": {
                        "properties": {
                            "name": {
                                "title": "Name",
                                "type": "string"
                            },
                            "description": {
                                "title": "Description",
                                "type": "string"
                            },
                            "role": {
                                "title": "Role",
                                "type": "string"
                            },
                            "weapon": {
                                "title": "Weapon",
                                "type": "string"
                            }
                        },
                        "required": [
                            "name",
                            "description",
                            "role",
                            "weapon"
                        ],
                        "title": "Character",
                        "type": "object"
                    }
                },
            }
        ]
    )
)

Generating structured output from a regex pattern (only works with LLMs that support regex, i.e. the providers using outlines)

from distilabel.steps.tasks import StructuredGeneration
from distilabel.models import InferenceEndpointsLLM

structured_gen = StructuredGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
)

structured_gen.load()

result = next(
    structured_gen.process(
        [
            {
                "instruction": "What's the weather like today in Seattle in Celsius degrees?",
                "structured_output": {
                    "format": "regex",
                    "schema": r"(\d{1,2})°C"
                },

            }
        ]
    )
)
Source code in src/distilabel/steps/tasks/structured_generation.py
class StructuredGeneration(Task):
    """Generate structured content for a given `instruction` using an `LLM`.

    `StructuredGeneration` is a pre-defined task that defines the `instruction` and the `structured_output`
    as the inputs, and `generation` as the output. This task is used to generate structured content based on
    the input instruction and following the schema provided within the `structured_output` column per each
    `instruction`. The `model_name` also returned as part of the output in order to enhance it.

    Attributes:
        use_system_prompt: Whether to use the system prompt in the generation. Defaults to `True`,
            which means that if the column `system_prompt` is  defined within the input batch, then
            the `system_prompt` will be used, otherwise, it will be ignored.

    Input columns:
        - instruction (`str`): The instruction to generate structured content from.
        - structured_output (`Dict[str, Any]`): The structured_output to generate structured content from. It should be a
            Python dictionary with the keys `format` and `schema`, where `format` should be one of `json` or
            `regex`, and the `schema` should be either the JSON schema or the regex pattern, respectively.

    Output columns:
        - generation (`str`): The generated text matching the provided schema, if possible.
        - model_name (`str`): The name of the model used to generate the text.

    Categories:
        - outlines
        - structured-generation

    Examples:
        Generate structured output from a JSON schema:

        ```python
        from distilabel.steps.tasks import StructuredGeneration
        from distilabel.models import InferenceEndpointsLLM

        structured_gen = StructuredGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            ),
        )

        structured_gen.load()

        result = next(
            structured_gen.process(
                [
                    {
                        "instruction": "Create an RPG character",
                        "structured_output": {
                            "format": "json",
                            "schema": {
                                "properties": {
                                    "name": {
                                        "title": "Name",
                                        "type": "string"
                                    },
                                    "description": {
                                        "title": "Description",
                                        "type": "string"
                                    },
                                    "role": {
                                        "title": "Role",
                                        "type": "string"
                                    },
                                    "weapon": {
                                        "title": "Weapon",
                                        "type": "string"
                                    }
                                },
                                "required": [
                                    "name",
                                    "description",
                                    "role",
                                    "weapon"
                                ],
                                "title": "Character",
                                "type": "object"
                            }
                        },
                    }
                ]
            )
        )
        ```

        Generate structured output from a regex pattern (only works with LLMs that support regex, the providers using outlines):

        ```python
        from distilabel.steps.tasks import StructuredGeneration
        from distilabel.models import InferenceEndpointsLLM

        structured_gen = StructuredGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            ),
        )

        structured_gen.load()

        result = next(
            structured_gen.process(
                [
                    {
                        "instruction": "What's the weather like today in Seattle in Celsius degrees?",
                        "structured_output": {
                            "format": "regex",
                            "schema": r"(\\d{1,2})°C"
                        },

                    }
                ]
            )
        )
        ```
    """

    use_system_prompt: bool = False

    @property
    def inputs(self) -> List[str]:
        """The input for the task are the `instruction` and the `structured_output`.
        Optionally, if the `use_system_prompt` flag is set to True, then the
        `system_prompt` will be used too."""
        columns = ["instruction", "structured_output"]
        if self.use_system_prompt:
            columns = ["system_prompt"] + columns
        return columns

    def format_input(self, input: Dict[str, Any]) -> StructuredInput:
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        if not isinstance(input["instruction"], str):
            raise DistilabelUserError(
                f"Input `instruction` must be a string. Got: {input['instruction']}.",
                page="components-gallery/tasks/structuredgeneration/",
            )

        messages = [{"role": "user", "content": input["instruction"]}]
        if self.use_system_prompt:
            if "system_prompt" in input:
                messages.insert(
                    0, {"role": "system", "content": input["system_prompt"]}
                )
            else:
                warnings.warn(
                    "`use_system_prompt` is set to `True`, but no `system_prompt` in input batch, so it will be ignored.",
                    UserWarning,
                    stacklevel=2,
                )

        return (messages, input.get("structured_output", None))  # type: ignore

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `generation` and the `model_name`."""
        return ["generation", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `generation`. The `model_name`
        will be automatically included within the `process` method of `Task`. Note that even
        if the `structured_output` is defined to produce a JSON schema, this method will return the raw
        output i.e. a string without any parsing."""
        return {"generation": output}
inputs property

The inputs for the task are the `instruction` and the `structured_output`. Optionally, if the `use_system_prompt` flag is set to `True`, the `system_prompt` will be used too.

outputs property

The outputs for the task are the `generation` and the `model_name`.

format_input(input)

The input is formatted as a `ChatType`, assuming that the instruction is the first user interaction in the conversation.
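A minimal sketch (hypothetical row) of the value this method returns: a StructuredInput tuple containing the chat messages and the row's structured_output, rather than a plain ChatType.

row = {
    "system_prompt": "You are a game master.",
    "instruction": "Create an RPG character",
    "structured_output": {"format": "json", "schema": {"type": "object"}},
}

# With `use_system_prompt=True`, `format_input(row)` returns a tuple shaped like:
# (
#     [
#         {"role": "system", "content": "You are a game master."},
#         {"role": "user", "content": "Create an RPG character"},
#     ],
#     {"format": "json", "schema": {"type": "object"}},
# )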

Source code in src/distilabel/steps/tasks/structured_generation.py
def format_input(self, input: Dict[str, Any]) -> StructuredInput:
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    if not isinstance(input["instruction"], str):
        raise DistilabelUserError(
            f"Input `instruction` must be a string. Got: {input['instruction']}.",
            page="components-gallery/tasks/structuredgeneration/",
        )

    messages = [{"role": "user", "content": input["instruction"]}]
    if self.use_system_prompt:
        if "system_prompt" in input:
            messages.insert(
                0, {"role": "system", "content": input["system_prompt"]}
            )
        else:
            warnings.warn(
                "`use_system_prompt` is set to `True`, but no `system_prompt` in input batch, so it will be ignored.",
                UserWarning,
                stacklevel=2,
            )

    return (messages, input.get("structured_output", None))  # type: ignore
format_output(output, input)

The output is formatted as a dictionary with the `generation`. The `model_name` will be automatically included within the `process` method of the `Task`. Note that even if the `structured_output` is defined to produce a JSON schema, this method returns the raw output, i.e. a string, without any parsing.
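Because the generation column is a raw string even when a JSON schema was enforced, downstream steps typically parse it themselves; a minimal sketch with a hypothetical result:

import json

# Hypothetical `generation` produced with the JSON-schema example above.
generation = '{"name": "Aria", "description": "A wandering bard.", "role": "Support", "weapon": "Lute"}'

character = json.loads(generation)
print(character["name"])  # Aria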

Source code in src/distilabel/steps/tasks/structured_generation.py
def format_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `generation`. The `model_name`
    will be automatically included within the `process` method of `Task`. Note that even
    if the `structured_output` is defined to produce a JSON schema, this method will return the raw
    output i.e. a string without any parsing."""
    return {"generation": output}

TextClassification

Bases: Task

Classifies text into one or more categories or labels.

This task can be used for text classification problems, where the goal is to assign one or multiple labels to a given text. By default it uses structured generation as per the reference paper, which can help to generate more concise labels; see section 4.1 of the reference.

Input columns
  • text (str): The reference text we want to obtain labels for.
Output columns
  • labels (Union[str, List[str]]): The label or list of labels for the text.
  • model_name (str): The name of the model used to generate the label(s).
Categories
  • text-classification
References
  • Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models (https://arxiv.org/abs/2408.02442)

Attributes
  • system_prompt (Optional[str]): The system prompt to use in the generation. Contains a default message that makes the model behave like a classification specialist.
  • n (PositiveInt): Number of labels to generate. If only 1 is required, this corresponds to a single-label classification problem; if >1, the "n" labels most representative of the text will be returned. Defaults to 1.
  • context (Optional[str]): Context to use when generating the labels. Contains a generic message by default, but can be used to customize the context for the task.
  • examples (Optional[List[str]]): List of examples (few shots) to help the model understand the task.
  • available_labels (Optional[Union[List[str], Dict[str, str]]]): List of available labels to choose from when classifying the text, or a dictionary with the labels and their descriptions.
  • default_label (Optional[Union[str, List[str]]]): Default label to use when the text is ambiguous or lacks sufficient information for classification. Can be a list in the case of multiple labels (n>1).
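Before the full examples below, a minimal sketch (not from the library reference) combining the examples (few-shot) and default_label attributes; the label set, example texts, and query are hypothetical, and valid Inference Endpoints credentials are assumed.

from distilabel.steps.tasks import TextClassification
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
)

text_classification = TextClassification(
    llm=llm,
    n=1,
    context="Classify the support ticket by urgency.",
    available_labels=["urgent", "normal"],
    examples=[
        "The server is down and customers cannot check out. -> urgent",
        "Could you add dark mode at some point? -> normal",
    ],
    default_label="normal",
)

text_classification.load()

result = next(
    text_classification.process([{"text": "The app crashes every time I open it."}])
)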

Examples

Assigning a sentiment to a text

from distilabel.steps.tasks import TextClassification
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
)

text_classification = TextClassification(
    llm=llm,
    context="You are an AI system specialized in assigning sentiment to movies.",
    available_labels=["positive", "negative"],
)

text_classification.load()

result = next(
    text_classification.process(
        [{"text": "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."}]
    )
)
# result
# [{'text': 'This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.',
# 'labels': 'positive',
# 'distilabel_metadata': {'raw_output_text_classification_0': '{\n    "labels": "positive"\n}',
# 'raw_input_text_classification_0': [{'role': 'system',
#     'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},
#     {'role': 'user',
#     'content': '# Instruction\nPlease classify the user query by assigning the most appropriate labels.\nDo not explain your reasoning or provide any additional commentary.\nIf the text is ambiguous or lacks sufficient information for classification, respond with "Unclassified".\nProvide the label that best describes the text.\nYou are an AI system specialized in assigning sentiment to movie the user queries.\n## Labeling the user input\nUse the available labels to classify the user query. Analyze the context of each label specifically:\navailable_labels = [\n    "positive",  # The text shows positive sentiment\n    "negative",  # The text shows negative sentiment\n]\n\n\n## User Query\n```\nThis was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\n```\n\n## Output Format\nNow, please give me the labels in JSON format, do not include any other text in your response:\n```\n{\n    "labels": "label"\n}\n```'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]

Assigning predefined labels with specified descriptions

from distilabel.steps.tasks import TextClassification

text_classification = TextClassification(
    llm=llm,
    n=1,
    context="Determine the intent of the text.",
    available_labels={
        "complaint": "A statement expressing dissatisfaction or annoyance about a product, service, or experience. It's a negative expression of discontent, often with the intention of seeking a resolution or compensation.",
        "inquiry": "A question or request for information about a product, service, or situation. It's a neutral or curious expression seeking clarification or details.",
        "feedback": "A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.",
        "praise": "A statement expressing admiration, approval, or appreciation for a product, service, or experience. It's a positive expression of satisfaction or delight, often with the intention of encouraging or recommending."
    },
    query_title="Customer Query",
)

text_classification.load()

result = next(
    text_classification.process(
        [{"text": "Can you tell me more about your return policy?"}]
    )
)
# result
# [{'text': 'Can you tell me more about your return policy?',
# 'labels': 'inquiry',
# 'distilabel_metadata': {'raw_output_text_classification_0': '{\n    "labels": "inquiry"\n}',
# 'raw_input_text_classification_0': [{'role': 'system',
#     'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},
#     {'role': 'user',
#     'content': '# Instruction\nPlease classify the customer query by assigning the most appropriate labels.\nDo not explain your reasoning or provide any additional commentary.\nIf the text is ambiguous or lacks sufficient information for classification, respond with "Unclassified".\nProvide the label that best describes the text.\nDetermine the intent of the text.\n## Labeling the user input\nUse the available labels to classify the user query. Analyze the context of each label specifically:\navailable_labels = [\n    "complaint",  # A statement expressing dissatisfaction or annoyance about a product, service, or experience. It\'s a negative expression of discontent, often with the intention of seeking a resolution or compensation.\n    "inquiry",  # A question or request for information about a product, service, or situation. It\'s a neutral or curious expression seeking clarification or details.\n    "feedback",  # A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\n    "praise",  # A statement expressing admiration, approval, or appreciation for a product, service, or experience. It\'s a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\n]\n\n\n## Customer Query\n```\nCan you tell me more about your return policy?\n```\n\n## Output Format\nNow, please give me the labels in JSON format, do not include any other text in your response:\n```\n{\n    "labels": "label"\n}\n```'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]

Free multi-label classification without predefined labels

from distilabel.steps.tasks import TextClassification

text_classification = TextClassification(
    llm=llm,
    n=3,
    context=(
        "Describe the main themes, topics, or categories that could describe the "
        "following type of persona."
    ),
    query_title="Example of Persona",
)

text_classification.load()

result = next(
    text_classification.process(
        [{"text": "A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States."}]
    )
)
# result
# [{'text': 'A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.',
# 'labels': ['Historical Researcher',
# 'Cultural Specialist',
# 'Ethnic Studies Expert'],
# 'distilabel_metadata': {'raw_output_text_classification_0': '{\n    "labels": ["Historical Researcher", "Cultural Specialist", "Ethnic Studies Expert"]\n}',
# 'raw_input_text_classification_0': [{'role': 'system',
#     'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},
#     {'role': 'user',
#     'content': '# Instruction\nPlease classify the example of persona by assigning the most appropriate labels.\nDo not explain your reasoning or provide any additional commentary.\nIf the text is ambiguous or lacks sufficient information for classification, respond with "Unclassified".\nProvide a list of 3 labels that best describe the text.\nDescribe the main themes, topics, or categories that could describe the following type of persona.\nUse clear, widely understood terms for labels.Avoid overly specific or obscure labels unless the text demands it.\n\n\n## Example of Persona\n```\nA historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\n```\n\n## Output Format\nNow, please give me the labels in JSON format, do not include any other text in your response:\n```\n{\n    "labels": ["label_0", "label_1", "label_2"]\n}\n```'}]},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
Source code in src/distilabel/steps/tasks/text_classification.py
class TextClassification(Task):
    r"""Classifies text into one or more categories or labels.

    This task can be used for text classification problems, where the goal is to assign
    one or multiple labels to a given text.
    It uses structured generation as per the reference paper by default,
    it can help to generate more concise labels. See section 4.1 in the reference.

    Input columns:
        - text (`str`): The reference text we want to obtain labels for.

    Output columns:
        - labels (`Union[str, List[str]]`): The label or list of labels for the text.
        - model_name (`str`): The name of the model used to generate the label/s.

    Categories:
        - text-classification

    References:
        - [`Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models`](https://arxiv.org/abs/2408.02442)

    Attributes:
        system_prompt: A prompt to display to the user before the task starts. Contains a default
            message to make the model behave like a classifier specialist.
        n: Number of labels to generate If only 1 is required, corresponds to a label
            classification problem, if >1 it will intend return the "n" labels most representative
            for the text. Defaults to 1.
        context: Context to use when generating the labels. By default contains a generic message,
            but can be used to customize the context for the task.
        examples: List of examples to help the model understand the task, few shots.
        available_labels: List of available labels to choose from when classifying the text, or
            a dictionary with the labels and their descriptions.
        default_label: Default label to use when the text is ambiguous or lacks sufficient information for
            classification. Can be a list in case of multiple labels (n>1).

    Examples:
        Assigning a sentiment to a text:

        ```python
        from distilabel.steps.tasks import TextClassification
        from distilabel.models import InferenceEndpointsLLM

        llm = InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        )

        text_classification = TextClassification(
            llm=llm,
            context="You are an AI system specialized in assigning sentiment to movies.",
            available_labels=["positive", "negative"],
        )

        text_classification.load()

        result = next(
            text_classification.process(
                [{"text": "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."}]
            )
        )
        # result
        # [{'text': 'This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.',
        # 'labels': 'positive',
        # 'distilabel_metadata': {'raw_output_text_classification_0': '{\n    "labels": "positive"\n}',
        # 'raw_input_text_classification_0': [{'role': 'system',
        #     'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},
        #     {'role': 'user',
        #     'content': '# Instruction\nPlease classify the user query by assigning the most appropriate labels.\nDo not explain your reasoning or provide any additional commentary.\nIf the text is ambiguous or lacks sufficient information for classification, respond with "Unclassified".\nProvide the label that best describes the text.\nYou are an AI system specialized in assigning sentiment to movie the user queries.\n## Labeling the user input\nUse the available labels to classify the user query. Analyze the context of each label specifically:\navailable_labels = [\n    "positive",  # The text shows positive sentiment\n    "negative",  # The text shows negative sentiment\n]\n\n\n## User Query\n```\nThis was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three.\n```\n\n## Output Format\nNow, please give me the labels in JSON format, do not include any other text in your response:\n```\n{\n    "labels": "label"\n}\n```'}]},
        # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
        ```

        Assigning predefined labels with specified descriptions:

        ```python
        from distilabel.steps.tasks import TextClassification

        text_classification = TextClassification(
            llm=llm,
            n=1,
            context="Determine the intent of the text.",
            available_labels={
                "complaint": "A statement expressing dissatisfaction or annoyance about a product, service, or experience. It's a negative expression of discontent, often with the intention of seeking a resolution or compensation.",
                "inquiry": "A question or request for information about a product, service, or situation. It's a neutral or curious expression seeking clarification or details.",
                "feedback": "A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.",
                "praise": "A statement expressing admiration, approval, or appreciation for a product, service, or experience. It's a positive expression of satisfaction or delight, often with the intention of encouraging or recommending."
            },
            query_title="Customer Query",
        )

        text_classification.load()

        result = next(
            text_classification.process(
                [{"text": "Can you tell me more about your return policy?"}]
            )
        )
        # result
        # [{'text': 'Can you tell me more about your return policy?',
        # 'labels': 'inquiry',
        # 'distilabel_metadata': {'raw_output_text_classification_0': '{\n    "labels": "inquiry"\n}',
        # 'raw_input_text_classification_0': [{'role': 'system',
        #     'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},
        #     {'role': 'user',
        #     'content': '# Instruction\nPlease classify the customer query by assigning the most appropriate labels.\nDo not explain your reasoning or provide any additional commentary.\nIf the text is ambiguous or lacks sufficient information for classification, respond with "Unclassified".\nProvide the label that best describes the text.\nDetermine the intent of the text.\n## Labeling the user input\nUse the available labels to classify the user query. Analyze the context of each label specifically:\navailable_labels = [\n    "complaint",  # A statement expressing dissatisfaction or annoyance about a product, service, or experience. It\'s a negative expression of discontent, often with the intention of seeking a resolution or compensation.\n    "inquiry",  # A question or request for information about a product, service, or situation. It\'s a neutral or curious expression seeking clarification or details.\n    "feedback",  # A statement providing evaluation, opinion, or suggestion about a product, service, or experience. It can be positive, negative, or neutral, and is often intended to help improve or inform.\n    "praise",  # A statement expressing admiration, approval, or appreciation for a product, service, or experience. It\'s a positive expression of satisfaction or delight, often with the intention of encouraging or recommending.\n]\n\n\n## Customer Query\n```\nCan you tell me more about your return policy?\n```\n\n## Output Format\nNow, please give me the labels in JSON format, do not include any other text in your response:\n```\n{\n    "labels": "label"\n}\n```'}]},
        # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
        ```

        Free multi label classification without predefined labels:

        ```python
        from distilabel.steps.tasks import TextClassification

        text_classification = TextClassification(
            llm=llm,
            n=3,
            context=(
                "Describe the main themes, topics, or categories that could describe the "
                "following type of persona."
            ),
            query_title="Example of Persona",
        )

        text_classification.load()

        result = next(
            text_classification.process(
                [{"text": "A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States."}]
            )
        )
        # result
        # [{'text': 'A historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.',
        # 'labels': ['Historical Researcher',
        # 'Cultural Specialist',
        # 'Ethnic Studies Expert'],
        # 'distilabel_metadata': {'raw_output_text_classification_0': '{\n    "labels": ["Historical Researcher", "Cultural Specialist", "Ethnic Studies Expert"]\n}',
        # 'raw_input_text_classification_0': [{'role': 'system',
        #     'content': 'You are an AI system specialized in generating labels to classify pieces of text. Your sole purpose is to analyze the given text and provide appropriate classification labels.'},
        #     {'role': 'user',
        #     'content': '# Instruction\nPlease classify the example of persona by assigning the most appropriate labels.\nDo not explain your reasoning or provide any additional commentary.\nIf the text is ambiguous or lacks sufficient information for classification, respond with "Unclassified".\nProvide a list of 3 labels that best describe the text.\nDescribe the main themes, topics, or categories that could describe the following type of persona.\nUse clear, widely understood terms for labels.Avoid overly specific or obscure labels unless the text demands it.\n\n\n## Example of Persona\n```\nA historian or curator of Mexican-American history and culture focused on the cultural, social, and historical impact of the Mexican presence in the United States.\n```\n\n## Output Format\nNow, please give me the labels in JSON format, do not include any other text in your response:\n```\n{\n    "labels": ["label_0", "label_1", "label_2"]\n}\n```'}]},
        # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
        ```
    """

    system_prompt: Optional[str] = (
        "You are an AI system specialized in generating labels to classify pieces of text. "
        "Your sole purpose is to analyze the given text and provide appropriate classification labels."
    )
    n: PositiveInt = Field(
        default=1,
        description="Number of labels to generate. Defaults to 1.",
    )
    context: Optional[str] = Field(
        default="Generate concise, relevant labels that accurately represent the text's main themes, topics, or categories.",
        description="Context to use when generating the labels.",
    )
    examples: Optional[List[str]] = Field(
        default=None,
        description="List of examples to help the model understand the task, few shots.",
    )
    available_labels: Optional[Union[List[str], Dict[str, str]]] = Field(
        default=None,
        description=(
            "List of available labels to choose from when classifying the text, or "
            "a dictionary with the labels and their descriptions."
        ),
    )
    default_label: Optional[Union[str, List[str]]] = Field(
        default="Unclassified",
        description=(
            "Default label to use when the text is ambiguous or lacks sufficient information for "
            "classification. Can be a list in case of multiple labels (n>1)."
        ),
    )
    query_title: str = Field(
        default="User Query",
        description="Title of the query used to show the example/s to classify.",
    )
    use_default_structured_output: bool = True

    _template: Optional[Template] = PrivateAttr(default=None)

    def load(self) -> None:
        super().load()
        self._template = Template(TEXT_CLASSIFICATION_TEMPLATE)
        self._labels_format: str = (
            '"label"'
            if self.n == 1
            else "[" + ", ".join([f'"label_{i}"' for i in range(self.n)]) + "]"
        )
        self._labels_message: str = (
            "Provide the label that best describes the text."
            if self.n == 1
            else f"Provide a list of {self.n} labels that best describe the text."
        )
        self._available_labels_message: str = self._get_available_labels_message()
        self._examples: str = self._get_examples_message()

    def _get_available_labels_message(self) -> str:
        """Prepares the message to display depending on the available labels (if any),
        and whether the labels have a specific context.
        """
        if self.available_labels is None:
            return (
                "Use clear, widely understood terms for labels."
                "Avoid overly specific or obscure labels unless the text demands it."
            )

        msg = (
            "## Labeling the user input\n"
            "Use the available labels to classify the user query{label_context}:\n"
            "available_labels = {available_labels}"
        )
        if isinstance(self.available_labels, list):
            specific_msg = (
                "[\n"
                + indent(
                    "".join([f'"{label}",\n' for label in self.available_labels]),
                    prefix=" " * 4,
                )
                + "]"
            )
            return msg.format(label_context="", available_labels=specific_msg)

        elif isinstance(self.available_labels, dict):
            specific_msg = ""
            for label, description in self.available_labels.items():
                specific_msg += indent(
                    f'"{label}",  # {description}' + "\n", prefix=" " * 4
                )

            specific_msg = "[\n" + specific_msg + "]"
            return msg.format(
                label_context=". Analyze the context of each label specifically",
                available_labels=specific_msg,
            )

    def _get_examples_message(self) -> str:
        """Prepares the message to display depending on the examples provided."""
        if self.examples is None:
            return ""

        examples_msg = "\n".join([f"- {ex}" for ex in self.examples])

        return (
            "\n## Examples\n"
            "Here are some examples to help you understand the task:\n"
            f"{examples_msg}"
        )

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`."""
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `generation` and the `model_name`."""
        return ["labels", "model_name"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        messages = [
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    context=f"\n{self.context}",
                    labels_message=self._labels_message,
                    available_labels=self._available_labels_message,
                    examples=self._examples,
                    default_label=self.default_label,
                    labels_format=self._labels_format,
                    query_title=self.query_title,
                    text=input["text"],
                ),
            },
        ]
        if self.system_prompt:
            messages.insert(0, {"role": "system", "content": self.system_prompt})
        return messages

    def format_output(
        self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `generation`. The `model_name`
        will be automatically included within the `process` method of `Task`."""
        return self._format_structured_output(output)

    @override
    def get_structured_output(self) -> Dict[str, Any]:
        """Creates the json schema to be passed to the LLM, to enforce generating
        a dictionary with the output which can be directly parsed as a python dictionary.

        Returns:
            JSON Schema of the response to enforce.
        """
        if self.n > 1:

            class MultiLabelSchema(BaseModel):
                labels: List[str]

            return MultiLabelSchema.model_json_schema()

        class SingleLabelSchema(BaseModel):
            labels: str

        return SingleLabelSchema.model_json_schema()

    def _format_structured_output(
        self, output: str
    ) -> Dict[str, Union[str, List[str]]]:
        """Parses the structured response, which should correspond to a dictionary
        with the `labels`, and either a string or a list of strings with the labels.

        Args:
            output: The output from the `LLM`.

        Returns:
            Formatted output.
        """
        try:
            return orjson.loads(output)
        except orjson.JSONDecodeError:
            if self.n > 1:
                return {"labels": [None for _ in range(self.n)]}
            return {"labels": None}
inputs property

任务的输入是 text。

outputs property

该任务的输出是 labels 和 model_name。

_get_available_labels_message()

准备要显示的消息,具体取决于可用标签(如果有)以及标签是否具有特定上下文。

源代码位于 src/distilabel/steps/tasks/text_classification.py
def _get_available_labels_message(self) -> str:
    """Prepares the message to display depending on the available labels (if any),
    and whether the labels have a specific context.
    """
    if self.available_labels is None:
        return (
            "Use clear, widely understood terms for labels."
            "Avoid overly specific or obscure labels unless the text demands it."
        )

    msg = (
        "## Labeling the user input\n"
        "Use the available labels to classify the user query{label_context}:\n"
        "available_labels = {available_labels}"
    )
    if isinstance(self.available_labels, list):
        specific_msg = (
            "[\n"
            + indent(
                "".join([f'"{label}",\n' for label in self.available_labels]),
                prefix=" " * 4,
            )
            + "]"
        )
        return msg.format(label_context="", available_labels=specific_msg)

    elif isinstance(self.available_labels, dict):
        specific_msg = ""
        for label, description in self.available_labels.items():
            specific_msg += indent(
                f'"{label}",  # {description}' + "\n", prefix=" " * 4
            )

        specific_msg = "[\n" + specific_msg + "]"
        return msg.format(
            label_context=". Analyze the context of each label specifically",
            available_labels=specific_msg,
        )
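
下面是一个最小示意(并非 distilabel 源码,标签内容纯属假设),说明 available_labels 为字典时,提示中会为每个标签附上说明;为列表时则只列出标签名:

from textwrap import indent

available_labels = {"positive": "正面情感", "negative": "负面情感"}  # 假设的标签及说明
specific_msg = ""
for label, description in available_labels.items():
    specific_msg += indent(f'"{label}",  # {description}' + "\n", prefix=" " * 4)
print("[\n" + specific_msg + "]")
# [
#     "positive",  # 正面情感
#     "negative",  # 负面情感
# ]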
_get_examples_message()

准备要显示的消息,具体取决于提供的示例。

源代码位于 src/distilabel/steps/tasks/text_classification.py
def _get_examples_message(self) -> str:
    """Prepares the message to display depending on the examples provided."""
    if self.examples is None:
        return ""

    examples_msg = "\n".join([f"- {ex}" for ex in self.examples])

    return (
        "\n## Examples\n"
        "Here are some examples to help you understand the task:\n"
        f"{examples_msg}"
    )
format_input(input)

输入被格式化为 ChatType,假设指令是用户在对话中的首次互动。

源代码位于 src/distilabel/steps/tasks/text_classification.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    messages = [
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                context=f"\n{self.context}",
                labels_message=self._labels_message,
                available_labels=self._available_labels_message,
                examples=self._examples,
                default_label=self.default_label,
                labels_format=self._labels_format,
                query_title=self.query_title,
                text=input["text"],
            ),
        },
    ]
    if self.system_prompt:
        messages.insert(0, {"role": "system", "content": self.system_prompt})
    return messages
format_output(output, input=None)

输出格式化为包含 labels 的字典。model_name 将自动包含在 Task 的 process 方法中。

源代码位于 src/distilabel/steps/tasks/text_classification.py
def format_output(
    self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `generation`. The `model_name`
    will be automatically included within the `process` method of `Task`."""
    return self._format_structured_output(output)
get_structured_output()

创建要传递给 LLM 的 json 模式,以强制生成一个字典,该字典的输出可以直接解析为 python 字典。

返回

类型 描述
Dict[str, Any]

强制执行的响应的 JSON 模式。

源代码位于 src/distilabel/steps/tasks/text_classification.py
@override
def get_structured_output(self) -> Dict[str, Any]:
    """Creates the json schema to be passed to the LLM, to enforce generating
    a dictionary with the output which can be directly parsed as a python dictionary.

    Returns:
        JSON Schema of the response to enforce.
    """
    if self.n > 1:

        class MultiLabelSchema(BaseModel):
            labels: List[str]

        return MultiLabelSchema.model_json_schema()

    class SingleLabelSchema(BaseModel):
        labels: str

    return SingleLabelSchema.model_json_schema()
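
作为补充,下面用一个最小示意(假设使用 pydantic v2,与上面的实现方式相同)展示 n > 1 时强制的 JSON 模式大致形态;单标签情形只需把 labels 声明为 str:

from typing import List

from pydantic import BaseModel

class MultiLabelSchema(BaseModel):
    labels: List[str]

print(MultiLabelSchema.model_json_schema())
# 输出大致为:
# {'properties': {'labels': {'items': {'type': 'string'}, 'title': 'Labels', 'type': 'array'}},
#  'required': ['labels'], 'title': 'MultiLabelSchema', 'type': 'object'}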
_format_structured_output(output)

解析结构化响应,该响应应对应于包含 labels 的字典,以及包含标签的字符串或字符串列表。

参数

名称 类型 描述 默认值
output str

来自 LLM 的输出。

必需

返回

类型 描述
Dict[str, Union[str, List[str]]]

格式化后的输出。

源代码位于 src/distilabel/steps/tasks/text_classification.py
def _format_structured_output(
    self, output: str
) -> Dict[str, Union[str, List[str]]]:
    """Parses the structured response, which should correspond to a dictionary
    with the `labels`, and either a string or a list of strings with the labels.

    Args:
        output: The output from the `LLM`.

    Returns:
        Formatted output.
    """
    try:
        return orjson.loads(output)
    except orjson.JSONDecodeError:
        if self.n > 1:
            return {"labels": [None for _ in range(self.n)]}
        return {"labels": None}

ChatGeneration

基类:Task

根据对话生成文本。

ChatGeneration 是一个预定义的任务,它将 messages 定义为输入,并将 generation 定义为输出。此任务用于根据对话生成文本。model_name 也作为输出的一部分返回,以增强输出。

输入列
  • messages (List[Dict[Literal["role", "content"], str]]): 用于生成后续补全的消息。
输出列
  • generation (str): 助手生成的文本。
  • model_name (str): 用于生成文本的模型名称。
类别
  • chat-generation
图标

:material-chat:

示例

从 OpenAI 聊天格式的对话生成文本

from distilabel.steps.tasks import ChatGeneration
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
chat = ChatGeneration(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    )
)

chat.load()

result = next(
    chat.process(
        [
            {
                "messages": [
                    {"role": "user", "content": "How much is 2+2?"},
                ]
            }
        ]
    )
)
# result
# [
#     {
#         'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#         'generation': '4',
#     }
# ]
源代码位于 src/distilabel/steps/tasks/text_generation.py
class ChatGeneration(Task):
    """Generates text based on a conversation.

    `ChatGeneration` is a pre-defined task that defines the `messages` as the input
    and `generation` as the output. This task is used to generate text based on a conversation.
    The `model_name` is also returned as part of the output in order to enhance it.

    Input columns:
        - messages (`List[Dict[Literal["role", "content"], str]]`): The messages to generate the
            follow up completion from.

    Output columns:
        - generation (`str`): The generated text from the assistant.
        - model_name (`str`): The model name used to generate the text.

    Categories:
        - chat-generation

    Icon:
        `:material-chat:`

    Examples:
        Generate text from a conversation in OpenAI chat format:

        ```python
        from distilabel.steps.tasks import ChatGeneration
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        chat = ChatGeneration(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            )
        )

        chat.load()

        result = next(
            chat.process(
                [
                    {
                        "messages": [
                            {"role": "user", "content": "How much is 2+2?"},
                        ]
                    }
                ]
            )
        )
        # result
        # [
        #     {
        #         'messages': [{'role': 'user', 'content': 'How much is 2+2?'}],
        #         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
        #         'generation': '4',
        #     }
        # ]
        ```
    """

    @property
    def inputs(self) -> List[str]:
        """The input for the task are the `messages`."""
        return ["messages"]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the messages provided
        are already formatted that way i.e. following the OpenAI chat format."""

        if not is_openai_format(input["messages"]):
            raise DistilabelUserError(
                "Input `messages` must be an OpenAI chat-like format conversation. "
                f"Got: {input['messages']}. Please check: 'https://openaicookbook.cn/examples/how_to_format_inputs_to_chatgpt_models'.",
                page="components-gallery/tasks/chatgeneration/",
            )

        if input["messages"][-1]["role"] != "user":
            raise DistilabelUserError(
                "The last message must be from the user. Please check: "
                "'https://openaicookbook.cn/examples/how_to_format_inputs_to_chatgpt_models'.",
                page="components-gallery/tasks/chatgeneration/",
            )

        return input["messages"]

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `generation` and the `model_name`."""
        return ["generation", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `generation`. The `model_name`
        will be automatically included within the `process` method of `Task`."""
        return {"generation": output}
inputs property

该任务的输入是 messages

outputs property

该任务的输出是 generationmodel_name

format_input(input)

输入格式化为 ChatType,假设提供的消息已经以这种方式格式化,即遵循 OpenAI 聊天格式。

源代码位于 src/distilabel/steps/tasks/text_generation.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the messages provided
    are already formatted that way i.e. following the OpenAI chat format."""

    if not is_openai_format(input["messages"]):
        raise DistilabelUserError(
            "Input `messages` must be an OpenAI chat-like format conversation. "
            f"Got: {input['messages']}. Please check: 'https://openaicookbook.cn/examples/how_to_format_inputs_to_chatgpt_models'.",
            page="components-gallery/tasks/chatgeneration/",
        )

    if input["messages"][-1]["role"] != "user":
        raise DistilabelUserError(
            "The last message must be from the user. Please check: "
            "'https://openaicookbook.cn/examples/how_to_format_inputs_to_chatgpt_models'.",
            page="components-gallery/tasks/chatgeneration/",
        )

    return input["messages"]
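
下面是一个最小示意(validate_messages 为假设的独立校验函数,并非库源码),概括了 ChatGeneration 对输入消息的两个基本要求:符合 OpenAI 聊天格式,且最后一条消息来自用户:

def validate_messages(messages: list) -> None:
    # 每条消息至少需要 role 与 content 两个字段(OpenAI 聊天格式的最低要求)
    for message in messages:
        if not {"role", "content"} <= set(message):
            raise ValueError(f"Invalid message: {message}")
    # 最后一条消息必须来自用户,否则无法生成后续补全
    if messages[-1]["role"] != "user":
        raise ValueError("The last message must be from the user.")

validate_messages([{"role": "user", "content": "How much is 2+2?"}])  # 校验通过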
format_output(output, input=None)

输出格式化为包含 generation 的字典。model_name 将自动包含在 Taskprocess 方法中。

源代码位于 src/distilabel/steps/tasks/text_generation.py
def format_output(
    self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `generation`. The `model_name`
    will be automatically included within the `process` method of `Task`."""
    return {"generation": output}

TextGeneration

基类:Task

使用 LLM 根据提示生成文本。

TextGeneration 是一个预定义的任务,允许使用 Jinja2 语法传递自定义提示。默认情况下,输入中需要 instruction,但使用 templatecolumns 属性可以定义自定义提示和文本中预期的列。此任务应足以满足不需要对 LLM 生成的响应进行后处理的任务。

属性

名称 类型 描述
system_prompt Union[str, None]

在生成中使用的系统提示。如果未提供,则将检查输入行是否有名为 system_prompt 的列并使用它。否则,将不使用系统提示。默认为 None

template str

用于生成的模板。它必须遵循 Jinja2 模板语法。如果未提供,它将假定传递的文本是指令并构造适当的模板。

columns Union[str, List[str]]

包含列名的字符串,或模板中预期的列的列表。有关更多信息,请查看示例。默认为 instruction。

use_system_prompt bool

DEPRECATED。将在 1.5.0 版本中移除。是否在生成中使用系统提示。默认为 True,这意味着如果在输入批次中定义了 system_prompt 列,则将使用 system_prompt,否则将被忽略。

输入列
  • 动态输入(由 columns 属性确定): 默认情况下将设置为 instruction。列可以指向要在模板中使用的 str 或 List[str]。
输出列
  • generation (str): 生成的文本。
  • model_name (str): 用于生成文本的模型名称。
类别
  • 文本生成
参考
  • Jinja2 Template Designer Documentation

示例

从指令生成文本

from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
text_gen = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    )
)

text_gen.load()

result = next(
    text_gen.process(
        [{"instruction": "your instruction"}]
    )
)
# result
# [
#     {
#         'instruction': 'your instruction',
#         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
#         'generation': 'generation',
#     }
# ]

使用自定义模板生成文本

from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM

CUSTOM_TEMPLATE = '''Document:
{{ document }}

Question: {{ question }}

Please provide a clear and concise answer to the question based on the information in the document and your general knowledge:
'''.rstrip()

text_gen = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    system_prompt="You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. If you cannot answer the question based on the given information, state that clearly.",
    template=CUSTOM_TEMPLATE,
    columns=["document", "question"],
)

text_gen.load()

result = next(
    text_gen.process(
        [
            {
                "document": "The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.",
                "question": "What is the main threat to the Great Barrier Reef mentioned in the document?"
            }
        ]
    )
)
# result
# [
#     {
#         'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',
#         'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',
#         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
#         'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',
#     }
# ]

使用不同系统提示的少量样本学习

from distilabel.steps.tasks import TextGeneration
from distilabel.models import InferenceEndpointsLLM

CUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:

{% for example in examples %}
Example {{ loop.index }}:
Instruction: {{ example }}

{% endfor %}
Now, generate a new instruction in a similar style:
'''.rstrip()

text_gen = TextGeneration(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    template=CUSTOM_TEMPLATE,
    columns="examples",
)

text_gen.load()

result = next(
    text_gen.process(
        [
            {
                "examples": ["This is an example", "Another relevant example"],
                "system_prompt": "You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations."
            }
        ]
    )
)
# result
# [
#     {
#         'examples': ['This is an example', 'Another relevant example'],
#         'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',
#         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
#         'generation': 'Disable the firewall on the router',
#     }
# ]
源代码位于 src/distilabel/steps/tasks/text_generation.py
class TextGeneration(Task):
    """Text generation with an `LLM` given a prompt.

    `TextGeneration` is a pre-defined task that allows passing a custom prompt using the
    Jinja2 syntax. By default, a `instruction` is expected in the inputs, but the using
    `template` and `columns` attributes one can define a custom prompt and columns expected
    from the text. This task should be good enough for tasks that don't need post-processing
    of the responses generated by the LLM.

    Attributes:
        system_prompt: The system prompt to use in the generation. If not provided, then
            it will check if the input row has a column named `system_prompt` and use it.
            If not, then no system prompt will be used. Defaults to `None`.
        template: The template to use for the generation. It must follow the Jinja2 template
            syntax. If not provided, it will assume the text passed is an instruction and
            construct the appropriate template.
        columns: A string with the column, or a list with columns expected in the template.
            Take a look at the examples for more information. Defaults to `instruction`.
        use_system_prompt: DEPRECATED. To be removed in 1.5.0. Whether to use the system
            prompt in the generation. Defaults to `True`, which means that if the column
            `system_prompt` is defined within the input batch, then the `system_prompt`
            will be used, otherwise, it will be ignored.

    Input columns:
        - dynamic (determined by `columns` attribute): By default will be set to `instruction`.
            The columns can point both to a `str` or a `List[str]` to be used in the template.

    Output columns:
        - generation (`str`): The generated text.
        - model_name (`str`): The name of the model used to generate the text.

    Categories:
        - text-generation

    References:
        - [Jinja2 Template Designer Documentation](https://jinja.flask.org.cn/en/3.1.x/templates/)

    Examples:
        Generate text from an instruction:

        ```python
        from distilabel.steps.tasks import TextGeneration
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        text_gen = TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            )
        )

        text_gen.load()

        result = next(
            text_gen.process(
                [{"instruction": "your instruction"}]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'your instruction',
        #         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
        #         'generation': 'generation',
        #     }
        # ]
        ```

        Use a custom template to generate text:

        ```python
        from distilabel.steps.tasks import TextGeneration
        from distilabel.models import InferenceEndpointsLLM

        CUSTOM_TEMPLATE = '''Document:
        {{ document }}

        Question: {{ question }}

        Please provide a clear and concise answer to the question based on the information in the document and your general knowledge:
        '''.rstrip()

        text_gen = TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            system_prompt="You are a helpful AI assistant. Your task is to answer the following question based on the provided document. If the answer is not explicitly stated in the document, use your knowledge to provide the most relevant and accurate answer possible. If you cannot answer the question based on the given information, state that clearly.",
            template=CUSTOM_TEMPLATE,
            columns=["document", "question"],
        )

        text_gen.load()

        result = next(
            text_gen.process(
                [
                    {
                        "document": "The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.",
                        "question": "What is the main threat to the Great Barrier Reef mentioned in the document?"
                    }
                ]
            )
        )
        # result
        # [
        #     {
        #         'document': 'The Great Barrier Reef, located off the coast of Australia, is the world's largest coral reef system. It stretches over 2,300 kilometers and is home to a diverse array of marine life, including over 1,500 species of fish. However, in recent years, the reef has faced significant challenges due to climate change, with rising sea temperatures causing coral bleaching events.',
        #         'question': 'What is the main threat to the Great Barrier Reef mentioned in the document?',
        #         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
        #         'generation': 'According to the document, the main threat to the Great Barrier Reef is climate change, specifically rising sea temperatures causing coral bleaching events.',
        #     }
        # ]
        ```

        Few shot learning with different system prompts:

        ```python
        from distilabel.steps.tasks import TextGeneration
        from distilabel.models import InferenceEndpointsLLM

        CUSTOM_TEMPLATE = '''Generate a clear, single-sentence instruction based on the following examples:

        {% for example in examples %}
        Example {{ loop.index }}:
        Instruction: {{ example }}

        {% endfor %}
        Now, generate a new instruction in a similar style:
        '''.rstrip()

        text_gen = TextGeneration(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            template=CUSTOM_TEMPLATE,
            columns="examples",
        )

        text_gen.load()

        result = next(
            text_gen.process(
                [
                    {
                        "examples": ["This is an example", "Another relevant example"],
                        "system_prompt": "You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations."
                    }
                ]
            )
        )
        # result
        # [
        #     {
        #         'examples': ['This is an example', 'Another relevant example'],
        #         'system_prompt': 'You are an AI assistant specialised in cybersecurity and computing in general, you make your point clear without any explanations.',
        #         'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct',
        #         'generation': 'Disable the firewall on the router',
        #     }
        # ]
        ```
    """

    system_prompt: Union[str, None] = None
    use_system_prompt: bool = Field(default=True, deprecated=True)
    template: str = Field(
        default="{{ instruction }}",
        description=(
            "This is a template or prompt to use for the generation. "
            "If not provided, it is assumed a `instruction` is placed in the inputs, "
            "to be used as is."
        ),
    )
    columns: Union[str, List[str]] = Field(
        default="instruction",
        description=(
            "Custom column or list of columns to include in the input. "
            "If a `template` is provided which needs custom column names, "
            "then they should be provided here. By default it will use `instruction`."
        ),
    )

    _can_be_used_with_offline_batch_generation = True
    _template: Optional["Template"] = PrivateAttr(default=...)

    def model_post_init(self, __context: Any) -> None:
        self.columns = [self.columns] if isinstance(self.columns, str) else self.columns
        super().model_post_init(__context)

    def load(self) -> None:
        super().load()

        for column in self.columns:
            check_column_in_template(column, self.template)

        self._template = Template(self.template)

    def unload(self) -> None:
        super().unload()
        self._template = None

    @property
    def inputs(self) -> "StepColumns":
        """The input for the task is the `instruction` by default, or the `columns` given as input."""
        columns = {column: True for column in self.columns}
        columns["system_prompt"] = False
        return columns

    def _prepare_message_content(self, input: Dict[str, Any]) -> "ChatType":
        """Prepares the content for the template and returns the formatted messages."""
        fields = {column: input[column] for column in self.columns}
        return [{"role": "user", "content": self._template.render(**fields)}]

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        # Handle the previous expected errors, in case of custom columns there's more freedom
        # and we cannot check it so easily.
        if self.columns == ["instruction"]:
            if is_openai_format(input["instruction"]):
                raise DistilabelUserError(
                    "Providing `instruction` formatted as an OpenAI chat / conversation is"
                    " deprecated, you should use `ChatGeneration` with `messages` as input instead.",
                    page="components-gallery/tasks/textgeneration/",
                )

            if not isinstance(input["instruction"], str):
                raise DistilabelUserError(
                    f"Input `instruction` must be a string. Got: {input['instruction']}.",
                    page="components-gallery/tasks/textgeneration/",
                )

        messages = self._prepare_message_content(input)

        row_system_prompt = input.get("system_prompt")
        if row_system_prompt:
            messages.insert(0, {"role": "system", "content": row_system_prompt})

        if self.system_prompt and not row_system_prompt:
            messages.insert(0, {"role": "system", "content": self.system_prompt})

        return messages  # type: ignore

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `generation` and the `model_name`."""
        return ["generation", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `generation`. The `model_name`
        will be automatically included within the `process` method of `Task`."""
        return {"generation": output}
inputs property

该任务的输入默认为 instruction,或作为输入给出的 columns

outputs property

该任务的输出是 generationmodel_name

_prepare_message_content(input)

准备模板的内容并返回格式化的消息。

源代码位于 src/distilabel/steps/tasks/text_generation.py
def _prepare_message_content(self, input: Dict[str, Any]) -> "ChatType":
    """Prepares the content for the template and returns the formatted messages."""
    fields = {column: input[column] for column in self.columns}
    return [{"role": "user", "content": self._template.render(**fields)}]
format_input(input)

输入被格式化为 ChatType,假设指令是用户在对话中的首次互动。

源代码位于 src/distilabel/steps/tasks/text_generation.py
def format_input(self, input: Dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    # Handle the previous expected errors, in case of custom columns there's more freedom
    # and we cannot check it so easily.
    if self.columns == ["instruction"]:
        if is_openai_format(input["instruction"]):
            raise DistilabelUserError(
                "Providing `instruction` formatted as an OpenAI chat / conversation is"
                " deprecated, you should use `ChatGeneration` with `messages` as input instead.",
                page="components-gallery/tasks/textgeneration/",
            )

        if not isinstance(input["instruction"], str):
            raise DistilabelUserError(
                f"Input `instruction` must be a string. Got: {input['instruction']}.",
                page="components-gallery/tasks/textgeneration/",
            )

    messages = self._prepare_message_content(input)

    row_system_prompt = input.get("system_prompt")
    if row_system_prompt:
        messages.insert(0, {"role": "system", "content": row_system_prompt})

    if self.system_prompt and not row_system_prompt:
        messages.insert(0, {"role": "system", "content": self.system_prompt})

    return messages  # type: ignore
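
下面的最小示意(build_messages 为假设的辅助函数,并非库源码)概括了上面 format_input 中系统提示的优先级:行内的 system_prompt 列优先于任务级的 system_prompt 属性:

def build_messages(user_content, task_system_prompt=None, row_system_prompt=None):
    messages = [{"role": "user", "content": user_content}]
    system_prompt = row_system_prompt or task_system_prompt  # 行级优先
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})
    return messages

print(build_messages("hi", task_system_prompt="A", row_system_prompt="B")[0])
# {'role': 'system', 'content': 'B'}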
format_output(output, input=None)

输出格式化为包含 generation 的字典。model_name 将自动包含在 Taskprocess 方法中。

源代码位于 src/distilabel/steps/tasks/text_generation.py
def format_output(
    self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `generation`. The `model_name`
    will be automatically included within the `process` method of `Task`."""
    return {"generation": output}

TextGenerationWithImage

基类: TextGeneration

使用 LLM 根据提示和图像生成文本。

`TextGenerationWithImage` is a pre-defined task that allows passing a custom prompt using the
Jinja2 syntax. By default, a `instruction` is expected in the inputs, but the using
`template` and `columns` attributes one can define a custom prompt and columns expected
from the text. Additionally, an `image` column is expected containing one of the
url, base64 encoded image or PIL image. This task inherits from `TextGeneration`,
so all the functionality available in that task related to the prompt will be available
here too.

Attributes:
    system_prompt: The system prompt to use in the generation.
        If not, then no system prompt will be used. Defaults to `None`.
    template: The template to use for the generation. It must follow the Jinja2 template
        syntax. If not provided, it will assume the text passed is an instruction and
        construct the appropriate template.
    columns: A string with the column, or a list with columns expected in the template.
        Take a look at the examples for more information. Defaults to `instruction`.
    image_type: The type of the image provided, this will be used to preprocess if necessary.
        Must be one of "url", "base64" or "PIL".

Input columns:
    - dynamic (determined by `columns` attribute): By default will be set to `instruction`.
        The columns can point both to a `str` or a `list[str]` to be used in the template.
    - image: The column containing the image URL, base64 encoded image or PIL image.

Output columns:
    - generation (`str`): The generated text.
    - model_name (`str`): The name of the model used to generate the text.

Categories:
    - text-generation

References:
    - [Jinja2 Template Designer Documentation](https://jinja.flask.org.cn/en/3.1.x/templates/)
    - [Image-Text-to-Text](https://hugging-face.cn/tasks/image-text-to-text)
    - [OpenAI Vision](https://platform.openai.com/docs/guides/vision)

Examples:
    Answer questions from an image:

    ```python
    from distilabel.steps.tasks import TextGenerationWithImage
    from distilabel.models.llms import InferenceEndpointsLLM

    vision = TextGenerationWithImage(
        name="vision_gen",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
        ),
        image_type="url"
    )

    vision.load()

    result = next(
        vision.process(
            [
                {
                    "instruction": "What’s in this image?",
                    "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            ]
        )
    )
    # result
    # [
    #     {
    #         "instruction": "What’s in this image?",
    #         "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
    #         "generation": "Based on the visual cues in the image...",
    #         "model_name": "meta-llama/Llama-3.2-11B-Vision-Instruct"
    #         ... # distilabel_metadata would be here
    #     }
    # ]
    # result[0]["generation"]
    # "Based on the visual cues in the image, here are some possible story points:\n\n* The image features a wooden boardwalk leading through a lush grass field, possibly in a park or nature reserve.\n\nAnalysis and Ideas:\n* The abundance of green grass and trees suggests a healthy ecosystem or habitat.\n* The presence of wildlife, such as birds or deer, is possible based on the surroundings.\n* A footbridge or a pathway might be a common feature in this area, providing access to nearby attractions or points of interest.\n\nAdditional Questions to Ask:\n* Why is a footbridge present in this area?\n* What kind of wildlife inhabits this region"
    ```

Answer questions from an image stored as base64:

```python
# For this example we will assume that we have the string representation of the image
# stored, but will just take the image and transform it to base64 to ilustrate the example.
import requests
import base64

image_url ="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
img = requests.get(image_url).content
base64_image = base64.b64encode(img).decode("utf-8")

from distilabel.steps.tasks import TextGenerationWithImage
from distilabel.models.llms import InferenceEndpointsLLM

vision = TextGenerationWithImage(
    name="vision_gen",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
    ),
    image_type="base64"
)

vision.load()

result = next(
    vision.process(
        [
            {
                "instruction": "What’s in this image?",
                "image": base64_image
            }
        ]
    )
)
```

源代码位于 src/distilabel/steps/tasks/text_generation_with_image.py
class TextGenerationWithImage(TextGeneration):
    """Text generation with images with an `LLM` given a prompt.

    `TextGenerationWithImage` is a pre-defined task that allows passing a custom prompt using the
    Jinja2 syntax. By default, a `instruction` is expected in the inputs, but the using
    `template` and `columns` attributes one can define a custom prompt and columns expected
    from the text. Additionally, an `image` column is expected containing one of the
    url, base64 encoded image or PIL image. This task inherits from `TextGeneration`,
    so all the functionality available in that task related to the prompt will be available
    here too.

    Attributes:
        system_prompt: The system prompt to use in the generation.
            If not, then no system prompt will be used. Defaults to `None`.
        template: The template to use for the generation. It must follow the Jinja2 template
            syntax. If not provided, it will assume the text passed is an instruction and
            construct the appropriate template.
        columns: A string with the column, or a list with columns expected in the template.
            Take a look at the examples for more information. Defaults to `instruction`.
        image_type: The type of the image provided, this will be used to preprocess if necessary.
            Must be one of "url", "base64" or "PIL".

    Input columns:
        - dynamic (determined by `columns` attribute): By default will be set to `instruction`.
            The columns can point both to a `str` or a `list[str]` to be used in the template.
        - image: The column containing the image URL, base64 encoded image or PIL image.

    Output columns:
        - generation (`str`): The generated text.
        - model_name (`str`): The name of the model used to generate the text.

    Categories:
        - text-generation

    References:
        - [Jinja2 Template Designer Documentation](https://jinja.flask.org.cn/en/3.1.x/templates/)
        - [Image-Text-to-Text](https://hugging-face.cn/tasks/image-text-to-text)
        - [OpenAI Vision](https://platform.openai.com/docs/guides/vision)

    Examples:
        Answer questions from an image:

        ```python
        from distilabel.steps.tasks import TextGenerationWithImage
        from distilabel.models.llms import InferenceEndpointsLLM

        vision = TextGenerationWithImage(
            name="vision_gen",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
            ),
            image_type="url"
        )

        vision.load()

        result = next(
            vision.process(
                [
                    {
                        "instruction": "What’s in this image?",
                        "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    }
                ]
            )
        )
        # result
        # [
        #     {
        #         "instruction": "What\u2019s in this image?",
        #         "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
        #         "generation": "Based on the visual cues in the image...",
        #         "model_name": "meta-llama/Llama-3.2-11B-Vision-Instruct"
        #         ... # distilabel_metadata would be here
        #     }
        # ]
        # result[0]["generation"]
        # "Based on the visual cues in the image, here are some possible story points:\n\n* The image features a wooden boardwalk leading through a lush grass field, possibly in a park or nature reserve.\n\nAnalysis and Ideas:\n* The abundance of green grass and trees suggests a healthy ecosystem or habitat.\n* The presence of wildlife, such as birds or deer, is possible based on the surroundings.\n* A footbridge or a pathway might be a common feature in this area, providing access to nearby attractions or points of interest.\n\nAdditional Questions to Ask:\n* Why is a footbridge present in this area?\n* What kind of wildlife inhabits this region"
        ```

        Answer questions from an image stored as base64:

        ```python
        # For this example we will assume that we have the string representation of the image
        # stored, but will just take the image and transform it to base64 to ilustrate the example.
        import requests
        import base64

        image_url ="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
        img = requests.get(image_url).content
        base64_image = base64.b64encode(img).decode("utf-8")

        from distilabel.steps.tasks import TextGenerationWithImage
        from distilabel.models.llms import InferenceEndpointsLLM

        vision = TextGenerationWithImage(
            name="vision_gen",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
            ),
            image_type="base64"
        )

        vision.load()

        result = next(
            vision.process(
                [
                    {
                        "instruction": "What’s in this image?",
                        "image": base64_image
                    }
                ]
            )
        )
        ```
    """

    image_type: Literal["url", "base64", "PIL"] = Field(
        default="url",
        description="The type of the image provided, this will be used to preprocess if necessary.",
    )

    @property
    def inputs(self) -> "StepColumns":
        columns = super().inputs
        columns["image"] = True
        return columns

    def load(self) -> None:
        Task.load(self)

        for column in self.columns:
            check_column_in_template(
                column, self.template, page="components-gallery/tasks/visiongeneration/"
            )

        self._template = Template(self.template)

    def _transform_image(self, image: Union[str, "Image"]) -> str:
        """Transforms the image based on the `image_type` attribute."""
        if self.image_type == "url":
            return image

        if self.image_type == "base64":
            return f"data:image/jpeg;base64,{image}"

        # Otherwise, it's a PIL image
        return f"data:image/jpeg;base64,{image_to_str(image)}"

    def _prepare_message_content(self, input: dict[str, Any]) -> "ChatType":
        """Prepares the content for the template and returns the formatted messages."""
        fields = {column: input[column] for column in self.columns}
        img_url = self._transform_image(input["image"])
        return [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": self._template.render(**fields),
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": img_url,
                        },
                    },
                ],
            }
        ]

    def format_input(self, input: dict[str, Any]) -> "ChatType":
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        messages = self._prepare_message_content(input)

        if self.system_prompt:
            messages.insert(0, {"role": "system", "content": self.system_prompt})

        return messages  # type: ignore
_transform_image(image)

根据 image_type 属性转换图像。

源代码位于 src/distilabel/steps/tasks/text_generation_with_image.py
def _transform_image(self, image: Union[str, "Image"]) -> str:
    """Transforms the image based on the `image_type` attribute."""
    if self.image_type == "url":
        return image

    if self.image_type == "base64":
        return f"data:image/jpeg;base64,{image}"

    # Otherwise, it's a PIL image
    return f"data:image/jpeg;base64,{image_to_str(image)}"
_prepare_message_content(input)

准备模板的内容并返回格式化的消息。

源代码位于 src/distilabel/steps/tasks/text_generation_with_image.py
def _prepare_message_content(self, input: dict[str, Any]) -> "ChatType":
    """Prepares the content for the template and returns the formatted messages."""
    fields = {column: input[column] for column in self.columns}
    img_url = self._transform_image(input["image"])
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": self._template.render(**fields),
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": img_url,
                    },
                },
            ],
        }
    ]
format_input(input)

输入被格式化为 ChatType,假设指令是用户在对话中的首次互动。

源代码位于 src/distilabel/steps/tasks/text_generation_with_image.py
def format_input(self, input: dict[str, Any]) -> "ChatType":
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    messages = self._prepare_message_content(input)

    if self.system_prompt:
        messages.insert(0, {"role": "system", "content": self.system_prompt})

    return messages  # type: ignore

UltraFeedback

基类:Task

使用 LLM 对关注不同方面的生成结果进行排序。

UltraFeedback:通过高质量反馈提升语言模型。

属性

名称 类型 描述
aspect Literal['helpfulness', 'honesty', 'instruction-following', 'truthfulness', 'overall-rating']

使用 UltraFeedback 模型执行的方面。可用方面包括:- helpfulness:根据 helpfulness 评估文本输出。- honesty:根据 honesty 评估文本输出。- instruction-following:根据给定的指令评估文本输出。- truthfulness:根据 truthfulness 评估文本输出。此外,Argilla 定义了一个自定义方面,以便在单个提示中评估文本输出的总体评估。自定义方面是:- overall-rating:根据总体评估评估文本输出。默认为 "overall-rating"

输入列
  • instruction (str): 用于评估文本输出的参考指令。
  • generations (List[str]): 要针对给定指令评估的文本输出。
输出列
  • ratings (List[float]): 每个提供的文本输出的评分。
  • rationales (List[str]): 每个提供的文本输出的理由。
  • model_name (str): 用于生成评分和理由的模型名称。
类别
  • preference
参考
  • UltraFeedback: Boosting Language Models with High-quality Feedback
  • UltraFeedback - GitHub Repository

示例

根据所选方面对来自不同 LLM 的生成结果进行评分

from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    use_default_structured_output=False
)

ultrafeedback.load()

result = next(
    ultrafeedback.process(
        [
            {
                "instruction": "How much is 2+2?",
                "generations": ["4", "and a car"],
            }
        ]
    )
)
# result
# [
#     {
#         'instruction': 'How much is 2+2?',
#         'generations': ['4', 'and a car'],
#         'ratings': [1, 2],
#         'rationales': ['explanation for 4', 'explanation for and a car'],
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#     }
# ]

使用默认结构化输出,根据 honesty 对来自不同 LLM 的生成结果进行评分

from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    aspect="honesty"
)

ultrafeedback.load()

result = next(
    ultrafeedback.process(
        [
            {
                "instruction": "How much is 2+2?",
                "generations": ["4", "and a car"],
            }
        ]
    )
)
# result
# [{'instruction': 'How much is 2+2?',
# 'generations': ['4', 'and a car'],
# 'ratings': [5, 1],
# 'rationales': ['The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.',
# "The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer."],
# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{"ratings": [\n    5,\n    1\n] \n\n,"rationales": [\n    "The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.",\n    "The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer."\n] }'},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]

使用默认结构化输出,根据 helpfulness 对来自不同 LLM 的生成结果进行评分

from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512},
    ),
    aspect="helpfulness"
)

ultrafeedback.load()

result = next(
    ultrafeedback.process(
        [
            {
                "instruction": "How much is 2+2?",
                "generations": ["4", "and a car"],
            }
        ]
    )
)
# result
# [{'instruction': 'How much is 2+2?',
#   'generations': ['4', 'and a car'],
#   'ratings': [1, 5],
#   'rationales': ['Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.',
#    'Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.'],
#   'rationales_for_rating': ['Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.',
#    'Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.'],
#   'types': [1, 3, 1],
#   'distilabel_metadata': {'raw_output_ultra_feedback_0': '{ \n  "ratings": [\n    1,\n    5\n  ]\n ,\n  "rationales": [\n    "Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.",\n    "Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question."\n  ]\n ,\n  "rationales_for_rating": [\n    "Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.",\n    "Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question."\n  ]\n ,\n  "types": [\n    1, 3,\n    1\n  ]\n  }'},
#   'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
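
在构建偏好数据集时,通常会把评分最高的生成作为 chosen、评分最低的作为 rejected。下面是一个最小示意(to_preference_pair 为假设的后处理函数,并非库 API):

def to_preference_pair(row: dict) -> dict:
    scored = sorted(zip(row["generations"], row["ratings"]), key=lambda pair: pair[1])
    return {
        "prompt": row["instruction"],
        "rejected": scored[0][0],   # 评分最低的生成
        "chosen": scored[-1][0],    # 评分最高的生成
    }

row = {"instruction": "How much is 2+2?", "generations": ["4", "and a car"], "ratings": [5, 1]}
print(to_preference_pair(row))
# {'prompt': 'How much is 2+2?', 'rejected': 'and a car', 'chosen': '4'}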
引用
@misc{cui2024ultrafeedbackboostinglanguagemodels,
    title={UltraFeedback: Boosting Language Models with Scaled AI Feedback},
    author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Bingxiang He and Wei Zhu and Yuan Ni and Guotong Xie and Ruobing Xie and Yankai Lin and Zhiyuan Liu and Maosong Sun},
    year={2024},
    eprint={2310.01377},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2310.01377},
}
源代码位于 src/distilabel/steps/tasks/ultrafeedback.py
class UltraFeedback(Task):
    """Rank generations focusing on different aspects using an `LLM`.

    UltraFeedback: Boosting Language Models with High-quality Feedback.

    Attributes:
        aspect: The aspect to perform with the `UltraFeedback` model. The available aspects are:
            - `helpfulness`: Evaluate text outputs based on helpfulness.
            - `honesty`: Evaluate text outputs based on honesty.
            - `instruction-following`: Evaluate text outputs based on given instructions.
            - `truthfulness`: Evaluate text outputs based on truthfulness.
            Additionally, a custom aspect has been defined by Argilla, so as to evaluate the overall
            assessment of the text outputs within a single prompt. The custom aspect is:
            - `overall-rating`: Evaluate text outputs based on an overall assessment.
            Defaults to `"overall-rating"`.

    Input columns:
        - instruction (`str`): The reference instruction to evaluate the text outputs.
        - generations (`List[str]`): The text outputs to evaluate for the given instruction.

    Output columns:
        - ratings (`List[float]`): The ratings for each of the provided text outputs.
        - rationales (`List[str]`): The rationales for each of the provided text outputs.
        - model_name (`str`): The name of the model used to generate the ratings and rationales.

    Categories:
        - preference

    References:
        - [`UltraFeedback: Boosting Language Models with High-quality Feedback`](https://arxiv.org/abs/2310.01377)
        - [`UltraFeedback - GitHub Repository`](https://github.com/OpenBMB/UltraFeedback)

    Examples:
        Rate generations from different LLMs based on the selected aspect:

        ```python
        from distilabel.steps.tasks import UltraFeedback
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        ultrafeedback = UltraFeedback(
            llm=InferenceEndpointsLLM(
                model_id="mistralai/Mistral-7B-Instruct-v0.2",
            ),
            use_default_structured_output=False
        )

        ultrafeedback.load()

        result = next(
            ultrafeedback.process(
                [
                    {
                        "instruction": "How much is 2+2?",
                        "generations": ["4", "and a car"],
                    }
                ]
            )
        )
        # result
        # [
        #     {
        #         'instruction': 'How much is 2+2?',
        #         'generations': ['4', 'and a car'],
        #         'ratings': [1, 2],
        #         'rationales': ['explanation for 4', 'explanation for and a car'],
        #         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
        #     }
        # ]
        ```

        Rate generations from different LLMs based on the honesty, using the default structured output:

        ```python
        from distilabel.steps.tasks import UltraFeedback
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        ultrafeedback = UltraFeedback(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            ),
            aspect="honesty"
        )

        ultrafeedback.load()

        result = next(
            ultrafeedback.process(
                [
                    {
                        "instruction": "How much is 2+2?",
                        "generations": ["4", "and a car"],
                    }
                ]
            )
        )
        # result
        # [{'instruction': 'How much is 2+2?',
        # 'generations': ['4', 'and a car'],
        # 'ratings': [5, 1],
        # 'rationales': ['The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.',
        # "The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer."],
        # 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{"ratings": [\\n    5,\\n    1\\n] \\n\\n,"rationales": [\\n    "The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.",\\n    "The response is confidently incorrect, as it provides unrelated information (\'a car\') and does not address the question. The model shows no uncertainty or indication that it does not know the answer."\\n] }'},
        # 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
        ```

        Rate generations from different LLMs based on the helpfulness, using the default structured output:

        ```python
        from distilabel.steps.tasks import UltraFeedback
        from distilabel.models import InferenceEndpointsLLM

        # Consider this as a placeholder for your actual LLM.
        ultrafeedback = UltraFeedback(
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                generation_kwargs={"max_new_tokens": 512},
            ),
            aspect="helpfulness"
        )

        ultrafeedback.load()

        result = next(
            ultrafeedback.process(
                [
                    {
                        "instruction": "How much is 2+2?",
                        "generations": ["4", "and a car"],
                    }
                ]
            )
        )
        # result
        # [{'instruction': 'How much is 2+2?',
        #   'generations': ['4', 'and a car'],
        #   'ratings': [1, 5],
        #   'rationales': ['Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.',
        #    'Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.'],
        #   'rationales_for_rating': ['Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.',
        #    'Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.'],
        #   'types': [1, 3, 1],
        #   'distilabel_metadata': {'raw_output_ultra_feedback_0': '{ \\n  "ratings": [\\n    1,\\n    5\\n  ]\\n ,\\n  "rationales": [\\n    "Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.",\\n    "Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question."\\n  ]\\n ,\\n  "rationales_for_rating": [\\n    "Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.",\\n    "Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question."\\n  ]\\n ,\\n  "types": [\\n    1, 3,\\n    1\\n  ]\\n  }'},
        #   'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
        ```

    Citations:
        ```
        @misc{cui2024ultrafeedbackboostinglanguagemodels,
            title={UltraFeedback: Boosting Language Models with Scaled AI Feedback},
            author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Bingxiang He and Wei Zhu and Yuan Ni and Guotong Xie and Ruobing Xie and Yankai Lin and Zhiyuan Liu and Maosong Sun},
            year={2024},
            eprint={2310.01377},
            archivePrefix={arXiv},
            primaryClass={cs.CL},
            url={https://arxiv.org/abs/2310.01377},
        }
        ```
    """

    aspect: Literal[
        "helpfulness",
        "honesty",
        "instruction-following",
        "truthfulness",
        # Custom aspects
        "overall-rating",
    ] = "overall-rating"

    _system_prompt: str = PrivateAttr(
        default=(
            "Your role is to evaluate text quality based on given criteria.\n"
            'You\'ll receive an instructional description ("Instruction") and {no_texts} text outputs ("Text").\n'
            "Understand and interpret instructions to evaluate effectively.\n"
            "Provide annotations for each text with a rating and rationale.\n"
            "The {no_texts} texts given are independent, and should be evaluated separately.\n"
        )
    )
    _template: Optional["Template"] = PrivateAttr(default=...)
    _can_be_used_with_offline_batch_generation = True

    def load(self) -> None:
        """Loads the Jinja2 template for the given `aspect`."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "ultrafeedback"
            / f"{self.aspect}.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> List[str]:
        """The input for the task is the `instruction`, and the `generations` for it."""
        return ["instruction", "generations"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        """The input is formatted as a `ChatType` assuming that the instruction
        is the first interaction from the user within a conversation."""
        return [
            {
                "role": "system",
                "content": self._system_prompt.format(
                    no_texts=len(input["generations"])
                ),
            },
            {
                "role": "user",
                "content": self._template.render(  # type: ignore
                    instruction=input["instruction"], generations=input["generations"]
                ),
            },
        ]

    @property
    def outputs(self) -> List[str]:
        """The output for the task is the `generation` and the `model_name`."""
        columns = []
        if self.aspect in ["honesty", "instruction-following", "overall-rating"]:
            columns = ["ratings", "rationales"]
        elif self.aspect in ["helpfulness", "truthfulness"]:
            columns = ["types", "rationales", "ratings", "rationales-for-ratings"]
        return columns + ["model_name"]

    def format_output(
        self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
    ) -> Dict[str, Any]:
        """The output is formatted as a dictionary with the `ratings` and `rationales` for
        each of the provided `generations` for the given `instruction`. The `model_name`
        will be automatically included within the `process` method of `Task`.

        Args:
            output: a string representing the output of the LLM via the `process` method.
            input: the input to the task, as required by some tasks to format the output.

        Returns:
            A dictionary containing either the `ratings` and `rationales` for each of the provided
            `generations` for the given `instruction` if the provided aspect is either `honesty`,
            `instruction-following`, or `overall-rating`; or the `types`, `rationales`,
            `ratings`, and `rationales-for-ratings` for each of the provided `generations` for the
            given `instruction` if the provided aspect is either `helpfulness` or `truthfulness`.
        """
        assert input is not None, "Input is required to format the output."

        if self.aspect in [
            "honesty",
            "instruction-following",
            "overall-rating",
        ]:
            return self._format_ratings_rationales_output(output, input)

        return self._format_types_ratings_rationales_output(output, input)

    def _format_ratings_rationales_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, List[Any]]:
        """Formats the output when the aspect is either `honesty`, `instruction-following`, or `overall-rating`."""
        if output is None:
            return {
                "ratings": [None] * len(input["generations"]),
                "rationales": [None] * len(input["generations"]),
            }

        if self.use_default_structured_output:
            return self._format_structured_output(output, input)

        pattern = r"Rating: (.+?)\nRationale: (.+)"
        sections = output.split("\n\n")

        formatted_outputs = []
        for section in sections:
            matches = None
            if section is not None and section != "":
                matches = re.search(pattern, section, re.DOTALL)
            if not matches:
                formatted_outputs.append({"ratings": None, "rationales": None})
                continue

            formatted_outputs.append(
                {
                    "ratings": (
                        int(re.findall(r"\b\d+\b", matches.group(1))[0])
                        if matches.group(1) not in ["None", "N/A"]
                        else None
                    ),
                    "rationales": matches.group(2),
                }
            )
        return group_dicts(*formatted_outputs)

    def _format_types_ratings_rationales_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, List[Any]]:
        """Formats the output when the aspect is either `helpfulness` or `truthfulness`."""
        if output is None:
            return {
                "types": [None] * len(input["generations"]),
                "rationales": [None] * len(input["generations"]),
                "ratings": [None] * len(input["generations"]),
                "rationales-for-ratings": [None] * len(input["generations"]),
            }

        if self.use_default_structured_output:
            return self._format_structured_output(output, input)

        pattern = r"Type: (.+?)\nRationale: (.+?)\nRating: (.+?)\nRationale: (.+)"

        sections = output.split("\n\n")

        formatted_outputs = []
        for section in sections:
            matches = None
            if section is not None and section != "":
                matches = re.search(pattern, section, re.DOTALL)
            if not matches:
                formatted_outputs.append(
                    {
                        "types": None,
                        "rationales": None,
                        "ratings": None,
                        "rationales-for-ratings": None,
                    }
                )
                continue

            formatted_outputs.append(
                {
                    "types": (
                        int(re.findall(r"\b\d+\b", matches.group(1))[0])
                        if matches.group(1) not in ["None", "N/A"]
                        else None
                    ),
                    "rationales": matches.group(2),
                    "ratings": (
                        int(re.findall(r"\b\d+\b", matches.group(3))[0])
                        if matches.group(3) not in ["None", "N/A"]
                        else None
                    ),
                    "rationales-for-ratings": matches.group(4),
                }
            )
        return group_dicts(*formatted_outputs)

    @override
    def get_structured_output(self) -> Dict[str, Any]:
        """Creates the json schema to be passed to the LLM, to enforce generating
        a dictionary with the output which can be directly parsed as a python dictionary.

        The schema corresponds to the following:

        ```python
        from pydantic import BaseModel
        from typing import List, Optional

        class SchemaUltraFeedback(BaseModel):
            ratings: List[int]
            rationales: List[str]

        class SchemaUltraFeedbackWithType(BaseModel):
            types: List[Optional[int]]
            ratings: List[int]
            rationales: List[str]
            rationales_for_rating: List[str]
        ```

        Returns:
            JSON Schema of the response to enforce.
        """
        if self.aspect in [
            "honesty",
            "instruction-following",
            "overall-rating",
        ]:
            return {
                "properties": {
                    "ratings": {
                        "items": {"type": "integer"},
                        "title": "Ratings",
                        "type": "array",
                    },
                    "rationales": {
                        "items": {"type": "string"},
                        "title": "Rationales",
                        "type": "array",
                    },
                },
                "required": ["ratings", "rationales"],
                "title": "SchemaUltraFeedback",
                "type": "object",
            }
        return {
            "properties": {
                "types": {
                    "items": {"anyOf": [{"type": "integer"}, {"type": "null"}]},
                    "title": "Types",
                    "type": "array",
                },
                "ratings": {
                    "items": {"type": "integer"},
                    "title": "Ratings",
                    "type": "array",
                },
                "rationales": {
                    "items": {"type": "string"},
                    "title": "Rationales",
                    "type": "array",
                },
                "rationales_for_rating": {
                    "items": {"type": "string"},
                    "title": "Rationales For Rating",
                    "type": "array",
                },
            },
            "required": ["types", "ratings", "rationales", "rationales_for_rating"],
            "title": "SchemaUltraFeedbackWithType",
            "type": "object",
        }

    def _format_structured_output(
        self, output: str, input: Dict[str, Any]
    ) -> Dict[str, Any]:
        """Parses the structured response, which should correspond to a dictionary
        with either `positive`, or `positive` and `negative` keys.

        Args:
            output: The output from the `LLM`.

        Returns:
            Formatted output.
        """
        try:
            return orjson.loads(output)
        except orjson.JSONDecodeError:
            if self.aspect in [
                "honesty",
                "instruction-following",
                "overall-rating",
            ]:
                return {
                    "ratings": [None] * len(input["generations"]),
                    "rationales": [None] * len(input["generations"]),
                }
            return {
                "ratings": [None] * len(input["generations"]),
                "rationales": [None] * len(input["generations"]),
                "types": [None] * len(input["generations"]),
                "rationales-for-ratings": [None] * len(input["generations"]),
            }

    @override
    def _sample_input(self) -> ChatType:
        return self.format_input(
            {
                "instruction": f"<PLACEHOLDER_{'instruction'.upper()}>",
                "generations": [
                    f"<PLACEHOLDER_{f'GENERATION_{i}'.upper()}>" for i in range(2)
                ],
            }
        )
inputs property

The inputs for the task are the `instruction` and the `generations` for it.

outputs property

The outputs for the task are the `ratings` and `rationales` (plus `types` and `rationales-for-ratings` for the `helpfulness` and `truthfulness` aspects), together with the `model_name`.

load()

Loads the Jinja2 template for the given `aspect`.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def load(self) -> None:
    """Loads the Jinja2 template for the given `aspect`."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "ultrafeedback"
        / f"{self.aspect}.jinja2"
    )

    self._template = Template(open(_path).read())
format_input(input)

The input is formatted as a `ChatType`, assuming that the instruction is the first interaction from the user within a conversation.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def format_input(self, input: Dict[str, Any]) -> ChatType:
    """The input is formatted as a `ChatType` assuming that the instruction
    is the first interaction from the user within a conversation."""
    return [
        {
            "role": "system",
            "content": self._system_prompt.format(
                no_texts=len(input["generations"])
            ),
        },
        {
            "role": "user",
            "content": self._template.render(  # type: ignore
                instruction=input["instruction"], generations=input["generations"]
            ),
        },
    ]
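
For illustration, here is a minimal sketch (an assumption, not literal library output) of the `ChatType` that `format_input` builds for two generations with the default `overall-rating` aspect; the actual user-message content comes from the aspect's Jinja2 template.

```python
# A sketch of the messages returned by
# format_input({"instruction": "How much is 2+2?", "generations": ["4", "and a car"]}):
example_chat = [
    {
        "role": "system",
        # The system prompt with {no_texts} replaced by the number of generations (2).
        "content": "Your role is to evaluate text quality based on given criteria. ...",
    },
    {
        "role": "user",
        # The overall-rating Jinja2 template rendered with the instruction and both generations.
        "content": "<rendered overall-rating template>",
    },
]
```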
format_output(output, input=None)

The output is formatted as a dictionary with the `ratings` and `rationales` for each of the provided `generations` for the given `instruction`. The `model_name` will be automatically included within the `process` method of `Task`.

Parameters

  • output (Union[str, None]): a string representing the output of the LLM via the `process` method. Required.
  • input (Union[Dict[str, Any], None]): the input to the task, as required by some tasks to format the output. Defaults to None.

Returns

  • Dict[str, Any]: a dictionary containing either the `ratings` and `rationales` for each of the provided `generations` for the given `instruction` if the provided aspect is `honesty`, `instruction-following`, or `overall-rating`; or the `types`, `rationales`, `ratings`, and `rationales-for-ratings` for each of the provided `generations` if the provided aspect is `helpfulness` or `truthfulness`.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def format_output(
    self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
) -> Dict[str, Any]:
    """The output is formatted as a dictionary with the `ratings` and `rationales` for
    each of the provided `generations` for the given `instruction`. The `model_name`
    will be automatically included within the `process` method of `Task`.

    Args:
        output: a string representing the output of the LLM via the `process` method.
        input: the input to the task, as required by some tasks to format the output.

    Returns:
        A dictionary containing either the `ratings` and `rationales` for each of the provided
        `generations` for the given `instruction` if the provided aspect is either `honesty`,
        `instruction-following`, or `overall-rating`; or the `types`, `rationales`,
        `ratings`, and `rationales-for-ratings` for each of the provided `generations` for the
        given `instruction` if the provided aspect is either `helpfulness` or `truthfulness`.
    """
    assert input is not None, "Input is required to format the output."

    if self.aspect in [
        "honesty",
        "instruction-following",
        "overall-rating",
    ]:
        return self._format_ratings_rationales_output(output, input)

    return self._format_types_ratings_rationales_output(output, input)
_format_ratings_rationales_output(output, input)

Formats the output when the aspect is `honesty`, `instruction-following`, or `overall-rating`.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def _format_ratings_rationales_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, List[Any]]:
    """Formats the output when the aspect is either `honesty`, `instruction-following`, or `overall-rating`."""
    if output is None:
        return {
            "ratings": [None] * len(input["generations"]),
            "rationales": [None] * len(input["generations"]),
        }

    if self.use_default_structured_output:
        return self._format_structured_output(output, input)

    pattern = r"Rating: (.+?)\nRationale: (.+)"
    sections = output.split("\n\n")

    formatted_outputs = []
    for section in sections:
        matches = None
        if section is not None and section != "":
            matches = re.search(pattern, section, re.DOTALL)
        if not matches:
            formatted_outputs.append({"ratings": None, "rationales": None})
            continue

        formatted_outputs.append(
            {
                "ratings": (
                    int(re.findall(r"\b\d+\b", matches.group(1))[0])
                    if matches.group(1) not in ["None", "N/A"]
                    else None
                ),
                "rationales": matches.group(2),
            }
        )
    return group_dicts(*formatted_outputs)
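
As a hedged illustration of the regex parsing above, a non-structured completion is expected to contain one `Rating:`/`Rationale:` block per generation, separated by blank lines; the raw text below is made up for the example.

```python
# Hypothetical raw completion in the format the pattern expects:
raw_output = (
    "Rating: 5\n"
    "Rationale: The answer is correct and directly addresses the question.\n\n"
    "Rating: 1\n"
    "Rationale: The answer is unrelated to the question."
)
# Splitting on "\n\n" yields one section per generation; after matching each
# section, group_dicts merges the per-generation dicts into
# {"ratings": [5, 1], "rationales": ["The answer is correct...", "The answer is unrelated..."]}
```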
_format_types_ratings_rationales_output(output, input)

Formats the output when the aspect is `helpfulness` or `truthfulness`.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def _format_types_ratings_rationales_output(
    self, output: Union[str, None], input: Dict[str, Any]
) -> Dict[str, List[Any]]:
    """Formats the output when the aspect is either `helpfulness` or `truthfulness`."""
    if output is None:
        return {
            "types": [None] * len(input["generations"]),
            "rationales": [None] * len(input["generations"]),
            "ratings": [None] * len(input["generations"]),
            "rationales-for-ratings": [None] * len(input["generations"]),
        }

    if self.use_default_structured_output:
        return self._format_structured_output(output, input)

    pattern = r"Type: (.+?)\nRationale: (.+?)\nRating: (.+?)\nRationale: (.+)"

    sections = output.split("\n\n")

    formatted_outputs = []
    for section in sections:
        matches = None
        if section is not None and section != "":
            matches = re.search(pattern, section, re.DOTALL)
        if not matches:
            formatted_outputs.append(
                {
                    "types": None,
                    "rationales": None,
                    "ratings": None,
                    "rationales-for-ratings": None,
                }
            )
            continue

        formatted_outputs.append(
            {
                "types": (
                    int(re.findall(r"\b\d+\b", matches.group(1))[0])
                    if matches.group(1) not in ["None", "N/A"]
                    else None
                ),
                "rationales": matches.group(2),
                "ratings": (
                    int(re.findall(r"\b\d+\b", matches.group(3))[0])
                    if matches.group(3) not in ["None", "N/A"]
                    else None
                ),
                "rationales-for-ratings": matches.group(4),
            }
        )
    return group_dicts(*formatted_outputs)
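
Analogously, for `helpfulness` and `truthfulness` each generation's block is expected to carry a type, a rationale, a rating, and a rationale for that rating; the snippet below is a made-up sketch of that layout.

```python
# Hypothetical raw completion matching
# "Type: ...\nRationale: ...\nRating: ...\nRationale: ...":
raw_output = (
    "Type: 1\n"
    "Rationale: The text directly answers the question.\n"
    "Rating: 5\n"
    "Rationale: Accurate and complete.\n\n"
    "Type: 3\n"
    "Rationale: The text is off-topic.\n"
    "Rating: 1\n"
    "Rationale: Severely incorrect."
)
# group_dicts merges the per-section dicts into parallel lists under
# "types", "rationales", "ratings" and "rationales-for-ratings".
```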
get_structured_output()

Creates the JSON schema to be passed to the LLM, to enforce generating a dictionary whose output can be directly parsed as a Python dictionary.

The schema corresponds to the following:

from pydantic import BaseModel
from typing import List, Optional

class SchemaUltraFeedback(BaseModel):
    ratings: List[int]
    rationales: List[str]

class SchemaUltraFeedbackWithType(BaseModel):
    types: List[Optional[int]]
    ratings: List[int]
    rationales: List[str]
    rationales_for_rating: List[str]

Returns

  • Dict[str, Any]: the JSON schema of the response to enforce.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
@override
def get_structured_output(self) -> Dict[str, Any]:
    """Creates the json schema to be passed to the LLM, to enforce generating
    a dictionary with the output which can be directly parsed as a python dictionary.

    The schema corresponds to the following:

    ```python
    from pydantic import BaseModel
    from typing import List, Optional

    class SchemaUltraFeedback(BaseModel):
        ratings: List[int]
        rationales: List[str]

    class SchemaUltraFeedbackWithType(BaseModel):
        types: List[Optional[int]]
        ratings: List[int]
        rationales: List[str]
        rationales_for_rating: List[str]
    ```

    Returns:
        JSON Schema of the response to enforce.
    """
    if self.aspect in [
        "honesty",
        "instruction-following",
        "overall-rating",
    ]:
        return {
            "properties": {
                "ratings": {
                    "items": {"type": "integer"},
                    "title": "Ratings",
                    "type": "array",
                },
                "rationales": {
                    "items": {"type": "string"},
                    "title": "Rationales",
                    "type": "array",
                },
            },
            "required": ["ratings", "rationales"],
            "title": "SchemaUltraFeedback",
            "type": "object",
        }
    return {
        "properties": {
            "types": {
                "items": {"anyOf": [{"type": "integer"}, {"type": "null"}]},
                "title": "Types",
                "type": "array",
            },
            "ratings": {
                "items": {"type": "integer"},
                "title": "Ratings",
                "type": "array",
            },
            "rationales": {
                "items": {"type": "string"},
                "title": "Rationales",
                "type": "array",
            },
            "rationales_for_rating": {
                "items": {"type": "string"},
                "title": "Rationales For Rating",
                "type": "array",
            },
        },
        "required": ["types", "ratings", "rationales", "rationales_for_rating"],
        "title": "SchemaUltraFeedbackWithType",
        "type": "object",
    }
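
As a hedged usage sketch, setting `use_default_structured_output=True` makes the task pass this schema to the LLM so that the completion can be parsed directly as JSON (the model id below is only a placeholder).

```python
from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import UltraFeedback

ultrafeedback = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    aspect="helpfulness",
    # Enforce the SchemaUltraFeedbackWithType JSON schema shown above.
    use_default_structured_output=True,
)
```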
_format_structured_output(output, input)

Parses the structured response, which should correspond to a dictionary with the `ratings` and `rationales` keys (plus `types` and `rationales_for_rating` for the `helpfulness` and `truthfulness` aspects).

Parameters

  • output (str): the output from the `LLM`. Required.

Returns

  • Dict[str, Any]: the formatted output.

Source code in src/distilabel/steps/tasks/ultrafeedback.py
def _format_structured_output(
    self, output: str, input: Dict[str, Any]
) -> Dict[str, Any]:
    """Parses the structured response, which should correspond to a dictionary
    with either `positive`, or `positive` and `negative` keys.

    Args:
        output: The output from the `LLM`.

    Returns:
        Formatted output.
    """
    try:
        return orjson.loads(output)
    except orjson.JSONDecodeError:
        if self.aspect in [
            "honesty",
            "instruction-following",
            "overall-rating",
        ]:
            return {
                "ratings": [None] * len(input["generations"]),
                "rationales": [None] * len(input["generations"]),
            }
        return {
            "ratings": [None] * len(input["generations"]),
            "rationales": [None] * len(input["generations"]),
            "types": [None] * len(input["generations"]),
            "rationales-for-ratings": [None] * len(input["generations"]),
        }
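
A small, self-contained illustration of the try/except above, using made-up strings rather than actual LLM output:

```python
import orjson

good = '{"ratings": [5, 1], "rationales": ["correct", "irrelevant"]}'
print(orjson.loads(good))  # {'ratings': [5, 1], 'rationales': ['correct', 'irrelevant']}

bad = '{"ratings": [5, 1'  # truncated JSON
try:
    orjson.loads(bad)
except orjson.JSONDecodeError:
    # The task falls back to lists of None, one entry per provided generation.
    print("fallback to None ratings/rationales")
```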

URIAL

Bases: Task

Generates a response using a non-instruct fine-tuned model.

URIAL is a pre-defined task that generates a response using a non-instruct fine-tuned model. This task is used to generate a response based on the conversation provided as input.

Input columns
  • instruction (str, optional): The instruction to generate a response from.
  • conversation (List[Dict[str, str]], optional): The conversation to generate a response from (the last message must be from the user).
Output columns
  • generation (str): The generated response.
  • model_name (str): The name of the model used to generate the response.
Categories
  • text-generation
References
  • [The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning](https://arxiv.org/abs/2312.01552)

Examples

Generate text from an instruction:

from distilabel.models import vLLM
from distilabel.steps.tasks import URIAL

step = URIAL(
    llm=vLLM(
        model="meta-llama/Meta-Llama-3.1-8B",
        generation_kwargs={"temperature": 0.7},
    ),
)

step.load()

results = next(
    step.process(inputs=[{"instruction": "What's the most most common type of cloud?"}])
)
# [
#     {
#         'instruction': "What's the most most common type of cloud?",
#         'generation': 'Clouds are classified into three main types, high, middle, and low. The most common type of cloud is the middle cloud.',
#         'distilabel_metadata': {...},
#         'model_name': 'meta-llama/Meta-Llama-3.1-8B'
#     }
# ]
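
A hedged sketch of the `conversation` input variant (the last message must come from the user; the conversation content is made up):

```python
from distilabel.models import vLLM
from distilabel.steps.tasks import URIAL

step = URIAL(
    llm=vLLM(
        model="meta-llama/Meta-Llama-3.1-8B",
        generation_kwargs={"temperature": 0.7},
    ),
)
step.load()

results = next(
    step.process(
        inputs=[
            {
                "conversation": [
                    {"role": "user", "content": "Name a type of cloud."},
                    {"role": "assistant", "content": "Cumulus."},
                    {"role": "user", "content": "And which one is the most common?"},
                ]
            }
        ]
    )
)
# Each row gains a `generation` column with the model's reply to the last user turn.
```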
Source code in src/distilabel/steps/tasks/urial.py
class URIAL(Task):
    """Generates a response using a non-instruct fine-tuned model.

    `URIAL` is a pre-defined task that generates a response using a non-instruct fine-tuned
    model. This task is used to generate a response based on the conversation provided as
    input.

    Input columns:
        - instruction (`str`, optional): The instruction to generate a response from.
        - conversation (`List[Dict[str, str]]`, optional): The conversation to generate
            a response from (the last message must be from the user).

    Output columns:
        - generation (`str`): The generated response.
        - model_name (`str`): The name of the model used to generate the response.

    Categories:
        - text-generation

    References:
        - [The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning](https://arxiv.org/abs/2312.01552)

    Examples:
        Generate text from an instruction:

        ```python
        from distilabel.models import vLLM
        from distilabel.steps.tasks import URIAL

        step = URIAL(
            llm=vLLM(
                model="meta-llama/Meta-Llama-3.1-8B",
                generation_kwargs={"temperature": 0.7},
            ),
        )

        step.load()

        results = next(
            step.process(inputs=[{"instruction": "What's the most most common type of cloud?"}])
        )
        # [
        #     {
        #         'instruction': "What's the most most common type of cloud?",
        #         'generation': 'Clouds are classified into three main types, high, middle, and low. The most common type of cloud is the middle cloud.',
        #         'distilabel_metadata': {...},
        #         'model_name': 'meta-llama/Meta-Llama-3.1-8B'
        #     }
        # ]
        ```
    """

    def load(self) -> None:
        """Loads the Jinja2 template for the given `aspect`."""
        super().load()

        _path = str(
            importlib_resources.files("distilabel")
            / "steps"
            / "tasks"
            / "templates"
            / "urial.jinja2"
        )

        self._template = Template(open(_path).read())

    @property
    def inputs(self) -> "StepColumns":
        return {"instruction": False, "conversation": False}

    def format_input(self, input: Dict[str, Any]) -> "ChatType":
        messages = (
            [{"role": "user", "content": input["instruction"]}]
            if "instruction" in input
            else input["conversation"]
        )

        if messages[-1]["role"] != "user":
            raise ValueError("The last message must be from the user.")

        return [{"role": "user", "content": self._template.render(messages=messages)}]

    @property
    def outputs(self) -> "StepColumns":
        return ["generation", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
    ) -> Dict[str, Any]:
        if output is None:
            return {"generation": None}

        response = output.split("\n\n# User")[0]
        if response.startswith("\n\n"):
            response = response[2:]
        response = response.strip()

        return {"generation": response}
load()

Loads the Jinja2 template for the task.

Source code in src/distilabel/steps/tasks/urial.py
def load(self) -> None:
    """Loads the Jinja2 template for the given `aspect`."""
    super().load()

    _path = str(
        importlib_resources.files("distilabel")
        / "steps"
        / "tasks"
        / "templates"
        / "urial.jinja2"
    )

    self._template = Template(open(_path).read())

task(inputs=None, outputs=None)

Creates a `Task` from a formatting output function.

Parameters

  • inputs (Union[StepColumns, None]): a list containing the names of the input columns/keys required by the step, or a dictionary where the keys are the columns and the values are booleans indicating whether the column is required. If not provided, the default will be an empty list [] and it will be assumed that the step doesn't need any specific columns. Defaults to None.
  • outputs (Union[StepColumns, None]): a list containing the names of the output columns/keys, or a dictionary where the keys are the columns and the values are booleans indicating whether the column will be generated. If not provided, the default will be an empty list []. Defaults to None.

Source code in src/distilabel/steps/tasks/decorator.py
def task(
    inputs: Union["StepColumns", None] = None,
    outputs: Union["StepColumns", None] = None,
) -> Callable[..., Type["Task"]]:
    """Creates a `Task` from a formatting output function.

    Args:
        inputs: a list containing the name of the inputs columns/keys or a dictionary
            where the keys are the columns and the values are booleans indicating whether
            the column is required or not, that are required by the step. If not provided
            the default will be an empty list `[]` and it will be assumed that the step
            doesn't need any specific columns. Defaults to `None`.
        outputs: a list containing the name of the outputs columns/keys or a dictionary
            where the keys are the columns and the values are booleans indicating whether
            the column will be generated or not. If not provided the default will be an
            empty list `[]` and it will be assumed that the step doesn't need any specific
            columns. Defaults to `None`.
    """

    inputs = inputs or []
    outputs = outputs or []

    def decorator(func: TaskFormattingOutputFunc) -> Type["Task"]:
        doc = inspect.getdoc(func)
        if doc is None:
            raise DistilabelUserError(
                "When using the `task` decorator, including a docstring in the formatting"
                " function is mandatory. The docstring must follow the format described"
                " in the documentation.",
                page="",
            )

        system_prompt, user_message_template = _parse_docstring(doc)
        _validate_templates(inputs, system_prompt, user_message_template)

        def inputs_property(self) -> "StepColumns":
            return inputs

        def outputs_property(self) -> "StepColumns":
            return outputs

        def format_input(self, input: Dict[str, Any]) -> "FormattedInput":
            return [
                {"role": "system", "content": system_prompt.format(**input)},
                {"role": "user", "content": user_message_template.format(**input)},
            ]

        def format_output(
            self, output: Union[str, None], input: Union[Dict[str, Any], None] = None
        ) -> Dict[str, Any]:
            return func(output, input)

        return type(
            func.__name__,
            (Task,),
            {
                "inputs": property(inputs_property),
                "outputs": property(outputs_property),
                "__module__": func.__module__,
                "format_input": format_input,
                "format_output": format_output,
            },
        )

    return decorator
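
To close, a hedged usage sketch of the decorator. The docstring of the decorated function is parsed by `_parse_docstring` into a system prompt and a user message template; the exact layout shown below (a `---` block with `system_prompt:` and `user_message_template:` keys) is an assumption based on the decorator's documentation elsewhere, so treat it as illustrative rather than definitive.

```python
from typing import Any, Dict, Union

from distilabel.steps.tasks import task  # assuming `task` is re-exported here


@task(inputs=["instruction"], outputs=["response"])
def MyResponseTask(
    output: Union[str, None], input: Union[Dict[str, Any], None] = None
) -> Dict[str, Any]:
    """
    ---
    system_prompt: |
        You are a helpful assistant.

    user_message_template: |
        Answer the following instruction: {instruction}
    ---
    """
    # The decorated function only formats the LLM output; the prompt is built
    # from the templates declared in the docstring above.
    return {"response": output}
```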