GeneratorStep

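This section contains the API reference for the `GeneratorStep` class.

For more information and examples on how to use the existing generator steps or create custom ones, please refer to Tutorial - Step - GeneratorStep.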

`GeneratorStep`

Bases: _Step, ABC

A special kind of Step that generates data, i.e. it does not receive any input from previous steps.

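Attributes

Name	Type	Description
`batch_size`	`RuntimeParameter[int]`	The number of rows contained in each batch generated by the step. Defaults to `50`.

Runtime parameters

batch_size: The number of rows contained in each batch generated by the step. Defaults to 50.

Source code in src/distilabel/steps/base.py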

class GeneratorStep(_Step, ABC):
    """A special kind of `Step` that is able to generate data i.e. it doesn't receive
    any input from the previous steps.

    Attributes:
        batch_size: The number of rows that will contain the batches generated by the
            step. Defaults to `50`.

    Runtime parameters:
        - `batch_size`: The number of rows that will contain the batches generated by
            the step. Defaults to `50`.
    """

    batch_size: RuntimeParameter[int] = Field(
        default=50,
        description="The number of rows that will contain the batches generated by the"
        " step.",
    )

    @abstractmethod
    def process(self, offset: int = 0) -> "GeneratorStepOutput":
        """Method that defines the generation logic of the step. It should yield the
        output rows and a boolean indicating if it's the last batch or not.

        Args:
            offset: The offset to start the generation from. Defaults to 0.

        Yields:
            The output rows and a boolean indicating if it's the last batch or not.
        """
        pass

    def process_applying_mappings(self, offset: int = 0) -> "GeneratorStepOutput":
        """Runs the `process` method of the step applying the `output_mappings` to the
        output rows. This is the function that should be used to run the generation logic
        of the step.

        Args:
            offset: The offset to start the generation from. Defaults to 0.

        Yields:
            The output rows and a boolean indicating if it's the last batch or not.
        """

        # If the `Step` was built using the `@step` decorator, then we need to pass
        # the runtime parameters as `kwargs`, so they can be used within the processing
        # function
        generator = (
            self.process(offset=offset)
            if not self._built_from_decorator
            else self.process(offset=offset, **self._runtime_parameters)
        )

        for output_rows, last_batch in generator:
            yield (
                [
                    {self.output_mappings.get(k, k): v for k, v in row.items()}
                    for row in output_rows
                ],
                last_batch,
            )

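`process(offset=0)` `abstractmethod`

Method that defines the generation logic of the step. It should yield the output rows together with a boolean indicating whether the batch is the last one.

Parameters

Name	Type	Description	Default
`offset`	`int`	The offset to start the generation from. Defaults to 0.	`0`

Yields

Type	Description
`GeneratorStepOutput`	The output rows and a boolean indicating whether the batch is the last one.

Source code in src/distilabel/steps/base.py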

@abstractmethod
def process(self, offset: int = 0) -> "GeneratorStepOutput":
    """Method that defines the generation logic of the step. It should yield the
    output rows and a boolean indicating if it's the last batch or not.

    Args:
        offset: The offset to start the generation from. Defaults to 0.

    Yields:
        The output rows and a boolean indicating if it's the last batch or not.
    """
    pass
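
The contract described above (yield a batch of rows plus a last-batch flag, starting from `offset`) can be sketched in plain Python so that it runs without distilabel installed. This is only an illustration: `NumbersGenerator`, `num_rows`, and the `number` column are hypothetical names, and a real step would subclass `GeneratorStep` and rely on its `batch_size` runtime parameter instead of setting it in `__init__`.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hypothetical alias standing in for what the source calls `GeneratorStepOutput`.
BatchOutput = Iterator[Tuple[List[Dict[str, Any]], bool]]


class NumbersGenerator:
    """Standalone stand-in for a `GeneratorStep` subclass (no distilabel import)."""

    def __init__(self, num_rows: int, batch_size: int = 50) -> None:
        self.num_rows = num_rows
        self.batch_size = batch_size

    def process(self, offset: int = 0) -> BatchOutput:
        start = offset
        while start < self.num_rows:
            end = min(start + self.batch_size, self.num_rows)
            rows = [{"number": i} for i in range(start, end)]
            # The flag is True only for the batch that reaches the end of the data.
            yield rows, end >= self.num_rows
            start = end


for rows, last_batch in NumbersGenerator(num_rows=5, batch_size=2).process():
    print(len(rows), last_batch)
```

In this sketch, passing a non-zero `offset` skips the rows before that position, which is one plausible way to resume generation part-way through the data.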

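`process_applying_mappings(offset=0)`

Runs the `process` method of the step and applies the `output_mappings` to the output rows. This is the function that should be used to run the generation logic of the step.

Parameters

Name	Type	Description	Default
`offset`	`int`	The offset to start the generation from. Defaults to 0.	`0`

Yields

Type	Description
`GeneratorStepOutput`	The output rows and a boolean indicating whether the batch is the last one.

Source code in src/distilabel/steps/base.py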

def process_applying_mappings(self, offset: int = 0) -> "GeneratorStepOutput":
    """Runs the `process` method of the step applying the `output_mappings` to the
    output rows. This is the function that should be used to run the generation logic
    of the step.

    Args:
        offset: The offset to start the generation from. Defaults to 0.

    Yields:
        The output rows and a boolean indicating if it's the last batch or not.
    """

    # If the `Step` was built using the `@step` decorator, then we need to pass
    # the runtime parameters as `kwargs`, so they can be used within the processing
    # function
    generator = (
        self.process(offset=offset)
        if not self._built_from_decorator
        else self.process(offset=offset, **self._runtime_parameters)
    )

    for output_rows, last_batch in generator:
        yield (
            [
                {self.output_mappings.get(k, k): v for k, v in row.items()}
                for row in output_rows
            ],
            last_batch,
        )
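
The renaming performed in the loop above can be tried in isolation. The following is a minimal sketch in plain Python, not the library's code: `apply_output_mappings` is a hypothetical free function that reuses the same dict comprehension, and the column names are made up for the example.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Batch = Tuple[List[Dict[str, Any]], bool]


def apply_output_mappings(
    batches: Iterable[Batch], output_mappings: Dict[str, str]
) -> Iterator[Batch]:
    """Rename the keys of every row according to `output_mappings`."""
    for output_rows, last_batch in batches:
        yield (
            [
                # Keys without an entry in the mapping keep their original name.
                {output_mappings.get(k, k): v for k, v in row.items()}
                for row in output_rows
            ],
            last_batch,
        )


batches = [([{"prompt": "Hello", "id": 1}], True)]
renamed = list(apply_output_mappings(batches, {"prompt": "instruction"}))
print(renamed)
```

Only the `prompt` column is renamed here; `id` passes through unchanged, and the last-batch flag is forwarded as-is.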

`make_generator_step(dataset, pipeline=None, batch_size=50, input_mappings=None, output_mappings=None, resources=StepResources(), repo_id='default_name')`

Helper method that simplifies creating a `GeneratorStep` from a dataset.

Parameters

Name	Type	Description	Default
`dataset`	`Union[Dataset, DataFrame, List[Dict[str, str]]]`	The dataset to use in the `Pipeline`.	required
`batch_size`	`int`	The batch size, which defaults to the same value used by `GeneratorStep`. Defaults to `50`.	`50`
`input_mappings`	`Optional[Dict[str, str]]`	Works the same as in any other step. Defaults to `None`.	`None`
`output_mappings`	`Optional[Dict[str, str]]`	Works the same as in any other step. Defaults to `None`.	`None`
`resources`	`StepResources`	Works the same as in any other step. Defaults to `StepResources()`.	`StepResources()`
`repo_id`	`Optional[str]`	The repository ID to use in the `LoadDataFromHub` step. This should not be necessary, but if an error occurs, the step will try to load the dataset internally with `load_dataset`, and in that case `repo_id` will be used.	`'default_name'`

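Raises

Type	Description
`DistilabelUserError`	If the dataset is not in one of the supported formats. The docstring lists `ValueError`, but the source raises `DistilabelUserError`.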

Returns

Type	Description
`GeneratorStep`	A `LoadDataFromDicts` if the input is a list of dicts, or a `LoadDataFromHub` instance if the input is a `pd.DataFrame` or a `Dataset`.

Source code in src/distilabel/steps/generators/utils.py

def make_generator_step(
    dataset: Union[Dataset, pd.DataFrame, List[Dict[str, str]]],
    pipeline: Union["BasePipeline", None] = None,
    batch_size: int = 50,
    input_mappings: Optional[Dict[str, str]] = None,
    output_mappings: Optional[Dict[str, str]] = None,
    resources: StepResources = StepResources(),
    repo_id: Optional[str] = "default_name",
) -> "GeneratorStep":
    """Helper method to create a `GeneratorStep` from a dataset, to simplify

    Args:
        dataset: The dataset to use in the `Pipeline`.
        batch_size: The batch_size, will default to the same used by the `GeneratorStep`s.
            Defaults to `50`.
        input_mappings: Applies the same as any other step. Defaults to `None`.
        output_mappings: Applies the same as any other step. Defaults to `None`.
        resources: Applies the same as any other step. Defaults to `StepResources()`.
        repo_id: The repository ID to use in the `LoadDataFromHub` step.
            This shouldn't be necessary, but in case of error, the dataset will try to be loaded
            using `load_dataset` internally. If that case happens, the `repo_id` will be used.

    Raises:
        ValueError: If the format is different from the ones supported.

    Returns:
        A `LoadDataFromDicts` if the input is a list of dicts, or `LoadDataFromHub` instance
        if the input is a `pd.DataFrame` or a `Dataset`.
    """
    from distilabel.steps import LoadDataFromDicts, LoadDataFromHub

    if isinstance(dataset, list):
        return LoadDataFromDicts(
            pipeline=pipeline,
            data=dataset,
            batch_size=batch_size,
            input_mappings=input_mappings or {},
            output_mappings=output_mappings or {},
            resources=resources,
        )

    if isinstance(dataset, pd.DataFrame):
        dataset = Dataset.from_pandas(dataset, preserve_index=False)

    if not isinstance(dataset, Dataset):
        raise DistilabelUserError(
            f"Dataset type not allowed: {type(dataset)}, must be one of: "
            "`datasets.Dataset`, `pd.DataFrame`, `List[Dict[str, str]]`",
            page="sections/how_to_guides/basic/pipeline/?h=make_#__tabbed_1_2",
        )

    loader = LoadDataFromHub(
        pipeline=pipeline,
        repo_id=repo_id,
        batch_size=batch_size,
        input_mappings=input_mappings or {},
        output_mappings=output_mappings or {},
        resources=resources,
    )
    super(loader.__class__, loader).load()  # Ensure the logger is loaded
    loader._dataset = dataset
    loader.num_examples = len(dataset)
    loader._dataset_info = {"default": dataset.info}
    return loader
