
GeneratorStep

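This section contains the API reference for the GeneratorStep class.

For more information and examples on how to use existing generator steps or create custom ones, see Tutorial - Step - GeneratorStep.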

GeneratorStep

Bases: _Step, ABC

A special kind of Step that generates data, i.e. it does not receive any input from previous steps.

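Attributes

Name Type Description
batch_size RuntimeParameter[int]

The number of rows in each batch generated by the step. Defaults to 50.

Runtime parameters
  • batch_size: The number of rows in each batch generated by the step. Defaults to 50.
Source code in src/distilabel/steps/base.py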
class GeneratorStep(_Step, ABC):
    """A special kind of `Step` that is able to generate data i.e. it doesn't receive
    any input from the previous steps.

    Attributes:
        batch_size: The number of rows that will contain the batches generated by the
            step. Defaults to `50`.

    Runtime parameters:
        - `batch_size`: The number of rows that will contain the batches generated by
            the step. Defaults to `50`.
    """

    batch_size: RuntimeParameter[int] = Field(
        default=50,
        description="The number of rows that will contain the batches generated by the"
        " step.",
    )

    @abstractmethod
    def process(self, offset: int = 0) -> "GeneratorStepOutput":
        """Method that defines the generation logic of the step. It should yield the
        output rows and a boolean indicating if it's the last batch or not.

        Args:
            offset: The offset to start the generation from. Defaults to 0.

        Yields:
            The output rows and a boolean indicating if it's the last batch or not.
        """
        pass

    def process_applying_mappings(self, offset: int = 0) -> "GeneratorStepOutput":
        """Runs the `process` method of the step applying the `outputs_mappings` to the
        output rows. This is the function that should be used to run the generation logic
        of the step.

        Args:
            offset: The offset to start the generation from. Defaults to 0.

        Yields:
            The output rows and a boolean indicating if it's the last batch or not.
        """

        # If the `Step` was built using the `@step` decorator, then we need to pass
        # the runtime parameters as `kwargs`, so they can be used within the processing
        # function
        generator = (
            self.process(offset=offset)
            if not self._built_from_decorator
            else self.process(offset=offset, **self._runtime_parameters)
        )

        for output_rows, last_batch in generator:
            yield (
                [
                    {self.output_mappings.get(k, k): v for k, v in row.items()}
                    for row in output_rows
                ],
                last_batch,
            )

process(offset=0) abstractmethod

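Method that defines the generation logic of the step. It should yield the output rows and a boolean indicating whether the batch is the last one.

Parameters

Name Type Description Default
offset int

The offset to start the generation from. Defaults to 0.

0

Yields

Type Description
GeneratorStepOutput

The output rows and a boolean indicating whether the batch is the last one.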
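
The contract described above (yield a list of rows together with a last-batch flag, starting from `offset`) can be sketched without distilabel. This is a minimal standalone illustration, not the library's implementation: `generate_batches` and the `Batch` alias are hypothetical names, and emitting one empty final batch when no rows remain is an assumption.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hypothetical stand-in for distilabel's GeneratorStepOutput: (rows, last_batch).
Batch = Tuple[List[Dict[str, Any]], bool]


def generate_batches(
    data: List[Dict[str, Any]], batch_size: int = 50, offset: int = 0
) -> Iterator[Batch]:
    """Yield (rows, last_batch) pairs, skipping the first `offset` rows."""
    remaining = data[offset:]
    if not remaining:
        # Assumption: with nothing left to generate, emit one empty final batch.
        yield [], True
        return
    for start in range(0, len(remaining), batch_size):
        rows = remaining[start : start + batch_size]
        # The batch is the last one when no rows follow it.
        last_batch = start + batch_size >= len(remaining)
        yield rows, last_batch


data = [{"instruction": f"task {i}"} for i in range(5)]
batches = list(generate_batches(data, batch_size=2))
# Resuming from offset 4 leaves a single row, so only one (final) batch is produced.
resumed = list(generate_batches(data, batch_size=2, offset=4))
```

In a real subclass the same loop would live in `process`, with `batch_size` read from `self.batch_size`.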

Source code in src/distilabel/steps/base.py
@abstractmethod
def process(self, offset: int = 0) -> "GeneratorStepOutput":
    """Method that defines the generation logic of the step. It should yield the
    output rows and a boolean indicating if it's the last batch or not.

    Args:
        offset: The offset to start the generation from. Defaults to 0.

    Yields:
        The output rows and a boolean indicating if it's the last batch or not.
    """
    pass

process_applying_mappings(offset=0)

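Runs the process method of the step, applying the output_mappings to the output rows. This is the function that should be used to run the generation logic of the step.

Parameters

Name Type Description Default
offset int

The offset to start the generation from. Defaults to 0.

0

Yields

Type Description
GeneratorStepOutput

The output rows and a boolean indicating whether the batch is the last one.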
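
The renaming step can be shown in isolation. The sketch below reuses the dict comprehension from the source on this page, but on plain dictionaries; `apply_output_mappings` is a hypothetical helper name and the column names are made up for illustration.

```python
from typing import Any, Dict, List


def apply_output_mappings(
    rows: List[Dict[str, Any]], output_mappings: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Rename the keys of each row; keys without a mapping are kept unchanged."""
    return [
        {output_mappings.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


rows = [{"instruction": "Say hi", "source": "demo"}]
# Only "instruction" is mapped, so "source" passes through untouched.
renamed = apply_output_mappings(rows, {"instruction": "prompt"})
```

Because `dict.get(key, key)` falls back to the original key, an empty mapping leaves every row as it was.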

Source code in src/distilabel/steps/base.py
def process_applying_mappings(self, offset: int = 0) -> "GeneratorStepOutput":
    """Runs the `process` method of the step applying the `outputs_mappings` to the
    output rows. This is the function that should be used to run the generation logic
    of the step.

    Args:
        offset: The offset to start the generation from. Defaults to 0.

    Yields:
        The output rows and a boolean indicating if it's the last batch or not.
    """

    # If the `Step` was built using the `@step` decorator, then we need to pass
    # the runtime parameters as `kwargs`, so they can be used within the processing
    # function
    generator = (
        self.process(offset=offset)
        if not self._built_from_decorator
        else self.process(offset=offset, **self._runtime_parameters)
    )

    for output_rows, last_batch in generator:
        yield (
            [
                {self.output_mappings.get(k, k): v for k, v in row.items()}
                for row in output_rows
            ],
            last_batch,
        )

make_generator_step(dataset, pipeline=None, batch_size=50, input_mappings=None, output_mappings=None, resources=StepResources(), repo_id='default_name')

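Helper function that creates a GeneratorStep from a dataset.

Parameters

Name Type Description Default
dataset Union[Dataset, DataFrame, List[Dict[str, str]]]

The dataset to use in the Pipeline.

required
batch_size int

The batch size; it defaults to the same value used by GeneratorStep. Defaults to 50.

50
input_mappings Optional[Dict[str, str]]

Applies the same as in any other step. Defaults to None.

None
output_mappings Optional[Dict[str, str]]

Applies the same as in any other step. Defaults to None.

None
resources StepResources

Applies the same as in any other step. Defaults to StepResources().

StepResources()
repo_id Optional[str]

The repository ID to use in the LoadDataFromHub step. This should not be necessary, but if an error occurs, the dataset will be loaded internally with load_dataset, in which case repo_id is used.

'default_name'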

Raises

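Type Description
ValueError

If the format is different from the supported ones. Note that the source shown on this page raises DistilabelUserError in this case.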

Returns

Type Description
GeneratorStep

A LoadDataFromDicts if the input is a list of dicts, or a LoadDataFromHub instance if the input is a pd.DataFrame or a Dataset.
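
A minimal usage sketch for the list-of-dicts case follows. It assumes distilabel is installed and that the function is importable from the module path given for its source (`distilabel.steps.generators.utils`); `build_loader` is a hypothetical wrapper, and the import is deferred so the snippet loads even without distilabel.

```python
from typing import Any, Dict, List


def build_loader(rows: List[Dict[str, str]], batch_size: int = 50) -> Any:
    """Wrap in-memory rows in a generator step via make_generator_step."""
    # Deferred import: requires distilabel; the module path is taken from the
    # source location stated on this page.
    from distilabel.steps.generators.utils import make_generator_step

    # A list of dicts should come back as a LoadDataFromDicts step.
    return make_generator_step(rows, batch_size=batch_size)


rows = [
    {"instruction": "What is 2 + 2?"},
    {"instruction": "Name a prime number."},
]
# With distilabel installed: loader = build_loader(rows, batch_size=1)
```

Passing a `pd.DataFrame` or a `datasets.Dataset` instead of a list would take the LoadDataFromHub branch described above.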

Source code in src/distilabel/steps/generators/utils.py
def make_generator_step(
    dataset: Union[Dataset, pd.DataFrame, List[Dict[str, str]]],
    pipeline: Union["BasePipeline", None] = None,
    batch_size: int = 50,
    input_mappings: Optional[Dict[str, str]] = None,
    output_mappings: Optional[Dict[str, str]] = None,
    resources: StepResources = StepResources(),
    repo_id: Optional[str] = "default_name",
) -> "GeneratorStep":
    """Helper method to create a `GeneratorStep` from a dataset, to simplify

    Args:
        dataset: The dataset to use in the `Pipeline`.
        batch_size: The batch_size, will default to the same used by the `GeneratorStep`s.
            Defaults to `50`.
        input_mappings: Applies the same as any other step. Defaults to `None`.
        output_mappings: Applies the same as any other step. Defaults to `None`.
        resources: Applies the same as any other step. Defaults to `StepResources()`.
        repo_id: The repository ID to use in the `LoadDataFromHub` step.
            This shouldn't be necessary, but in case of error, the dataset will try to be loaded
            using `load_dataset` internally. If that case happens, the `repo_id` will be used.

    Raises:
        ValueError: If the format is different from the ones supported.

    Returns:
        A `LoadDataFromDicts` if the input is a list of dicts, or `LoadDataFromHub` instance
        if the input is a `pd.DataFrame` or a `Dataset`.
    """
    from distilabel.steps import LoadDataFromDicts, LoadDataFromHub

    if isinstance(dataset, list):
        return LoadDataFromDicts(
            pipeline=pipeline,
            data=dataset,
            batch_size=batch_size,
            input_mappings=input_mappings or {},
            output_mappings=output_mappings or {},
            resources=resources,
        )

    if isinstance(dataset, pd.DataFrame):
        dataset = Dataset.from_pandas(dataset, preserve_index=False)

    if not isinstance(dataset, Dataset):
        raise DistilabelUserError(
            f"Dataset type not allowed: {type(dataset)}, must be one of: "
            "`datasets.Dataset`, `pd.DataFrame`, `List[Dict[str, str]]`",
            page="sections/how_to_guides/basic/pipeline/?h=make_#__tabbed_1_2",
        )

    loader = LoadDataFromHub(
        pipeline=pipeline,
        repo_id=repo_id,
        batch_size=batch_size,
        input_mappings=input_mappings or {},
        output_mappings=output_mappings or {},
        resources=resources,
    )
    super(loader.__class__, loader).load()  # Ensure the logger is loaded
    loader._dataset = dataset
    loader.num_examples = len(dataset)
    loader._dataset_info = {"default": dataset.info}
    return loader