MergeColumns

Merge columns from a row.

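`MergeColumns` is a `Step` that implements the `process` method, which calls the `merge_columns` function to handle and combine the columns in a `StepInput`. `MergeColumns` provides two attributes, `columns` and `output_column`, to specify the columns to merge and the resulting output column.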

This step can be useful if, for example, you have a `Task` that generates instructions and you
want more examples of them. In that case, you could use another `Task` to multiply your
instructions synthetically, which would yield two separate columns. Using `MergeColumns`, you
can merge them and use them as a single column in your dataset for further processing.

Attributes

  • columns: List of strings with the names of the columns to merge.

  • output_column: String name of the output column.
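To make the merging behaviour concrete, here is a minimal plain-Python sketch of what such a step might do with a batch of rows. It is not the library's implementation: `merge_columns_sketch` and `process_sketch` are hypothetical names, and the choices to wrap non-list values in a list and to drop the source columns from the row are assumptions inferred from the example output shown on this page.

```python
from typing import Any, Dict, Iterator, List


def merge_columns_sketch(
    row: Dict[str, Any], columns: List[str], output_column: str
) -> Dict[str, Any]:
    """Hypothetical helper: flatten the values of `columns` into one list."""
    result = dict(row)  # shallow copy so the caller's row is left untouched
    merged: List[Any] = []
    for name in columns:
        value = result.pop(name)  # assumption: source columns are removed
        # Assumption: scalar values are wrapped so everything ends up in one flat list.
        merged.extend(value if isinstance(value, list) else [value])
    result[output_column] = merged
    return result


def process_sketch(
    rows: List[Dict[str, Any]], columns: List[str], output_column: str
) -> Iterator[List[Dict[str, Any]]]:
    """Hypothetical stand-in for a step's `process`: yields one merged batch."""
    yield [merge_columns_sketch(row, columns, output_column) for row in rows]


batch = [
    {
        "id": 1,
        "queries": "How are you?",
        "multiple_queries": ["What's up?", "Everything ok?"],
    }
]
merged = next(process_sketch(batch, ["queries", "multiple_queries"], "queries"))
print(merged)
```

Columns that are not listed in `columns` (here `id`) pass through unchanged in this sketch, and reusing one of the input names as `output_column` simply replaces that column with the merged list.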

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[dynamic]
        end
        subgraph New columns
            OCOL0[dynamic]
        end
    end

    subgraph MergeColumns
        StepInput[Input Columns: dynamic]
        StepOutput[Output Columns: dynamic]
    end

    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepInput --> StepOutput

Inputs

  • dynamic (determined by the `columns` attribute): The columns to merge.

Outputs

  • dynamic (determined by the `columns` and `output_column` attributes): The merged columns.

Examples

Combine columns in rows of a dataset

from distilabel.steps import MergeColumns

combiner = MergeColumns(
    columns=["queries", "multiple_queries"],
    output_column="queries",
)
combiner.load()

result = next(
    combiner.process(
        [
            {
                "queries": "How are you?",
                "multiple_queries": ["What's up?", "Everything ok?"]
            }
        ],
    )
)
# >>> result
# [{'queries': ['How are you?', "What's up?", 'Everything ok?']}]