TextClustering

Task that clusters a set of texts and generates a summary label for each cluster.

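This is a GlobalTask that inherits from TextClassification, so all the attributes of that class are available here. Because it is a global task, it processes all the inputs at once instead of in batches. Here input_batch_size controls how many examples are sent to the LLM per batch, which differs slightly from the more common Task definitions.

The task looks in each cluster for a given number of representative examples (set by the samples_per_cluster attribute) and sends them to the LLM to obtain a label, or labels, that represent the cluster. The labels are then assigned to every text in the cluster.

The clusters and projections used in this step are assumed to come from the UMAP + DBSCAN steps, but they can be produced by similar steps as long as they represent the same concepts. This step runs a pipeline like the one in this repository: https://github.com/huggingface/text-clustering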
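The labelling flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `label_clusters` and `fake_summarize` are hypothetical names, the LLM call is replaced by a stub, the "representative" examples are simply the first ones found in each cluster, and giving unclustered rows (`cluster_label == -1`) the default label is an assumption of this sketch.

```python
from collections import defaultdict


def label_clusters(rows, summarize, samples_per_cluster=10, default_label="None"):
    """Assign one summary label per cluster to every row in that cluster."""
    # Group row indices by cluster, skipping unclustered rows (-1).
    by_cluster = defaultdict(list)
    for i, row in enumerate(rows):
        if row["cluster_label"] != -1:
            by_cluster[row["cluster_label"]].append(i)

    # Ask the summarizer (a stand-in for the LLM) for one label per cluster,
    # showing it at most `samples_per_cluster` texts from that cluster.
    labels = {}
    for cluster, idxs in by_cluster.items():
        samples = [rows[i]["text"] for i in idxs[:samples_per_cluster]]
        labels[cluster] = summarize(samples)

    # Propagate the cluster label to every row; unclustered rows fall back
    # to the default label (assumption of this sketch).
    return [
        {**row, "summary_label": labels.get(row["cluster_label"], default_label)}
        for row in rows
    ]


def fake_summarize(samples):
    # Hypothetical stub: a real run would prompt an LLM here.
    return f"{len(samples)} samples, first: {samples[0]}"


rows = [
    {"text": "a nurse", "cluster_label": 0},
    {"text": "a surgeon", "cluster_label": 0},
    {"text": "a pilot", "cluster_label": 1},
    {"text": "???", "cluster_label": -1},
]
labeled = label_clusters(rows, fake_summarize, samples_per_cluster=1)
```

The point of the sketch is that the LLM is called once per cluster rather than once per text, which is why the number of sampled examples matters more than the dataset size.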

Attributes

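savefig: Whether to generate and save a figure with the clustering of the texts.
samples_per_cluster: The number of examples from each cluster that are shown to the LLM as samples.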

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[text]
            ICOL1[projection]
            ICOL2[cluster_label]
        end
        subgraph New columns
            OCOL0[summary_label]
            OCOL1[model_name]
        end
    end

    subgraph TextClustering
        StepInput[Input Columns: text, projection, cluster_label]
        StepOutput[Output Columns: summary_label, model_name]
    end

    ICOL0 --> StepInput
    ICOL1 --> StepInput
    ICOL2 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepInput --> StepOutput

Inputs

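text (str): The reference text we want to obtain labels for.
projection (List[float]): Vector representation of the text to cluster, normally the output of the UMAP step.
cluster_label (int): Integer identifying the cluster the text belongs to. -1 means the text was not clustered.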
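As a rough illustration of the input columns listed above, the following sketch checks that rows have the expected shape and counts how many texts fall in each cluster, leaving out the unclustered ones. The helper names `check_row` and `cluster_sizes` and the sample rows are made up for this example and are not part of distilabel.

```python
from collections import Counter

# Expected input columns and their Python types, as listed above.
REQUIRED = {"text": str, "projection": list, "cluster_label": int}


def check_row(row):
    """Return True if the row has every required column with the right type."""
    return all(
        isinstance(row.get(column), expected)
        for column, expected in REQUIRED.items()
    )


def cluster_sizes(rows):
    """Count texts per cluster, ignoring unclustered rows (cluster_label == -1)."""
    return Counter(r["cluster_label"] for r in rows if r["cluster_label"] != -1)


# Made-up rows: a 2D projection per text, plus the cluster assigned to it.
rows = [
    {"text": "a nurse", "projection": [0.1, 0.2], "cluster_label": 0},
    {"text": "a surgeon", "projection": [0.1, 0.3], "cluster_label": 0},
    {"text": "a pilot", "projection": [0.9, 0.8], "cluster_label": 1},
    {"text": "???", "projection": [5.0, -4.0], "cluster_label": -1},
]
sizes = cluster_sizes(rows)
```

A check like this can be useful when the projections and cluster labels come from steps other than UMAP and DBSCAN, since the task only relies on these three columns being present.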

Outputs

summary_label (str): The label or list of labels for the text.
model_name (str): The name of the model used to generate the label.

Examples

Generate labels for a set of texts using clustering

from datasets import load_dataset

from distilabel.models import InferenceEndpointsLLM
from distilabel.steps import UMAP, DBSCAN, TextClustering, make_generator_step
from distilabel.pipeline import Pipeline

ds_name = "argilla-warehouse/personahub-fineweb-edu-4-clustering-100k"

with Pipeline(name="Text clustering dataset") as pipeline:
    batch_size = 500

    ds = load_dataset(ds_name, split="train").select(range(10000))
    loader = make_generator_step(ds, batch_size=batch_size, repo_id=ds_name)

    umap = UMAP(n_components=2, metric="cosine")
    dbscan = DBSCAN(eps=0.3, min_samples=30)

    text_clustering = TextClustering(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        ),
        n=3,  # 3 labels per example
        query_title="Examples of Personas",
        samples_per_cluster=10,
        context=(
            "Describe the main themes, topics, or categories that could describe the "
            "following types of personas. All the examples of personas must share "
            "the same set of labels."
        ),
        default_label="None",
        savefig=True,
        input_batch_size=8,
        input_mappings={"text": "persona"},
        use_default_structured_output=True,
    )

    loader >> umap >> dbscan >> text_clustering

References

text-clustering repository