
GenerateSentencePair

Generate a positive and an (optional) negative sentence given an anchor sentence.

GenerateSentencePair is a pre-defined task that, given an anchor sentence, generates a positive sentence related to the anchor, and optionally a negative sentence either unrelated to the anchor or similar to it. Optionally, you can provide a context to steer the LLM towards more specific behavior. This task is useful for generating training datasets for embedding models.

Attributes

  • triplet: a flag to indicate whether the task should generate a triplet of sentences (anchor, positive, negative). Defaults to False.

  • action: the action to perform when generating the positive sentence.

  • context: the context to use for the generation. Helpful to steer the LLM towards a more specific context. Not used by default.

  • hard_negative: a flag to indicate whether the negative should be a hard negative. Hard negatives make it harder for the model to distinguish the negative from the positive, as they have a higher degree of semantic similarity.

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[anchor]
        end
        subgraph New columns
            OCOL0[positive]
            OCOL1[negative]
            OCOL2[model_name]
        end
    end

    subgraph GenerateSentencePair
        StepInput[Input Columns: anchor]
        StepOutput[Output Columns: positive, negative, model_name]
    end

    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepInput --> StepOutput

Inputs

  • anchor (str): the anchor sentence used to generate the positive and negative sentences.

Outputs

  • positive (str): the positive sentence related to the anchor.

  • negative (str): the negative sentence unrelated to the anchor (if triplet=True), or more similar to the positive sentence to increase the difficulty for a model to distinguish them (if hard_negative=True).

  • model_name (str): the name of the model used to generate the sentences.
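Taken together, a row produced with triplet=True carries the anchor plus the three new columns above. A minimal sketch of the expected row shape (the string values are illustrative placeholders, not real model output):

```python
# Hedged sketch of a single output row when triplet=True.
# Only the keys reflect the columns described above; the values
# are placeholders standing in for generated text.
row = {
    "anchor": "How does 3D printing work?",               # input column
    "positive": "<generated positive sentence>",           # new column
    "negative": "<generated negative sentence>",           # new column
    "model_name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
}

# Every output row exposes exactly these keys.
assert set(row) == {"anchor", "positive", "negative", "model_name"}
```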

Examples

Paraphrasing

from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="paraphrase",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])

Generating semantically similar sentences

from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="semantically-similar",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}])

Generating queries

from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="query",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}])

Generating answers

from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="answer",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])

Generating queries with context and hard negatives

from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True, # `False` to generate only positive
    action="query",
    context="Argilla is an open-source data curation platform for LLMs.",
    hard_negative=True,
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
    use_default_structured_output=True
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])