GenerateSentencePair¶
Generate a positive and negative (optionally) sentences given an anchor sentence.

`GenerateSentencePair` is a pre-defined task that, given an anchor sentence, generates a positive sentence related to the anchor, and optionally a negative sentence either unrelated to the anchor or similar to it. Optionally, you can provide a context to guide the LLM towards more specific behavior. This task is useful for generating training datasets for embedding models.
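Because the task targets embedding-model training data, a triplet row it produces can be flattened into `(anchor, positive, negative)` tuples of the kind embedding trainers consume. A minimal sketch with purely illustrative values (the row below is hypothetical, not real task output):

```python
# Hypothetical triplet rows shaped like a `triplet=True` run of the task
# (field names match the task's output columns; values are illustrative).
rows = [
    {
        "anchor": "How does 3D printing work?",
        "positive": "Can you explain the 3D printing process?",
        "negative": "What is the capital of France?",
        "model_name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    },
]

# Flatten each row into an (anchor, positive, negative) training triplet.
triplets = [(r["anchor"], r["positive"], r["negative"]) for r in rows]
print(triplets[0])
```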
Attributes¶

- triplet: a flag indicating whether the task should generate a triplet of sentences (anchor, positive, negative). Defaults to `False`.
- action: the action to perform to generate the positive sentence.
- context: the context to use for the generation. Can be helpful to guide the LLM towards more specific context. Not used by default.
- hard_negative: a flag indicating whether the negative should be a hard-negative or not. Hard negatives make it harder for the model to distinguish them from the positive, as they have a higher degree of semantic similarity.
Input & Output Columns¶

```mermaid
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[anchor]
        end
        subgraph New columns
            OCOL0[positive]
            OCOL1[negative]
            OCOL2[model_name]
        end
    end

    subgraph GenerateSentencePair
        StepInput[Input Columns: anchor]
        StepOutput[Output Columns: positive, negative, model_name]
    end

    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepInput --> StepOutput
```
Inputs¶

- anchor (`str`): the anchor sentence used to generate the positive and negative sentences.
Outputs¶

- positive (`str`): the positive sentence related to the `anchor`.
- negative (`str`): the negative sentence unrelated to the `anchor` if `triplet=True`, or more similar to the positive to make it more challenging for a model to distinguish if `hard_negative=True`.
- model_name (`str`): the name of the model that was used to generate the sentences.
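As a quick check of the schema above: which columns the task adds depends on `triplet`, since `negative` only appears in triplet mode. A hypothetical helper (not part of distilabel) mirroring that rule:

```python
def expected_output_columns(triplet: bool) -> list[str]:
    # `positive` and `model_name` are always added; `negative` only
    # appears when the task generates triplets (hypothetical helper
    # mirroring the output columns documented above).
    cols = ["positive"]
    if triplet:
        cols.append("negative")
    cols.append("model_name")
    return cols

print(expected_output_columns(True))   # ['positive', 'negative', 'model_name']
```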
Examples¶

Paraphrasing¶
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True,  # `False` to generate only positive
    action="paraphrase",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
```
Generating semantically similar sentences¶
```python
from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair

generate_sentence_pair = GenerateSentencePair(
    triplet=True,  # `False` to generate only positive
    action="semantically-similar",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "How does 3D printing work?"}])
```
Generating queries¶
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True,  # `False` to generate only positive
    action="query",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "Argilla is an open-source data curation platform for LLMs. Using Argilla, ..."}])
```
Generating answers¶
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True,  # `False` to generate only positive
    action="answer",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "What Game of Thrones villain would be the most likely to give you mercy?"}])
```
Generating queries with context (structured output)¶
```python
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.models import InferenceEndpointsLLM

generate_sentence_pair = GenerateSentencePair(
    triplet=True,  # `False` to generate only positive
    action="query",
    context="Argilla is an open-source data curation platform for LLMs.",
    hard_negative=True,
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    input_batch_size=10,
    use_default_structured_output=True,
)

generate_sentence_pair.load()

result = generate_sentence_pair.process([{"anchor": "I want to generate queries for my LLM."}])
```