UltraFeedback¶
Rank generations focusing on different aspects using an LLM.

Reference: UltraFeedback: Boosting Language Models with High-quality Feedback.
Attributes¶
- aspect: The aspect to perform with the UltraFeedback model. The available aspects are:
    - helpfulness: evaluate text outputs based on helpfulness.
    - honesty: evaluate text outputs based on honesty.
    - instruction-following: evaluate text outputs based on the given instruction.
    - truthfulness: evaluate text outputs based on truthfulness.
  In addition, Argilla defines a custom aspect so that the overall assessment of the text outputs can be obtained within a single prompt. The custom aspect is:
    - overall-rating: evaluate text outputs based on an overall assessment.
  Defaults to "overall-rating".
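A minimal configuration sketch that selects a non-default aspect (the model_id is only a placeholder; swap in whichever LLM you actually use):

from distilabel.models import InferenceEndpointsLLM
from distilabel.steps.tasks import UltraFeedback

# Sketch only: rate generations for truthfulness instead of the default
# "overall-rating" aspect. The model shown here is just a placeholder.
ultrafeedback = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    aspect="truthfulness",
)
ultrafeedback.load()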
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph Columns
ICOL0[instruction]
ICOL1[generations]
end
subgraph New columns
OCOL0[ratings]
OCOL1[rationales]
OCOL2[model_name]
end
end
subgraph UltraFeedback
StepInput[Input Columns: instruction, generations]
StepOutput[Output Columns: ratings, rationales, model_name]
end
ICOL0 --> StepInput
ICOL1 --> StepInput
StepOutput --> OCOL0
StepOutput --> OCOL1
StepOutput --> OCOL2
StepInput --> StepOutput
Inputs¶
- instruction (str): The reference instruction to evaluate the text outputs against.
- generations (List[str]): The text outputs to evaluate for the given instruction.
Outputs¶
- ratings (List[float]): The rating for each of the provided text outputs.
- rationales (List[str]): The rationale for each of the provided text outputs.
- model_name (str): The name of the model used to generate the ratings and rationales.
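Purely as an illustration of the record shapes implied by the columns above (distilabel works with plain dictionaries; these TypedDicts are not part of its API):

from typing import List, TypedDict

class UltraFeedbackInput(TypedDict):
    instruction: str        # reference instruction the outputs are judged against
    generations: List[str]  # candidate text outputs to rate

class UltraFeedbackOutput(UltraFeedbackInput):
    ratings: List[float]    # one rating per generation
    rationales: List[str]   # one rationale per generation
    model_name: str         # LLM used to produce the ratings and rationales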
Examples¶
Rate generations from different LLMs based on the selected aspect¶
from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mistral-7B-Instruct-v0.2",
),
use_default_structured_output=False
)
ultrafeedback.load()
result = next(
ultrafeedback.process(
[
{
"instruction": "How much is 2+2?",
"generations": ["4", "and a car"],
}
]
)
)
# result
# [
# {
# 'instruction': 'How much is 2+2?',
# 'generations': ['4', 'and a car'],
# 'ratings': [1, 2],
# 'rationales': ['explanation for 4', 'explanation for and a car'],
# 'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
# }
# ]
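As a follow-up to the example above, one way to consume the output is to keep the highest-rated generation per record. This is only a sketch and assumes every rating was parsed successfully (a rating may be missing if the LLM response cannot be parsed):

# Pick the best generation per record, reusing `result` from the example above.
for row in result:
    best_idx = max(range(len(row["ratings"])), key=lambda i: row["ratings"][i])
    print(f"{row['instruction']!r} -> best generation: {row['generations'][best_idx]!r}")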
Rate generations from different LLMs based on honesty, using the default structured output¶
from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
),
aspect="honesty"
)
ultrafeedback.load()
result = next(
ultrafeedback.process(
[
{
"instruction": "How much is 2+2?",
"generations": ["4", "and a car"],
}
]
)
)
# result
# [{'instruction': 'How much is 2+2?',
# 'generations': ['4', 'and a car'],
# 'ratings': [5, 1],
# 'rationales': ['The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.',
# "The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer."],
# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{"ratings": [\n 5,\n 1\n] \n\n,"rationales": [\n "The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.",\n "The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer."\n] }'},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
Rate generations from different LLMs based on helpfulness, using the default structured output¶
from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM
# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
generation_kwargs={"max_new_tokens": 512},
),
aspect="helpfulness"
)
ultrafeedback.load()
result = next(
ultrafeedback.process(
[
{
"instruction": "How much is 2+2?",
"generations": ["4", "and a car"],
}
]
)
)
# result
# [{'instruction': 'How much is 2+2?',
# 'generations': ['4', 'and a car'],
# 'ratings': [1, 5],
# 'rationales': ['Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.',
# 'Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.'],
# 'rationales_for_rating': ['Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.',
# 'Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.'],
# 'types': [1, 3, 1],
# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{ \n "ratings": [\n 1,\n 5\n ]\n ,\n "rationales": [\n "Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.",\n "Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question."\n ]\n ,\n "rationales_for_rating": [\n "Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.",\n "Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question."\n ]\n ,\n "types": [\n 1, 3,\n 1\n ]\n }'},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]