
UltraFeedback

Rank generations focusing on different aspects using an LLM.

UltraFeedback: Boosting Language Models with High-quality Feedback.

Attributes

  • aspect: The aspect to evaluate with the UltraFeedback model. The available aspects are:
      - helpfulness: evaluate text outputs based on helpfulness.
      - honesty: evaluate text outputs based on honesty.
      - instruction-following: evaluate text outputs based on the given instruction.
      - truthfulness: evaluate text outputs based on truthfulness.
    In addition, Argilla defines a custom aspect to obtain an overall assessment of the text outputs within a single prompt:
      - overall-rating: evaluate text outputs based on an overall assessment.
    Defaults to "overall-rating" (see the configuration sketch below).
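
The aspect is chosen when the task is constructed. A minimal sketch, assuming the same InferenceEndpointsLLM used in the examples further down (the model id is only a placeholder):

from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM

# Any supported LLM works here; the model id is a placeholder.
ultrafeedback = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    aspect="truthfulness",  # any aspect listed above; defaults to "overall-rating"
)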

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[instruction]
            ICOL1[generations]
        end
        subgraph New columns
            OCOL0[ratings]
            OCOL1[rationales]
            OCOL2[model_name]
        end
    end

    subgraph UltraFeedback
        StepInput[Input Columns: instruction, generations]
        StepOutput[Output Columns: ratings, rationales, model_name]
    end

    ICOL0 --> StepInput
    ICOL1 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepInput --> StepOutput

Inputs

  • instruction (str): The reference instruction used to evaluate the text outputs.

  • generations (List[str]): The text outputs to evaluate for the given instruction.

Outputs

  • ratings (List[float]): The rating for each of the provided text outputs.

  • rationales (List[str]): The rationale for each of the provided text outputs.

  • model_name (str): The name of the model used to generate the ratings and rationales.
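
Downstream code typically pairs these columns with the original generations, for example to keep the highest-rated output per instruction. A minimal sketch, assuming row is one output dictionary yielded by the task (as in the examples below):

# `row` is one output dictionary from UltraFeedback, e.g. result[0] in the examples below.
valid = [i for i, r in enumerate(row["ratings"]) if r is not None]  # skip unparsed ratings
best = max(valid, key=lambda i: row["ratings"][i])
print(row["generations"][best], row["rationales"][best], row["model_name"])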

Examples

Rate generations from different LLMs based on the selected aspect

from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="mistralai/Mistral-7B-Instruct-v0.2",
    ),
    use_default_structured_output=False
)

ultrafeedback.load()

result = next(
    ultrafeedback.process(
        [
            {
                "instruction": "How much is 2+2?",
                "generations": ["4", "and a car"],
            }
        ]
    )
)
# result
# [
#     {
#         'instruction': 'How much is 2+2?',
#         'generations': ['4', 'and a car'],
#         'ratings': [1, 2],
#         'rationales': ['explanation for 4', 'explanation for and a car'],
#         'model_name': 'mistralai/Mistral-7B-Instruct-v0.2',
#     }
# ]

Rate generations from different LLMs based on honesty, using the default structured output

from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    ),
    aspect="honesty"
)

ultrafeedback.load()

result = next(
    ultrafeedback.process(
        [
            {
                "instruction": "How much is 2+2?",
                "generations": ["4", "and a car"],
            }
        ]
    )
)
# result
# [{'instruction': 'How much is 2+2?',
# 'generations': ['4', 'and a car'],
# 'ratings': [5, 1],
# 'rationales': ['The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.',
# "The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer."],
# 'distilabel_metadata': {'raw_output_ultra_feedback_0': '{"ratings": [\n    5,\n    1\n] \n\n,"rationales": [\n    "The response is correct and confident, as it directly answers the question without expressing any uncertainty or doubt.",\n    "The response is confidently incorrect, as it provides unrelated information ('a car') and does not address the question. The model shows no uncertainty or indication that it does not know the answer."\n] }'},
# 'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]

Rate generations from different LLMs based on helpfulness, using the default structured output

from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM

# Consider this as a placeholder for your actual LLM.
ultrafeedback = UltraFeedback(
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512},
    ),
    aspect="helpfulness"
)

ultrafeedback.load()

result = next(
    ultrafeedback.process(
        [
            {
                "instruction": "How much is 2+2?",
                "generations": ["4", "and a car"],
            }
        ]
    )
)
# result
# [{'instruction': 'How much is 2+2?',
#   'generations': ['4', 'and a car'],
#   'ratings': [1, 5],
#   'rationales': ['Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.',
#    'Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question.'],
#   'rationales_for_rating': ['Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.',
#    'Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question.'],
#   'types': [1, 3, 1],
#   'distilabel_metadata': {'raw_output_ultra_feedback_0': '{ \n  "ratings": [\n    1,\n    5\n  ]\n ,\n  "rationales": [\n    "Text 1 is clear and relevant, providing the correct answer to the question. It is also not lengthy and does not contain repetition. However, it lacks comprehensive information or detailed description.",\n    "Text 2 is neither clear nor relevant to the task. It does not provide any useful information and seems unrelated to the question."\n  ]\n ,\n  "rationales_for_rating": [\n    "Text 1 is rated as Correct (3) because it provides the accurate answer to the question, but lacks comprehensive information or detailed description.",\n    "Text 2 is rated as Severely Incorrect (1) because it does not provide any relevant information and seems unrelated to the question."\n  ]\n ,\n  "types": [\n    1, 3,\n    1\n  ]\n  }'},
#   'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
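
In practice the task usually runs inside a Pipeline rather than being called directly via process(). A minimal sketch, assuming LoadDataFromDicts as the data source and the same placeholder model id used above:

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import UltraFeedback
from distilabel.models import InferenceEndpointsLLM

with Pipeline(name="ultrafeedback-demo") as pipeline:
    # Load rows with the expected input columns: instruction and generations.
    load_data = LoadDataFromDicts(
        data=[{"instruction": "How much is 2+2?", "generations": ["4", "and a car"]}]
    )
    ultrafeedback = UltraFeedback(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        ),
        aspect="overall-rating",
    )
    load_data >> ultrafeedback

distiset = pipeline.run()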

References

  • UltraFeedback: Boosting Language Models with High-quality Feedback (https://arxiv.org/abs/2310.01377)