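PrometheusEval

Critique and rank the quality of generations from an LLM using Prometheus 2.0.

PrometheusEval is a task created for Prometheus 2.0 that covers both absolute and relative evaluation. Absolute evaluation, i.e. mode="absolute", assesses a single generation from an LLM for a given instruction. Relative evaluation, i.e. mode="relative", compares two generations from an LLM for a given instruction. Either mode can optionally compare against a reference answer, enabled via the reference attribute, and both rely on a scoring rubric that critiques the generation(s) on one of the following default aspects: helpfulness, harmlessness, honesty, factual-validity, and reasoning. The defaults can be overridden via rubrics, and the rubric to apply is selected via the rubric attribute.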

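Note

The PrometheusEval task is intended for, and best suited to, the Prometheus 2.0 models released by Kaist AI (https://hugging-face.cn/prometheus-eval/prometheus-7b-v2.0 and https://hugging-face.cn/prometheus-eval/prometheus-8x7b-v2.0). With any other model, the format and quality of the critique are not guaranteed, even though some other models may also follow the format correctly and produce insightful critiques.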

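Attributes

mode: the evaluation mode to use, either absolute or relative. It determines whether the task evaluates one generation or two.
rubric: the scoring rubric used in the prompt to run the critique on a given aspect. It can be any key present in the rubrics attribute, which by default means one of helpfulness, harmlessness, honesty, factual-validity, or reasoning. Those names are only valid with the default rubrics; if custom rubrics are provided, use one of their keys instead.
rubrics: a dictionary of the rubrics available for the critique, where each key is a rubric name and each value is the rubric description. The default rubrics are helpfulness, harmlessness, honesty, factual-validity, and reasoning.
reference: a boolean flag indicating whether a reference answer/completion will be provided so that the model critique is based on a comparison with it. When set, a reference column must be present in the input data in addition to the other inputs.
_template: the Jinja2 template used to format the input for the LLM.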
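
The custom rubric used in the examples further down is a single string: a criterion in square brackets followed by one "Score N: ..." line for each of the five scores. The sketch below assembles such a string and places it in a rubrics dictionary. It assumes that layout, which is inferred only from that example, and build_rubric is a hypothetical helper name, not part of distilabel.

```python
from typing import Dict, Sequence


def build_rubric(criterion: str, score_descriptions: Sequence[str]) -> str:
    """Assemble a rubric description string (hypothetical helper, not a distilabel API)."""
    if len(score_descriptions) != 5:
        raise ValueError("expected exactly 5 score descriptions, one per score from 1 to 5")
    # First line: the criterion in square brackets.
    lines = [f"[{criterion}]"]
    # Then one line per score, numbered from 1 to 5.
    for score, description in enumerate(score_descriptions, start=1):
        lines.append(f"Score {score}: {description}")
    return "\n".join(lines)


# A rubrics dictionary maps each rubric name to its description string.
custom_rubrics: Dict[str, str] = {
    "custom": build_rubric("A", ["A", "B", "C", "D", "E"]),
}
```

A dictionary built this way could then be passed as rubrics=custom_rubrics together with rubric="custom".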

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[instruction]
            ICOL1[generation]
            ICOL2[generations]
            ICOL3[reference]
        end
        subgraph New columns
            OCOL0[feedback]
            OCOL1[result]
            OCOL2[model_name]
        end
    end

    subgraph PrometheusEval
        StepInput[Input Columns: instruction, generation, generations, reference]
        StepOutput[Output Columns: feedback, result, model_name]
    end

    ICOL0 --> StepInput
    ICOL1 --> StepInput
    ICOL2 --> StepInput
    ICOL3 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepOutput --> OCOL2
    StepInput --> StepOutput

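Inputs

instruction (str): the instruction to use as reference.
generation (str, optional): the text generated for the given instruction. This column is required if mode=absolute.
generations (List[str], optional): the texts generated for the given instruction. It must contain exactly 2 generations. This column is required if mode=relative.
reference (str, optional): the reference/golden answer for the instruction, which the LLM uses as a point of comparison.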
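
Because the required columns depend on mode and on the reference flag, it can help to check rows before running the task. The following is a minimal sketch of such a check under the column rules listed above; find_row_problems is a hypothetical helper and not part of distilabel, which may perform its own validation.

```python
from typing import Any, Dict, List


def find_row_problems(row: Dict[str, Any], mode: str, reference: bool = False) -> List[str]:
    """Return a list of problems found in one input row (hypothetical helper)."""
    problems: List[str] = []
    # instruction is needed in both modes.
    if not isinstance(row.get("instruction"), str):
        problems.append("instruction must be a string")
    if mode == "absolute":
        # Absolute mode evaluates a single generation.
        if not isinstance(row.get("generation"), str):
            problems.append("generation must be a string when mode='absolute'")
    elif mode == "relative":
        # Relative mode compares exactly two generations.
        generations = row.get("generations")
        if not isinstance(generations, list) or len(generations) != 2:
            problems.append("generations must be a list of exactly 2 items when mode='relative'")
    else:
        problems.append(f"unknown mode: {mode!r}")
    # The reference column is only needed when the reference flag is set.
    if reference and not isinstance(row.get("reference"), str):
        problems.append("reference must be a string when reference=True")
    return problems
```

An empty list means the row has the columns the chosen mode expects.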

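Outputs

feedback (str): the feedback explaining the result below, as critiqued by the LLM using the predefined scoring rubric and, if reference is provided, compared against it.
result (Union[int, Literal["A", "B"]]): if mode=absolute, the score for the generation on a 1-5 Likert scale; if mode=relative, either "A" or "B", where the winner is the generation at index 0 of generations when result='A' and the generation at index 1 when result='B'.
model_name (str): the name of the model used to generate the feedback and result.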
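
The mapping from result to a concrete generation in relative mode, and the score range in absolute mode, can be sketched as below. Both functions are hypothetical helpers written only from the description above; they are not part of distilabel.

```python
from typing import List, Union


def winning_generation(generations: List[str], result: str) -> str:
    """Map a relative-mode result ('A' or 'B') to the winning text (hypothetical helper)."""
    if result not in ("A", "B"):
        raise ValueError(f"expected 'A' or 'B', got {result!r}")
    # 'A' refers to index 0 of generations, 'B' to index 1.
    index = 0 if result == "A" else 1
    return generations[index]


def is_valid_score(result: Union[int, str]) -> bool:
    """Check that an absolute-mode result is an integer on the 1-5 scale (hypothetical helper)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(result, int) and not isinstance(result, bool) and 1 <= result <= 5
```

For instance, winning_generation(["something done", "other thing"], "B") selects the second text.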

Examples

Critique and evaluate LLM generation quality using Prometheus 2.0

from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
    ),
    mode="absolute",
    rubric="factual-validity"
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {"instruction": "make something", "generation": "something done"},
        ]
    )
)
# result
# [
#     {
#         'instruction': 'make something',
#         'generation': 'something done',
#         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
#         'feedback': 'the feedback',
#         'result': 5,
#     }
# ]

Critique for relative evaluation

from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
    ),
    mode="relative",
    rubric="honesty"
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {"instruction": "make something", "generations": ["something done", "other thing"]},
        ]
    )
)
# result
# [
#     {
#         'instruction': 'make something',
#         'generations': ['something done', 'other thing'],
#         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
#         'feedback': 'the feedback',
#         'result': 'A',
#     }
# ]

Critique with a custom rubric

from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
    ),
    mode="absolute",
    rubric="custom",
    rubrics={
        "custom": "[A]\nScore 1: A\nScore 2: B\nScore 3: C\nScore 4: D\nScore 5: E"
    }
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {"instruction": "make something", "generation": "something done"},
        ]
    )
)
# result
# [
#     {
#         'instruction': 'make something',
#         'generation': 'something done',
#         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
#         'feedback': 'the feedback',
#         'result': 5,
#     }
# ]

Critique using a reference answer

from distilabel.steps.tasks import PrometheusEval
from distilabel.models import vLLM

# Consider this as a placeholder for your actual LLM.
prometheus = PrometheusEval(
    llm=vLLM(
        model="prometheus-eval/prometheus-7b-v2.0",
        chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
    ),
    mode="absolute",
    rubric="helpfulness",
    reference=True,
)

prometheus.load()

result = next(
    prometheus.process(
        [
            {
                "instruction": "make something",
                "generation": "something done",
                "reference": "this is a reference answer",
            },
        ]
    )
)
# result
# [
#     {
#         'instruction': 'make something',
#         'generation': 'something done',
#         'reference': 'this is a reference answer',
#         'model_name': 'prometheus-eval/prometheus-7b-v2.0',
#         'feedback': 'the feedback',
#         'result': 5,
#     }
# ]