RewardModelScore

Assign a score to a response using a reward model.

RewardModelScore is a Step that uses a reward model (RM) loaded with transformers to assign a score to a response generated for an instruction, or to a multi-turn conversation.
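
Under the hood this follows the standard transformers reward-model pattern: the conversation is rendered with the tokenizer's chat template and passed through a sequence-classification model whose logits act as the reward. The sketch below illustrates that general pattern under the assumption of a plain sequence-classification head; it is not necessarily the exact internals of this Step, and some models (including ArmoRM, used in the examples below) ship custom code that exposes the reward differently.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative sketch only: assumes the reward is the first logit of a
# sequence-classification head, which may differ from the Step's internals.
model_id = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

conversation = [
    {"role": "user", "content": "How much is 2+2?"},
    {"role": "assistant", "content": "4"},
]

# Render the conversation with the model's chat template, then score it.
input_ids = tokenizer.apply_chat_template(
    conversation, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    output = model(input_ids)
score = output.logits[0][0].item()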

Attributes

  • model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.

  • revision: if model refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to "main".

  • torch_dtype: the torch dtype to use for the model, e.g. "float16", "float32", etc. Defaults to "auto".

  • trust_remote_code: whether to allow fetching and executing remote code fetched from the Hub repository. Defaults to False.

  • device_map: a dictionary mapping each layer of the model to a device, or a mode like "sequential" or "auto". Defaults to None.

  • token: the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN environment variable or the local configuration of the huggingface_hub package will be used. Defaults to None.

  • truncation: whether to truncate sequences at the maximum length. Defaults to False.

  • max_length: the maximum length to use for padding or truncation. Defaults to None.
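
As a quick reference, the snippet below instantiates the step with most of these attributes set explicitly. The values are purely illustrative, not recommended defaults.

from distilabel.steps import RewardModelScore

# Every keyword here maps to an attribute documented above; the values are
# examples, not recommendations.
step = RewardModelScore(
    model="RLHFlow/ArmoRM-Llama3-8B-v0.1",  # Hub repo id or local path
    revision="main",
    torch_dtype="auto",
    trust_remote_code=True,
    device_map="auto",
    truncation=True,   # truncate sequences longer than max_length
    max_length=4096,
)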

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[instruction]
            ICOL1[response]
            ICOL2[conversation]
        end
        subgraph New columns
            OCOL0[score]
        end
    end

    subgraph RewardModelScore
        StepInput[Input Columns: instruction, response, conversation]
        StepOutput[Output Columns: score]
    end

    ICOL0 --> StepInput
    ICOL1 --> StepInput
    ICOL2 --> StepInput
    StepOutput --> OCOL0
    StepInput --> StepOutput

Inputs

  • instruction (str, optional): the instruction used to generate a response. If provided, then response must be provided too.

  • response (str, optional): the response generated for instruction. If provided, then instruction must be provided too.

  • conversation (ChatType, optional): a multi-turn conversation. If not provided, then the instruction and response columns must be provided.
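
The three input shapes are equivalent in the obvious way: an instruction/response pair is just a two-turn conversation. A hypothetical helper (prepare_conversation is not part of the library's API) makes the normalization explicit:

# Hypothetical helper, not part of distilabel's public API: normalizes a row
# into a conversation, mirroring how either input shape can be scored.
def prepare_conversation(row: dict) -> list[dict]:
    if "conversation" in row:
        return row["conversation"]
    return [
        {"role": "user", "content": row["instruction"]},
        {"role": "assistant", "content": row["response"]},
    ]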

Outputs

  • score (float): the score given by the reward model for the instruction-response pair or the conversation.
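
Since score is a plain float column, downstream selection is ordinary Python. For example, keeping the highest-scoring response for an instruction (values taken from the example output below):

# Plain-Python usage sketch: pick the best of two scored responses.
rows = [
    {"instruction": "How much is 2+2?", "response": "The output of 2+2 is 4", "score": 0.1169},
    {"instruction": "How much is 2+2?", "response": "4", "score": 0.1030},
]
best = max(rows, key=lambda row: row["score"])
print(best["response"])  # The output of 2+2 is 4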

Examples

Assigning a score to an instruction-response pair

from distilabel.steps import RewardModelScore

step = RewardModelScore(
    model="RLHFlow/ArmoRM-Llama3-8B-v0.1", device_map="auto", trust_remote_code=True
)

step.load()

result = next(
    step.process(
        inputs=[
            {
                "instruction": "How much is 2+2?",
                "response": "The output of 2+2 is 4",
            },
            {"instruction": "How much is 2+2?", "response": "4"},
        ]
    )
)
# [
#   {'instruction': 'How much is 2+2?', 'response': 'The output of 2+2 is 4', 'score': 0.11690367758274078},
#   {'instruction': 'How much is 2+2?', 'response': '4', 'score': 0.10300665348768234}
# ]

Assigning a score to a multi-turn conversation

from distilabel.steps import RewardModelScore

step = RewardModelScore(
    model="RLHFlow/ArmoRM-Llama3-8B-v0.1", device_map="auto", trust_remote_code=True
)

step.load()

result = next(
    step.process(
        inputs=[
            {
                "conversation": [
                    {"role": "user", "content": "How much is 2+2?"},
                    {"role": "assistant", "content": "The output of 2+2 is 4"},
                ],
            },
            {
                "conversation": [
                    {"role": "user", "content": "How much is 2+2?"},
                    {"role": "assistant", "content": "4"},
                ],
            },
        ]
    )
)
# [
#   {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': 'The output of 2+2 is 4'}], 'score': 0.11690367758274078},
#   {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': '4'}], 'score': 0.10300665348768234}
# ]