RewardModelScore

Assign a score to a response using a reward model.

RewardModelScore is a Step that uses a reward model (RM) loaded with transformers to assign a score to a response generated for an instruction, or to a multi-turn conversation.
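
Under the hood this follows the standard transformers reward-model pattern: the conversation is rendered with the tokenizer's chat template and passed through a sequence-classification model whose logits act as the reward. The sketch below illustrates that general pattern under the assumption of a plain sequence-classification head; it is not necessarily the exact internals of this Step, and some models (including ArmoRM, used in the examples below) ship custom code that exposes the reward differently.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative sketch only: assumes the reward is the first logit of a
# sequence-classification head, which may differ from the Step's internals.
model_id = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

conversation = [
    {"role": "user", "content": "How much is 2+2?"},
    {"role": "assistant", "content": "4"},
]

# Render the conversation with the model's chat template, then score it.
input_ids = tokenizer.apply_chat_template(
    conversation, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    output = model(input_ids)
score = output.logits[0][0].item()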

Attributes

  • model: the model Hugging Face Hub repo id or a path to a directory containing the model weights and configuration files.

  • revision: if model refers to a Hugging Face Hub repository, then the revision (e.g. a branch name or a commit id) to use. Defaults to "main".

  • torch_dtype: the torch dtype to use for the model, e.g. "float16", "float32", etc. Defaults to "auto".

  • trust_remote_code: whether to allow fetching and executing remote code fetched from the Hub repository. Defaults to False.

  • device_map: a dictionary mapping each layer of the model to a device, or a mode like "sequential" or "auto". Defaults to None.

  • token: the Hugging Face Hub token that will be used to authenticate to the Hugging Face Hub. If not provided, the HF_TOKEN environment variable or the local configuration of the huggingface_hub package will be used. Defaults to None.

  • truncation: whether to truncate sequences at the maximum length. Defaults to False.

  • max_length: the maximum length to use for padding or truncation. Defaults to None.
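
As a quick reference, the snippet below instantiates the step with most of these attributes set explicitly. The values are purely illustrative, not recommended defaults.

from distilabel.steps import RewardModelScore

# Every keyword here maps to an attribute documented above; the values are
# examples, not recommendations.
step = RewardModelScore(
    model="RLHFlow/ArmoRM-Llama3-8B-v0.1",  # Hub repo id or local path
    revision="main",
    torch_dtype="auto",
    trust_remote_code=True,
    device_map="auto",
    truncation=True,   # truncate sequences longer than max_length
    max_length=4096,
)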

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[instruction]
            ICOL1[response]
            ICOL2[conversation]
        end
        subgraph New columns
            OCOL0[score]
        end
    end

    subgraph RewardModelScore
        StepInput[Input Columns: instruction, response, conversation]
        StepOutput[Output Columns: score]
    end

    ICOL0 --> StepInput
    ICOL1 --> StepInput
    ICOL2 --> StepInput
    StepOutput --> OCOL0
    StepInput --> StepOutput

Inputs

  • instruction (str, optional): the instruction used to generate a response. If provided, then response must be provided too.

  • response (str, optional): the response generated for instruction. If provided, then instruction must be provided too.

  • conversation (ChatType, optional): a multi-turn conversation. If not provided, then the instruction and response columns must be provided.
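
The three input shapes are equivalent in the obvious way: an instruction/response pair is just a two-turn conversation. A hypothetical helper (prepare_conversation is not part of the library's API) makes the normalization explicit:

# Hypothetical helper, not part of distilabel's public API: normalizes a row
# into a conversation, mirroring how either input shape can be scored.
def prepare_conversation(row: dict) -> list[dict]:
    if "conversation" in row:
        return row["conversation"]
    return [
        {"role": "user", "content": row["instruction"]},
        {"role": "assistant", "content": row["response"]},
    ]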

Outputs

  • score (float): the score given by the reward model for the instruction-response pair or the conversation.
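
Since score is a plain float column, downstream selection is ordinary Python. For example, keeping the highest-scoring response for an instruction (values taken from the example output below):

# Plain-Python usage sketch: pick the best of two scored responses.
rows = [
    {"instruction": "How much is 2+2?", "response": "The output of 2+2 is 4", "score": 0.1169},
    {"instruction": "How much is 2+2?", "response": "4", "score": 0.1030},
]
best = max(rows, key=lambda row: row["score"])
print(best["response"])  # The output of 2+2 is 4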

Examples

Assigning a score to an instruction-response pair

from distilabel.steps import RewardModelScore

step = RewardModelScore(
    model="RLHFlow/ArmoRM-Llama3-8B-v0.1", device_map="auto", trust_remote_code=True
)

step.load()

result = next(
    step.process(
        inputs=[
            {
                "instruction": "How much is 2+2?",
                "response": "The output of 2+2 is 4",
            },
            {"instruction": "How much is 2+2?", "response": "4"},
        ]
    )
)
# [
#   {'instruction': 'How much is 2+2?', 'response': 'The output of 2+2 is 4', 'score': 0.11690367758274078},
#   {'instruction': 'How much is 2+2?', 'response': '4', 'score': 0.10300665348768234}
# ]

Assigning a score to a multi-turn conversation

from distilabel.steps import RewardModelScore

step = RewardModelScore(
    model="RLHFlow/ArmoRM-Llama3-8B-v0.1", device_map="auto", trust_remote_code=True
)

step.load()

result = next(
    step.process(
        inputs=[
            {
                "conversation": [
                    {"role": "user", "content": "How much is 2+2?"},
                    {"role": "assistant", "content": "The output of 2+2 is 4"},
                ],
            },
            {
                "conversation": [
                    {"role": "user", "content": "How much is 2+2?"},
                    {"role": "assistant", "content": "4"},
                ],
            },
        ]
    )
)
# [
#   {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': 'The output of 2+2 is 4'}], 'score': 0.11690367758274078},
#   {'conversation': [{'role': 'user', 'content': 'How much is 2+2?'}, {'role': 'assistant', 'content': '4'}], 'score': 0.10300665348768234}
# ]