
Create exam questions using structured generation

This example shows how to generate exam questions and answers from a page of text. We will use a Wikipedia page as the example and show how prompting can help the model generate the data in the appropriate format.

We will use meta-llama/Meta-Llama-3.1-8B-Instruct to generate questions and answers for a mock exam from a Wikipedia page, in this case the Transfer learning entry. With structured generation we will guide the model to create structured data that is easy to parse: a question, the correct answer, and distractors (incorrect answers).

Click to see a sample of the results

Sample page: Transfer learning

(screenshot of the wiki page)

Q&A generated from the page:

{
    "exam": [
        {
            "answer": "A technique in machine learning where knowledge learned from a task is re-used to boost performance on a related task.",
            "distractors": ["A type of neural network architecture", "A machine learning algorithm for image classification", "A method for data preprocessing"],
            "question": "What is transfer learning?"
        },
        {
            "answer": "1976",
            "distractors": ["1981", "1992", "1998"],
            "question": "In which year did Bozinovski and Fulgosi publish a paper addressing transfer learning in neural network training?"
        },
        {
            "answer": "Discriminability-based transfer (DBT) algorithm",
            "distractors": ["Multi-task learning", "Learning to Learn", "Cost-sensitive machine learning"],
            "question": "What algorithm was formulated by Lorien Pratt in 1992?"
        },
        {
            "answer": "A domain consists of a feature space and a marginal probability distribution.",
            "distractors": ["A domain consists of a label space and an objective predictive function.", "A domain consists of a task and a learning algorithm.", "A domain consists of a dataset and a model."],
            "question": "What is the definition of a domain in the context of transfer learning?"
        },
        {
            "answer": "Transfer learning aims to help improve the learning of the target predictive function in the target domain using the knowledge in the source domain and learning task.",
            "distractors": ["Transfer learning aims to learn a new task from scratch.", "Transfer learning aims to improve the learning of the source predictive function in the source domain.", "Transfer learning aims to improve the learning of the target predictive function in the source domain."],
            "question": "What is the goal of transfer learning?"
        },
        {
            "answer": "Markov logic networks, Bayesian networks, cancer subtype discovery, building utilization, general game playing, text classification, digit recognition, medical imaging, and spam filtering.",
            "distractors": ["Supervised learning, unsupervised learning, reinforcement learning, natural language processing, computer vision, and robotics.", "Image classification, object detection, segmentation, and tracking.", "Speech recognition, sentiment analysis, and topic modeling."],
            "question": "What are some applications of transfer learning?"
        },
        {
            "answer": "ADAPT (Python), TLib (Python), Domain-Adaptation-Toolbox (Matlab)",
            "distractors": ["TensorFlow, PyTorch, Keras", "Scikit-learn, OpenCV, NumPy", "Matlab, R, Julia"],
            "question": "What are some software implementations of transfer learning and domain adaptation algorithms?"
        }
    ]
}
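
Since the output is plain JSON with a fixed schema, it can be consumed without any custom parsing. Below is a minimal sketch of how a generation like the one above could be read back; the raw_output variable is just an illustrative stand-in for the generated string, and only the first question is kept for brevity:

import json

# raw_output stands in for the JSON string generated by the pipeline;
# only the first question of the example above is kept for brevity.
raw_output = """
{
    "exam": [
        {
            "question": "What is transfer learning?",
            "answer": "A technique in machine learning where knowledge learned from a task is re-used to boost performance on a related task.",
            "distractors": ["A type of neural network architecture", "A machine learning algorithm for image classification", "A method for data preprocessing"]
        }
    ]
}
"""

exam = json.loads(raw_output)["exam"]
for item in exam:
    print(item["question"])
    print("  correct:", item["answer"])
    print("  distractors:", ", ".join(item["distractors"]))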

Build the pipeline

Let's see how to build the pipeline to obtain this kind of data:

from typing import List
from pathlib import Path

from pydantic import BaseModel, Field

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

import wikipedia

page = wikipedia.page(title="Transfer_learning")  # (1)


class ExamQuestion(BaseModel):
    question: str = Field(..., description="The question to be answered")
    answer: str = Field(..., description="The correct answer to the question")
    distractors: List[str] = Field(
        ..., description="A list of incorrect but viable answers to the question"
    )

class ExamQuestions(BaseModel):  # (2)
    exam: List[ExamQuestion]


SYSTEM_PROMPT = """\
You are an exam writer specialized in writing exams for students.
Your goal is to create questions and answers based on the document provided, and a list of distractors, that are incorrect but viable answers to the question.
Your answer must adhere to the following format:
```
[
    {
        "question": "Your question",
        "answer": "The correct answer to the question",
        "distractors": ["wrong answer 1", "wrong answer 2", "wrong answer 3"]
    },
    ... (more questions and answers as required)
]
```
""".strip() # (3)


with Pipeline(name="ExamGenerator") as pipeline:

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "page": page.content,  # (4)
            }
        ],
    )

    text_generation = TextGeneration(  # (5)
        name="exam_generation",
        system_prompt=SYSTEM_PROMPT,
        template="Generate a list of answers and questions about the document. Document:\n\n{{ page }}",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            structured_output={
                "schema": ExamQuestions.model_json_schema(),
                "format": "json"
            },
        ),
        input_batch_size=8,
        output_mappings={"model_name": "generation_model"},
    )

    load_dataset >> text_generation  # (6)
  1. Download a single page for the demo. We could just as well download the pages beforehand, or apply the same process to any other kind of documents. In a real use case we would first need to create a dataset from those documents; a sketch of how that could look is shown after this list.

  2. Define the structure required for the answer using Pydantic. In this case we want, for each page, a list with questions and answers (we also added distractors, which could be ignored for this case). So our output will be an ExamQuestions model, which is a list of ExamQuestion, where each one consists of question and answer string fields. The language model will use the field descriptions to generate the values.

  3. Use a system prompt to steer the model towards the behaviour we want. Independently of the structured output we force the model to produce, it helps to also pass the expected format through the prompt.

  4. Move the content of the Wikipedia page to a row of the dataset.

  5. The TextGeneration task receives the system prompt and, via the template argument, a user prompt in which we ask the model to generate questions and answers based on the page content, which is taken from the corresponding column of the loaded data.

  6. Connect both steps, and we are done.
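
As mentioned in point 1, a real use case would start from a dataset of documents rather than a single page. Below is a minimal sketch, still using the wikipedia package, of how several pages could be collected into the rows expected by LoadDataFromDicts; the extra titles are arbitrary examples, and error handling (disambiguation, missing pages) is left out:

import wikipedia

# Arbitrary example titles; in practice these would come from your own corpus.
titles = ["Transfer_learning", "Machine_learning", "Artificial_neural_network"]

# One row per document, matching the single-row example used above.
data = [{"page": wikipedia.page(title=title).content} for title in titles]

This list could then be passed as the data argument of LoadDataFromDicts in place of the single-row list used above.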

Run the example

To run this example you first need to install the wikipedia dependency used to download the sample data, i.e. pip install wikipedia. If you want to push the dataset to the Hub under your own account, change the username first.

Run:
python examples/exam_questions.py
exam_questions.py
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import List

import wikipedia
from pydantic import BaseModel, Field

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

page = wikipedia.page(title="Transfer_learning")


class ExamQuestion(BaseModel):
    question: str = Field(..., description="The question to be answered")
    answer: str = Field(..., description="The correct answer to the question")
    distractors: List[str] = Field(
        ..., description="A list of incorrect but viable answers to the question"
    )


class ExamQuestions(BaseModel):
    exam: List[ExamQuestion]


SYSTEM_PROMPT = """\
You are an exam writer specialized in writing exams for students.
Your goal is to create questions and answers based on the document provided, and a list of distractors, that are incorrect but viable answers to the question.
Your answer must adhere to the following format:
```
[
    {
        "question": "Your question",
        "answer": "The correct answer to the question",
        "distractors": ["wrong answer 1", "wrong answer 2", "wrong answer 3"]
    },
    ... (more questions and answers as required)
]
```
""".strip()


with Pipeline(name="ExamGenerator") as pipeline:
    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "page": page.content,
            }
        ],
    )

    text_generation = TextGeneration(
        name="exam_generation",
        system_prompt=SYSTEM_PROMPT,
        template="Generate a list of answers and questions about the document. Document:\n\n{{ page }}",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            structured_output={
                "schema": ExamQuestions.model_json_schema(),
                "format": "json",
            },
        ),
        input_batch_size=8,
        output_mappings={"model_name": "generation_model"},
    )
    load_dataset >> text_generation


if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "max_new_tokens": 2048,
                    }
                }
            }
        },
        use_cache=False,
    )
    distiset.push_to_hub("USERNAME/exam_questions")
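
Once the pipeline has finished and the distiset has been pushed, the generated exam can be pulled back from the Hub and validated against the same schema. The following is a hedged sketch: the repository name is the one used in push_to_hub above, the TextGeneration output is assumed to live in a generation column, and the config and split names may vary between distilabel versions, so check the dataset card of the pushed repository if the call fails:

from typing import List

from datasets import load_dataset
from pydantic import BaseModel


# A minimal copy of the schema used by the pipeline, enough to validate the output.
class ExamQuestion(BaseModel):
    question: str
    answer: str
    distractors: List[str]


class ExamQuestions(BaseModel):
    exam: List[ExamQuestion]


# Config and split names are assumptions; check the pushed dataset card if needed.
ds = load_dataset("USERNAME/exam_questions", "default", split="train")

# The TextGeneration task is expected to store the raw model output in "generation".
exam = ExamQuestions.model_validate_json(ds[0]["generation"])
for item in exam.exam:
    print(item.question, "->", item.answer)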