TextGenerationWithImage

Text generation with images using an LLM and a prompt.

TextGenerationWithImage is a pre-defined task that allows passing a custom prompt using the Jinja2 syntax. By default, an instruction is expected in the inputs, but using the template and columns attributes one can define a custom prompt and the columns expected from the text. Additionally, an image column is expected containing one of a URL, a base64 encoded image, or a PIL image. This task inherits from TextGeneration, so all the functionality available in that task related to the prompt will be available here too.
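
For instance, a custom prompt can combine several input columns through the template. The sketch below is illustrative only: the question and context column names, the template text, and the system prompt are assumptions for the example, not defaults of the task.

from distilabel.steps.tasks import TextGenerationWithImage
from distilabel.models.llms import InferenceEndpointsLLM

# Hypothetical custom prompt: "question" and "context" are example column
# names, referenced in the template with Jinja2 placeholders.
vision = TextGenerationWithImage(
    name="vision_gen_custom",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
    ),
    system_prompt="You are an assistant that answers questions about images.",
    template="Answer the question using the provided context.\nQuestion: {{ question }}\nContext: {{ context }}",
    columns=["question", "context"],
    image_type="url",
)

With this configuration, each input row would be expected to contain question, context, and image columns.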

Attributes

  • system_prompt: The system prompt to use in the generation. If not set, no system prompt will be used. Defaults to None.

  • template: The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, it will assume the text passed is an instruction and construct the appropriate template.

  • columns: A string with the column, or a list of the columns expected in the template. Take a look at the examples for more information. Defaults to instruction.

  • image_type: The type of the image provided; this will be used to preprocess the image if necessary. Must be one of "url", "base64" or "PIL".

Input & Output Columns

graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[dynamic]
        end
        subgraph New columns
            OCOL0[generation]
            OCOL1[model_name]
        end
    end

    subgraph TextGenerationWithImage
        StepInput[Input Columns: dynamic]
        StepOutput[Output Columns: generation, model_name]
    end

    ICOL0 --> StepInput
    StepOutput --> OCOL0
    StepOutput --> OCOL1
    StepInput --> StepOutput

Input

  • dynamic (determined by the columns attribute): By default will be set to instruction. The columns can point either to a str or a list[str] to be used in the template.

Output

  • generation (str): The generated text.

  • model_name (str): The name of the model used to generate the text.

Examples

Answer questions from an image

from distilabel.steps.tasks import TextGenerationWithImage
from distilabel.models.llms import InferenceEndpointsLLM

vision = TextGenerationWithImage(
    name="vision_gen",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
    ),
    image_type="url"
)

vision.load()

result = next(
    vision.process(
        [
            {
                "instruction": "What’s in this image?",
                "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
            }
        ]
    )
)
# result
# [
#     {
#         "instruction": "What’s in this image?",
#         "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
#         "generation": "Based on the visual cues in the image...",
#         "model_name": "meta-llama/Llama-3.2-11B-Vision-Instruct"
#         ... # distilabel_metadata would be here
#     }
# ]
# result[0]["generation"]
# "Based on the visual cues in the image, here are some possible story points:

* The image features a wooden boardwalk leading through a lush grass field, possibly in a park or nature reserve.

Analysis and Ideas:
* The abundance of green grass and trees suggests a healthy ecosystem or habitat.
* The presence of wildlife, such as birds or deer, is possible based on the surroundings.
* A footbridge or a pathway might be a common feature in this area, providing access to nearby attractions or points of interest.

Additional Questions to Ask:
* Why is a footbridge present in this area?
* What kind of wildlife inhabits this region"

Answer questions from an image stored as base64

# For this example we will assume that we have the string representation of the image
# stored, but we will just download the image and convert it to base64 to illustrate the example.
import requests
import base64

image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
img = requests.get(image_url).content
base64_image = base64.b64encode(img).decode("utf-8")

from distilabel.steps.tasks import TextGenerationWithImage
from distilabel.models.llms import InferenceEndpointsLLM

vision = TextGenerationWithImage(
    name="vision_gen",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
    ),
    image_type="base64"
)

vision.load()

result = next(
    vision.process(
        [
            {
                "instruction": "What’s in this image?",
                "image": base64_image
            }
        ]
    )
)
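
Answer questions from an image as a PIL object

The task can also receive PIL images directly by setting image_type="PIL". The following is a minimal sketch mirroring the base64 example above, assuming the Pillow package is installed:

# A PIL image can be passed as-is in the "image" column.
import requests
from PIL import Image

image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

from distilabel.steps.tasks import TextGenerationWithImage
from distilabel.models.llms import InferenceEndpointsLLM

vision = TextGenerationWithImage(
    name="vision_gen",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
    ),
    image_type="PIL"
)

vision.load()

result = next(
    vision.process(
        [
            {
                "instruction": "What’s in this image?",
                "image": image
            }
        ]
    )
)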

References

  • Jinja2 Template Designer Documentation: https://jinja.palletsprojects.com/en/3.1.x/templates/