TextGenerationWithImage¶
Text generation with images using an LLM and a prompt.

TextGenerationWithImage is a pre-defined task that allows passing a custom prompt using the Jinja2 syntax. By default, an `instruction` is expected in the inputs, but the `template` and `columns` attributes can be used to define a custom prompt and the columns expected in the text. Additionally, an `image` column is required, containing one of: a URL, a base64-encoded image, or a PIL image. This task inherits from `TextGeneration`, so all the prompt-related functionality of that task is also available here.
Attributes¶
- system_prompt: The system prompt to use in the generation. If not set, no system prompt will be used. Defaults to `None`.
- template: The template to use for the generation. It must follow the Jinja2 template syntax. If not provided, the text passed is assumed to be an instruction and the corresponding template is built.
- columns: A string with the name of the column, or a list of columns, expected in the template. Take a look at the examples for more information. Defaults to `instruction`.
- image_type: The type of the image provided; it will be used to preprocess the image where necessary. Must be one of `"url"`, `"base64"`, or `"PIL"`.
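As a sketch of how `template` and `columns` interact, the snippet below renders a hypothetical two-column Jinja2 template the same way the task builds a prompt from a row. The column names `document` and `question` are illustrative, not defaults:

```python
from jinja2 import Template

# Hypothetical custom template referencing two columns.
# With this template, the task would be configured with
# columns=["document", "question"].
template = Template(
    "Answer based on the document.\n\n"
    "Document:\n{{ document }}\n\n"
    "Question: {{ question }}"
)

# One input row providing the two expected columns.
row = {
    "document": "The boardwalk crosses a restored prairie.",
    "question": "What does the boardwalk cross?",
}
prompt = template.render(**row)
print(prompt)
```

Each placeholder in the template is filled from the column of the same name, which is why the names listed in `columns` must match the variables used in the template.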
Input & Output Columns¶
graph TD
subgraph Dataset
subgraph Columns
ICOL0[dynamic]
end
subgraph New columns
OCOL0[generation]
OCOL1[model_name]
end
end
subgraph TextGenerationWithImage
StepInput[Input Columns: dynamic]
StepOutput[Output Columns: generation, model_name]
end
ICOL0 --> StepInput
StepOutput --> OCOL0
StepOutput --> OCOL1
StepInput --> StepOutput
输入¶
- dynamic (由
columns
属性确定): 默认情况下将设置为instruction
。这些列可以指向str
或list[str]
,以便在模板中使用。
Outputs¶
- generation (`str`): The generated text.
- model_name (`str`): The name of the model used to generate the text.
Examples¶
Answer questions from an image¶
from distilabel.steps.tasks import TextGenerationWithImage
from distilabel.models.llms import InferenceEndpointsLLM
vision = TextGenerationWithImage(
name="vision_gen",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
),
image_type="url"
)
vision.load()
result = next(
vision.process(
[
{
"instruction": "What’s in this image?",
"image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
]
)
)
# result
# [
# {
# "instruction": "What’s in this image?",
# "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
# "generation": "Based on the visual cues in the image...",
# "model_name": "meta-llama/Llama-3.2-11B-Vision-Instruct"
# ... # distilabel_metadata would be here
# }
# ]
# result[0]["generation"]
# "Based on the visual cues in the image, here are some possible story points:
# * The image features a wooden boardwalk leading through a lush grass field, possibly in a park or nature reserve.
# Analysis and Ideas:
# * The abundance of green grass and trees suggests a healthy ecosystem or habitat.
# * The presence of wildlife, such as birds or deer, is possible based on the surroundings.
# * A footbridge or a pathway might be a common feature in this area, providing access to nearby attractions or points of interest.
# Additional Questions to Ask:
# * Why is a footbridge present in this area?
# * What kind of wildlife inhabits this region"
Answer questions from an image stored as base64¶
# For this example we will assume that we have the string representation of the image
# stored, but will just take the image and transform it to base64 to illustrate the example.
import requests
import base64
image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
img = requests.get(image_url).content
base64_image = base64.b64encode(img).decode("utf-8")
from distilabel.steps.tasks import TextGenerationWithImage
from distilabel.models.llms import InferenceEndpointsLLM
vision = TextGenerationWithImage(
name="vision_gen",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
),
image_type="base64"
)
vision.load()
result = next(
vision.process(
[
{
"instruction": "What’s in this image?",
"image": base64_image
}
]
)
)
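Since the task only receives the base64 string, it can be worth sanity-checking the encoding locally before building the dataset. A minimal round-trip sketch using only the standard library (the byte payload is illustrative, standing in for real image bytes):

```python
import base64

# Illustrative payload standing in for real image bytes.
img_bytes = b"\x89PNG\r\n\x1a\n fake image payload"

# Encode the way the example above does...
b64 = base64.b64encode(img_bytes).decode("utf-8")

# ...and verify the string decodes back to the original bytes
# before handing it to TextGenerationWithImage.
assert base64.b64decode(b64) == img_bytes
```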