EmbeddingDedup¶
Deduplicate texts using embeddings.
EmbeddingDedup is a Step that detects near-duplicates in datasets, using embeddings to compare the similarity between texts. The typical workflow for this step is to have a dataset with precomputed embeddings and then, using the nn_indices and nn_scores (possibly obtained via FaissNearestNeighbour), determine which texts are duplicates.
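A typical pipeline therefore runs a nearest-neighbour step right before the dedup step. The snippet below is a minimal sketch of that wiring, assuming a dataset that already carries an embedding column; the repo id and the k and threshold values are illustrative placeholders, not recommendations.
from distilabel.pipeline import Pipeline
from distilabel.steps import EmbeddingDedup, FaissNearestNeighbour, LoadDataFromHub

with Pipeline() as pipeline:
    # Hypothetical dataset with a precomputed `embedding` column.
    data = LoadDataFromHub(repo_id="my-org/personas-with-embeddings")
    # Adds `nn_indices` and `nn_scores` to each row.
    nn = FaissNearestNeighbour(k=5)
    # Flags near-duplicates based on those columns.
    dedup = EmbeddingDedup(threshold=0.9)

    data >> nn >> dedup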
Attributes¶
- threshold: the threshold to consider two examples as duplicates. It depends on the type of index used to generate the embeddings. For example, if the embeddings were generated using cosine similarity, a threshold of 0.9 would make all texts with a cosine similarity above that value duplicates. Higher values detect fewer duplicates in such an index, but that should be taken into account when building it. Defaults to 0.9.
Runtime parameters:
- threshold: the threshold to consider two examples as duplicates.
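To make the threshold concrete, here is a small self-contained sketch (plain NumPy, not distilabel code) of the comparison it implies: for normalized embeddings the inner product equals the cosine similarity, and a pair counts as duplicates when it exceeds the threshold.
import numpy as np

threshold = 0.9  # same value as the step's default

# Toy 2-d embeddings, (approximately) unit-normalized.
a = np.array([0.6, 0.8])
b = np.array([0.707, 0.707])

# For unit vectors the inner product is the cosine similarity.
cosine_similarity = float(np.dot(a, b))
print(cosine_similarity > threshold)  # ~0.99 > 0.9 -> True: near-duplicates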
Input & Output Columns¶
graph TD
    subgraph Dataset
        subgraph Columns
            ICOL0[nn_indices]
            ICOL1[nn_scores]
        end
        subgraph New columns
            OCOL0[keep_row_after_embedding_filtering]
        end
    end

    subgraph EmbeddingDedup
        StepInput[Input Columns: nn_indices, nn_scores]
        StepOutput[Output Columns: keep_row_after_embedding_filtering]
    end

    ICOL0 --> StepInput
    ICOL1 --> StepInput
    StepOutput --> OCOL0
    StepInput --> StepOutput
Inputs¶
- nn_indices (List[int]): a list containing the indices of the k nearest neighbours of the row in the inputs.
- nn_scores (List[float]): a list containing the score or distance to each of the k nearest neighbours in the inputs.
Outputs¶
- keep_row_after_embedding_filtering (bool): boolean indicating whether the piece of text is not a duplicate, i.e. this text should be kept.
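As a rough illustration of how these columns drive that flag (a hypothetical re-implementation for intuition, not distilabel's actual internals), one plausible rule drops a row whenever a not-yet-dropped neighbour scores above the threshold, which keeps exactly one row per duplicate group:
threshold = 0.9

rows = [
    {"nn_indices": [1], "nn_scores": [0.95]},  # near-duplicate of row 1
    {"nn_indices": [0], "nn_scores": [0.95]},  # near-duplicate of row 0
    {"nn_indices": [0], "nn_scores": [0.30]},  # no close neighbour
]

discarded = set()
for i, row in enumerate(rows):
    # Drop the row if a surviving neighbour is too similar.
    duplicate = any(
        score > threshold and j not in discarded
        for j, score in zip(row["nn_indices"], row["nn_scores"])
    )
    row["keep_row_after_embedding_filtering"] = not duplicate
    if duplicate:
        discarded.add(i)

print([r["keep_row_after_embedding_filtering"] for r in rows])  # [False, True, True]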
Examples¶
Deduplicate a list of texts using embedding information¶
from distilabel.pipeline import Pipeline
from distilabel.steps import EmbeddingDedup
from distilabel.steps import LoadDataFromDicts

batch_size = 3  # batch size shared by the load and dedup steps

with Pipeline() as pipeline:
    data = LoadDataFromDicts(
        data=[
            {
                "persona": "A chemistry student or academic researcher interested in inorganic or physical chemistry, likely at an advanced undergraduate or graduate level, studying acid-base interactions and chemical bonding.",
                "embedding": [
                    0.018477669046149742,
                    -0.03748236608841726,
                    0.001919870620352492,
                    0.024918478063770535,
                    0.02348063521315178,
                    0.0038251285566308375,
                    -0.01723884983037716,
                    0.02881971942372201,
                ],
                "nn_indices": [0, 1],
                "nn_scores": [
                    0.9164746999740601,
                    0.782106876373291,
                ],
            },
            {
                "persona": "A music teacher or instructor focused on theoretical and practical piano lessons.",
                "embedding": [
                    -0.0023464179614082125,
                    -0.07325472251663565,
                    -0.06058678419516501,
                    -0.02100326928586996,
                    -0.013462744792362657,
                    0.027368447064244242,
                    -0.003916070100455717,
                    0.01243614518480423,
                ],
                "nn_indices": [0, 2],
                "nn_scores": [
                    0.7552462220191956,
                    0.7261884808540344,
                ],
            },
            {
                "persona": "A classical guitar teacher or instructor, likely with experience teaching beginners, who focuses on breaking down complex music notation into understandable steps for their students.",
                "embedding": [
                    -0.01630817942328242,
                    -0.023760151552345232,
                    -0.014249650090627883,
                    -0.005713686451446624,
                    -0.016033059279131567,
                    0.0071440908501058786,
                    -0.05691099643425161,
                    0.01597412704817784,
                ],
                "nn_indices": [1, 2],
                "nn_scores": [
                    0.8107735514640808,
                    0.7172299027442932,
                ],
            },
        ],
        batch_size=batch_size,
    )
    # In general you should do something like this before the deduplication step, to obtain
    # the `nn_indices` and `nn_scores`. In this case the embeddings are already normalized,
    # so there's no need for it.
    # nn = FaissNearestNeighbour(
    #     k=30,
    #     metric_type=faiss.METRIC_INNER_PRODUCT,
    #     search_batch_size=50,
    #     train_size=len(dataset),  # The number of embeddings to use for training
    #     string_factory="IVF300_HNSW32,Flat",  # To use an index (optional, maybe required for big datasets)
    # )
    # Read more about the `string_factory` here:
    # https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
    embedding_dedup = EmbeddingDedup(
        threshold=0.8,
        input_batch_size=batch_size,
    )

    data >> embedding_dedup

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    ds = distiset["default"]["train"]
    # Filter out the duplicates
    ds_dedup = ds.filter(lambda x: x["keep_row_after_embedding_filtering"])