image_captioning_mapper

February 4, 2026

Generates image captions using a Hugging Face model and appends them to samples.

This operator generates captions for images in the input samples using a specified Hugging Face model. It can generate multiple captions per image and apply different strategies to retain the generated captions. The operator supports three retention modes: 'random_any', 'similar_one_simhash', and 'all'. In 'random_any' mode, a random caption is retained. In 'similar_one_simhash' mode, the most similar caption to the original text (based on SimHash) is retained. In 'all' mode, all generated captions are concatenated and retained. The operator can also keep or discard the original sample based on the keep_original_sample parameter. If both prompt and prompt_key are set, the prompt_key takes precedence.
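
The three retention modes and the prompt precedence rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the operator's actual implementation: `simhash64`, `hamming`, `reduce_captions`, and `resolve_prompt` are hypothetical helper names, the SimHash here is a toy 64-bit variant over whitespace tokens, and falling back to the global prompt when a sample has no usable per-sample prompt is an assumption.

```python
import hashlib
import random
from typing import List, Optional


def simhash64(text: str) -> int:
    """Toy 64-bit SimHash over whitespace tokens (illustrative stand-in only)."""
    weights = [0] * 64
    for token in text.lower().split():
        # Use the first 8 bytes of the MD5 digest as a 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()[:8]
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set when its summed weight is positive.
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def reduce_captions(
    candidates: List[str],
    original_text: str,
    mode: str = "random_any",
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick which generated captions to keep for one image."""
    if mode == "all":
        return list(candidates)
    if mode == "random_any":
        return [(rng or random).choice(candidates)]
    if mode == "similar_one_simhash":
        target = simhash64(original_text)
        # Smallest Hamming distance means most similar fingerprint.
        return [min(candidates, key=lambda c: hamming(simhash64(c), target))]
    raise ValueError(f"unknown keep_candidate_mode: {mode!r}")


def resolve_prompt(
    sample: dict,
    prompt: Optional[str] = None,
    prompt_key: Optional[str] = None,
) -> Optional[str]:
    """Per-sample prompt (via prompt_key) wins over the global prompt."""
    if prompt_key is not None and isinstance(sample.get(prompt_key), str):
        return sample[prompt_key]
    # Assumption: fall back to the global prompt otherwise.
    return prompt
```

For example, `reduce_captions(captions, original_caption, mode="similar_one_simhash")` returns a one-element list, while `mode="all"` returns every candidate.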

Type: mapper

Tags: gpu, hf, multimodal

🔧 Parameter Configuration

| name | type | default | desc |
|------|------|---------|------|
| `hf_img2seq` | `<class 'str'>` | `'Salesforce/blip2-opt-2.7b'` | Name of the Hugging Face model used to generate captions. |
| `trust_remote_code` | `<class 'bool'>` | `False` | Whether to trust the remote code of HF models. |
| `caption_num` | `typing.Annotated[int, Gt(gt=0)]` | `1` | How many candidate captions to generate for each image. |
| `keep_candidate_mode` | `<class 'str'>` | `'random_any'` | Retention strategy for the generated `caption_num` candidates. `'random_any'`: retain a random one of the generated captions. `'similar_one_simhash'`: retain the generated caption that is most similar to the original caption. `'all'`: retain all generated captions by concatenation. Note: this is a batched OP, whose input and output types are both lists. Suppose there are $N$ lists of input samples, each with batch size $b$, and denote `caption_num` as $M$. For the `'random_any'` and `'similar_one_simhash'` modes, the total number of samples after generation is $2Nb$ when `keep_original_sample` is True and $Nb$ when it is False. For the `'all'` mode, it is $(1+M)Nb$ when `keep_original_sample` is True and $MNb$ when it is False. |
| `keep_original_sample` | `<class 'bool'>` | `True` | Whether to keep the original sample. If set to False, the final dataset contains only the generated captions and the original captions are removed. Defaults to True. |
| `prompt` | `typing.Optional[str]` | `None` | A string prompt that guides the BLIP-2 model's generation for all samples globally. Defaults to None, which means no prompt is provided. |
| `prompt_key` | `typing.Optional[str]` | `None` | The key of the sample field that stores a per-sample prompt. Use it to set different prompts for different samples. If it is None, the prompt in the `prompt` parameter is used. Defaults to None. |
| `gpu_batch_size` | `typing.Annotated[int, Gt(gt=0)]` | `8` | The batch size for GPU inference. It controls how many images are processed together in a single GPU forward pass, which is useful when the dataset batch size is larger than what the GPU can handle. Defaults to 8. |
| `args` | | `''` | extra args |
| `kwargs` | | `''` | extra args |
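
The sample-count note for `keep_candidate_mode` and the role of `gpu_batch_size` can be checked with two small helpers. This is a sketch under stated assumptions: `expected_output_samples` and `iter_gpu_batches` are hypothetical names that are not part of the operator, and the chunking only illustrates the idea of splitting a dataset batch into GPU-sized pieces, not how the operator actually schedules inference.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

_MODES = ("random_any", "similar_one_simhash", "all")


def expected_output_samples(
    num_batches: int,          # N in the parameter note
    batch_size: int,           # b in the parameter note
    caption_num: int,          # M in the parameter note
    keep_candidate_mode: str = "random_any",
    keep_original_sample: bool = True,
) -> int:
    """Total samples after generation, following the documented formulas."""
    if keep_candidate_mode not in _MODES:
        raise ValueError(f"unknown keep_candidate_mode: {keep_candidate_mode!r}")
    n_inputs = num_batches * batch_size
    # 'all' keeps M generated samples per input; the other modes keep one.
    generated_per_input = caption_num if keep_candidate_mode == "all" else 1
    kept_originals = 1 if keep_original_sample else 0
    return (kept_originals + generated_per_input) * n_inputs


def iter_gpu_batches(items: Sequence[T], gpu_batch_size: int = 8) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most gpu_batch_size items."""
    if gpu_batch_size <= 0:
        raise ValueError("gpu_batch_size must be positive")
    for start in range(0, len(items), gpu_batch_size):
        yield list(items[start:start + gpu_batch_size])
```

With $N = 10$, $b = 4$, and $M = 3$, the helper gives 80 samples for `'random_any'` and 160 for `'all'` when the originals are kept, and 40 and 120 respectively when they are dropped.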