CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning (ICCV 2025 Highlight)
October 16, 2025 ยท View on GitHub
Official repository for the paper:
๐ "CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning"
Kuniaki Saito, Donghyun Kim, Kwanyong Park, Atsushi Hashimoto, Yoshitaka Ushiku
International Conference on Computer Vision (ICCV) 2025 โ Highlight Paper
๐ Overview
CaptionSmiths is a controllable image captioning framework that enables smooth and interpretable control over caption properties โ including length, descriptiveness, and word uniqueness โ within a single model.
Unlike existing approaches that rely on discrete prompts or separate models, CaptionSmiths represents these properties as continuous scalar values and interpolates between learned endpoint embeddings (e.g., very short โ very long captions).
This allows fine-grained control and smooth transitions across diverse language patterns.
๐งฉ Key Features
- ๐๏ธ Continuous control over caption properties (length, descriptiveness, uniqueness)
- โ๏ธ Interpolation-based conditioning between learned endpoint vectors
- ๐ Smooth and interpretable caption transitions
๐ Getting Started
Installation
Please follow LLaVA for the environmental set-up.
Dataset Preparation
We employ COCO, Localized Narrative, Docci, Laion-COCO, Detail23K, and Monkey data.
Localized Narrative
Download training splits all data from this link. We employ COCO validation for testing in paper.
DOCCI
Download images and descriptions from this link.
Laion-COCO
Download data from this link. We download 00000 to 00148th tar balls. We employ 266K image-caption pairs.
Detail23K
Detail23K can be downloaded from here. Their corresponding images are COCO2017 train split.
Monkey
The dataset can be downloaded from here. We employ 'densecap' split of Moneydataset.
Example Directory Structure
$IMGPATH/
โโโ openimages/
โ โโโ 00000123.jpg
โ โโโ 00000456.jpg
โ โโโ ...
โโโ densecap_data/
โ โโโ GCC_train_000107306.jpg
โ โโโ GCC_train_000120411.jpg
โ โโโ ...
โโโ coco/
โโโ train2017/
โ โโโ 000000111111.jpg
โ โโโ 000000222222.jpg
โ โโโ ...
โโโ val2017/
โ โโโ 000000333333.jpg
โ โโโ 000000444444.jpg
โ โโโ ...
โโโ ...
Annotation data
Training data preparation.
We follow LLaVA and the annotation data (json) needs to have the following format.
{'id': 'densecap_data/GCC_train_000107306.jpg', 'conversations': [{'from': 'human', 'value': 'What do you see happening in this image?\n<image>'}, {'from': 'gpt', 'value': 'This is a winter landscape with a snow covered mountain and rocks. The sun is shining brightly in the sky, reflecting off the snowy surfaces. The word "alamy" is visible in blue and white text. A small black background with a white line can also be seen. The photo was taken by someone whose name appears in the description. Overall, it\'s a beautiful winter scene by the seaside.'}], 'image': 'monkey_images/densecap_data/GCC_train_000107306.jpg'}
Summarized annotation data we used is available at link, which concatates all datasets. Place validation data in the folder named ann_data.
ann_data/docci_val.json
ann_data/coco_val.json
ann_data/lncoco_val.json
Make merged.json file that concatenates all caption files.
Condition calculation must be done before training. Running the script below will output the data containing condition values in each row. $ANNCONDITION indicates the annotation data (json) where each row has the line with the format described above.
python ./preprocess/attach_condition.py --input_captions merged.json --output_path $ANNCONDITION
Model Preparation
We employ llama-2b-chat model. Download the model from huggingface and place the directory (llama-2b-chat) in this directory.
Training
We follow LLaVA and apply two-stage training.
- Mapping model and conditioning token training.
bash scripts/pretrain.sh $ANNCONDITION $IMGPATH ./pretrain_model_condition --condition_train
- Fine-tune model.
bash scripts/finetune.sh $ANNCONDITION $IMGPATH ./pretrain_model_condition/mm_projector.bin $MODEL_OUTPUT --condition_train
Inference and Evaluation
Evaluation
bash scripts/inference.sh $MODEL_OUTPUT $EVALJSON $IMGPATH $OUTPUT_PATH
bash scripts/eval_result.sh $OUTPUT_PATH $EVALJSON $IMGPATH ./llama-2-7b-chat
Sample inference
bash scripts/inference.sh $MODEL_OUTPUT $IMGPATH --length 0.3 --descript 0.3 --uniqueness 0.1
Reference
@inproceedings{saito2025captionsmiths,
title = {CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning},
author = {Saito, Kuniaki and Kim, Donghyun and Park, Kwanyong and Hashimoto, Atsushi and Ushiku, Yoshitaka},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025},
}