CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning (ICCV 2025 Highlight)

October 16, 2025 ยท View on GitHub

Official repository for the paper:
๐Ÿ“„ "CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning"

Kuniaki Saito, Donghyun Kim, Kwanyong Park, Atsushi Hashimoto, Yoshitaka Ushiku
International Conference on Computer Vision (ICCV) 2025 โ€” Highlight Paper


๐ŸŒŸ Overview

CaptionSmiths is a controllable image captioning framework that enables smooth and interpretable control over caption properties โ€” including length, descriptiveness, and word uniqueness โ€” within a single model.

Unlike existing approaches that rely on discrete prompts or separate models, CaptionSmiths represents these properties as continuous scalar values and interpolates between learned endpoint embeddings (e.g., very short โ†” very long captions).
This allows fine-grained control and smooth transitions across diverse language patterns.


๐Ÿงฉ Key Features

  • ๐ŸŽš๏ธ Continuous control over caption properties (length, descriptiveness, uniqueness)
  • โš™๏ธ Interpolation-based conditioning between learned endpoint vectors
  • ๐Ÿ“Š Smooth and interpretable caption transitions

๐Ÿš€ Getting Started

Installation

Please follow LLaVA for the environmental set-up.


Dataset Preparation

We employ COCO, Localized Narrative, Docci, Laion-COCO, Detail23K, and Monkey data.

Localized Narrative

Download training splits all data from this link. We employ COCO validation for testing in paper.

DOCCI

Download images and descriptions from this link.

Laion-COCO

Download data from this link. We download 00000 to 00148th tar balls. We employ 266K image-caption pairs.

Detail23K

Detail23K can be downloaded from here. Their corresponding images are COCO2017 train split.

Monkey

The dataset can be downloaded from here. We employ 'densecap' split of Moneydataset.

Example Directory Structure

$IMGPATH/
โ”œโ”€โ”€ openimages/
โ”‚ โ”œโ”€โ”€ 00000123.jpg
โ”‚ โ”œโ”€โ”€ 00000456.jpg
โ”‚ โ””โ”€โ”€ ...
โ”œโ”€โ”€ densecap_data/
โ”‚ โ”œโ”€โ”€ GCC_train_000107306.jpg
โ”‚ โ”œโ”€โ”€ GCC_train_000120411.jpg
โ”‚ โ””โ”€โ”€ ...
โ””โ”€โ”€ coco/
โ”œโ”€โ”€ train2017/
โ”‚ โ”œโ”€โ”€ 000000111111.jpg
โ”‚ โ”œโ”€โ”€ 000000222222.jpg
โ”‚ โ””โ”€โ”€ ...
โ”œโ”€โ”€ val2017/
โ”‚ โ”œโ”€โ”€ 000000333333.jpg
โ”‚ โ”œโ”€โ”€ 000000444444.jpg
โ”‚ โ””โ”€โ”€ ...
โ””โ”€โ”€ ...

Annotation data

Training data preparation.

We follow LLaVA and the annotation data (json) needs to have the following format.

{'id': 'densecap_data/GCC_train_000107306.jpg', 'conversations': [{'from': 'human', 'value': 'What do you see happening in this image?\n<image>'}, {'from': 'gpt', 'value': 'This is a winter landscape with a snow covered mountain and rocks. The sun is shining brightly in the sky, reflecting off the snowy surfaces. The word "alamy" is visible in blue and white text. A small black background with a white line can also be seen. The photo was taken by someone whose name appears in the description. Overall, it\'s a beautiful winter scene by the seaside.'}], 'image': 'monkey_images/densecap_data/GCC_train_000107306.jpg'}

Summarized annotation data we used is available at link, which concatates all datasets. Place validation data in the folder named ann_data.

ann_data/docci_val.json
ann_data/coco_val.json
ann_data/lncoco_val.json

Make merged.json file that concatenates all caption files.

Condition calculation must be done before training. Running the script below will output the data containing condition values in each row. $ANNCONDITION indicates the annotation data (json) where each row has the line with the format described above.

python ./preprocess/attach_condition.py --input_captions merged.json --output_path $ANNCONDITION

Model Preparation

We employ llama-2b-chat model. Download the model from huggingface and place the directory (llama-2b-chat) in this directory.


Training

We follow LLaVA and apply two-stage training.

  1. Mapping model and conditioning token training.
bash scripts/pretrain.sh $ANNCONDITION $IMGPATH ./pretrain_model_condition --condition_train
  1. Fine-tune model.
bash scripts/finetune.sh $ANNCONDITION $IMGPATH ./pretrain_model_condition/mm_projector.bin $MODEL_OUTPUT --condition_train

Inference and Evaluation

Evaluation

bash scripts/inference.sh $MODEL_OUTPUT $EVALJSON $IMGPATH $OUTPUT_PATH
bash scripts/eval_result.sh $OUTPUT_PATH $EVALJSON $IMGPATH ./llama-2-7b-chat

Sample inference

bash scripts/inference.sh $MODEL_OUTPUT $IMGPATH --length 0.3 --descript 0.3 --uniqueness 0.1

Reference

@inproceedings{saito2025captionsmiths,
  title     = {CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning},
  author    = {Saito, Kuniaki and Kim, Donghyun and Park, Kwanyong and Hashimoto, Atsushi and Ushiku, Yoshitaka},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025},
}