Visually Dehallucinative Instruction Generation

March 19, 2024 · View on GitHub

(CAP2QA) Visually Dehallucinative Instruction Generation [paper]
Sungguk Cha, Jusung Lee, Younghyun Lee and Cheoljong Yang

See also, (IDK) Visually Dehallucinative Instruction Generation: Know What You Don't Know [paper] [github]

CAP2QA

Image-aligned Sentence Level VQA Data


Details

| Dataset | Avg. #words Question/Answer | #Images | #Questions | Scalable | Image-aligned | Recognition | Description | Reasoning |
|---|---|---|---|---|---|---|---|---|
| DAQUAR | 11.5 / 1.1 (word) | 1,449 | 12,468 | ✗ | ✓ | ✓ | ✗ | ✗ |
| VQAv2 | 6.1 / 1.2 (word) | 200k | 1.1M | ✗ | ✓ | ✓ | ✗ | ✗ |
| OKVQA | 8.1 / 1.3 (word) | 14,031 | 14,055 | ✗ | ✗ | ✓ | ✗ | ✓ |
| LLaVA | 10.7 / 60.7 (sentence) | 80,000 | 221,333 | ✓ | ✗ | ✓ | ✓ | ✓ |
| CAP2QA (Ours) | 7.2 / 5.4 (sentence) | 122,906 | 873,631 | ✓ | ✓ | ✓ | ✓ | ✓ |

Prepare the MSCOCO 2017 images. The original train/val splits are preserved.
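Since the dataset reuses MSCOCO 2017 images, each entry presumably references a COCO image ID that must be resolved to a local file. A minimal sketch, assuming the standard COCO 2017 directory layout (`train2017/`, `val2017/`) and its zero-padded 12-digit file names; the function name and annotation format here are illustrative, not part of the CAP2QA release:

```python
from pathlib import Path


def coco_image_path(image_id: int, split: str = "train2017",
                    root: str = "coco") -> Path:
    """Resolve a COCO 2017 image ID to its on-disk path.

    COCO 2017 names files with the image ID zero-padded to 12 digits,
    e.g. image 139 in the val split is val2017/000000000139.jpg.
    """
    if split not in ("train2017", "val2017"):
        raise ValueError(f"unknown split: {split}")
    return Path(root) / split / f"{image_id:012d}.jpg"


# Hypothetical usage: map a CAP2QA-style record to its image file.
record = {"image_id": 139, "split": "val2017"}  # illustrative format
image_file = coco_image_path(record["image_id"], record["split"])
```

The same ID-to-path convention applies to both splits, so keeping the original train/val directories intact is enough to pair questions with images.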

Citation

If you find CAP2QA useful for your research and applications, please cite using this BibTeX:

@inproceedings{cha2024visually,
      title={Visually Dehallucinative Instruction Generation}, 
      author={Cha, Sungguk and Lee, Jusung and Lee, Younghyun and Yang, Cheoljong},
      booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
      year={2024},
}

Licenses

This work (the generated instructions) uses the COCO-Caption dataset (CC BY-NC-ND license) as the caption source and ChatGPT (see the OpenAI policies, https://openai.com/policies).