🤡 JOOD (CVPR 2025)
June 11, 2025
Official implementation for "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy"
Joonhyun Jeong1,2, Seyun Bae2, Yeonsung Jung2, Jaeryong Hwang3, Eunho Yang2,4
1 NAVER Cloud, ImageVision
2 KAIST
3 Republic of Korea Naval Academy
4 AITRICS
🛠️ Install
- Python >= 3.12.7
- Required libraries are listed in `requirements.txt`.
Dataset
Download the AdvBench-M dataset from [Google Drive].
Arrange the dataset directory as follows:
datasets/
└── AdvBenchM/
    ├── images/
    │   ├── harmful/
    │   ├── harmless/
    │   └── harmless_text/
    ├── prompts/
    │   ├── all_instructions
    │   ├── all_instructions_harmful_annotated
    │   └── eval_all_instructions
    ├── scenario_def.json
    └── scenario_repr.json
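Before launching an attack, it can help to confirm that the download was unpacked into the layout above. The following is a minimal sketch, not part of the JOOD code base: the helper name `find_missing_entries` and the default root `datasets/AdvBenchM` are assumptions for illustration, and the entry list is copied from the directory tree in this section.

```python
from pathlib import Path

# Expected entries, copied from the directory tree in this README
# (paths are relative to the dataset root).
EXPECTED_ENTRIES = [
    "images/harmful",
    "images/harmless",
    "images/harmless_text",
    "prompts/all_instructions",
    "prompts/all_instructions_harmful_annotated",
    "prompts/eval_all_instructions",
    "scenario_def.json",
    "scenario_repr.json",
]


def find_missing_entries(root="datasets/AdvBenchM"):
    """Return the expected entries that do not exist under `root`.

    Hypothetical helper: it only checks for existence, so it works whether
    an entry is a file or a directory.
    """
    root = Path(root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    missing = find_missing_entries()
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

Run it from the repository root; an empty result means every entry from the tree is present.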
How to Attack?
Text Attacks
bash scripts/text_attacks/attack_gpt4.sh
Multimodal Attacks
bash scripts/multimodal_attacks/attack_gpt4.sh
- To attack with typography images, set `--harmless_image_dir` to `datasets/AdvBenchM/images/harmless_text`.
Supported Target Models
We currently support the following target models. Set `target_model` in the script to one of:
- `gpt-4-turbo-2024-04-09`
- `gpt-4o-2024-08-06`
- `o1-2024-12-17`
- `qwenvl2` (Qwen/Qwen2-VL-7B-Instruct)
💡 Note: For OpenAI models, make sure you set the correct `openai_key` in all the scripts.
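A typo in `target_model` or a missing key is cheaper to catch before a run than during it. The sketch below is an illustration only and does not come from the JOOD scripts: the helper `requires_openai_key` is hypothetical, and it assumes that every supported model except `qwenvl2` is served through the OpenAI API.

```python
# Target model identifiers listed in this README.
SUPPORTED_TARGET_MODELS = (
    "gpt-4-turbo-2024-04-09",
    "gpt-4o-2024-08-06",
    "o1-2024-12-17",
    "qwenvl2",
)


def requires_openai_key(target_model):
    """Return True if `target_model` is assumed to need an OpenAI key.

    Hypothetical helper. Raises ValueError for identifiers that are not
    in the supported list above.
    """
    if target_model not in SUPPORTED_TARGET_MODELS:
        supported = ", ".join(SUPPORTED_TARGET_MODELS)
        raise ValueError(
            f"Unsupported target_model {target_model!r}; expected one of: {supported}"
        )
    # Assumption: qwenvl2 (Qwen/Qwen2-VL-7B-Instruct) is the only non-OpenAI target.
    return target_model != "qwenvl2"
```

For example, `requires_openai_key("qwenvl2")` returns `False`, while an unknown identifier raises `ValueError`.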
How to Evaluate?
Inference for ASR
bash scripts/evaluation/eval_llama_guard.sh
- Make sure to modify `eval_datetime` and `aug` to match the settings you used for the attack.
- Set your Hugging Face access token `HF_TOKEN` in the script.
- To run the attack success rate (ASR) evaluation with the Meta-Llama-Guard-2-8B model, you must first agree to its license terms provided by Meta: Meta Llama Guard 2 License Agreement
  - Then set `model_id` to `meta-llama/Meta-Llama-Guard-2-8B` in `models/llm_guard.py`.
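To make the ASR metric concrete, here is a minimal sketch of how a rate could be derived from guard verdicts. It assumes each Llama Guard 2 output starts with a line reading `safe` or `unsafe` (an `unsafe` verdict is followed by a line of violated categories), and it counts one verdict per response. The function name `attack_success_rate` is hypothetical, and the aggregation in `evaluate_metrics.py` may differ.

```python
def attack_success_rate(guard_outputs):
    """Return the fraction of responses judged 'unsafe'.

    Hypothetical helper. `guard_outputs` is a list of raw guard model
    outputs, one string per target-model response.
    """
    if not guard_outputs:
        raise ValueError("guard_outputs must not be empty")

    unsafe_count = 0
    for output in guard_outputs:
        lines = output.strip().splitlines()
        # The verdict is taken from the first non-empty line of the output.
        verdict = lines[0].strip().lower() if lines else ""
        if verdict == "unsafe":
            unsafe_count += 1
    return unsafe_count / len(guard_outputs)
```

With the outputs `["safe", "unsafe\nS1", "unsafe\nS2", "safe"]`, two of four responses are unsafe, so the function returns `0.5`.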
Inference for Harmfulness
bash scripts/evaluation/eval_gpt4.sh
- Make sure to modify `eval_datetime` and `aug` to match the settings you used for the attack.
- Set your OpenAI key `openai_key` in the script.
Report Metrics
python3 evaluate_metrics.py --eval_dir [YOUR_RESULT_DIR]
- Set `--eval_dir` to your result directory (e.g., `datasets/AdvBenchM/outputs/2024_08_28_06_47_30_mixup`).
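The example directory name suggests that result directories follow a `<datetime>_<aug>` pattern, which would tie `--eval_dir` to the `eval_datetime` and `aug` values used in the evaluation scripts. That naming scheme is inferred from the single example above rather than confirmed by the code, so treat the following as a sketch; `split_result_dir` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed pattern: YYYY_MM_DD_HH_MM_SS followed by "_" and the augmentation name.
_RESULT_DIR_PATTERN = re.compile(r"^(\d{4}(?:_\d{2}){5})_(.+)$")


def split_result_dir(path):
    """Split a result directory into (eval_datetime, aug).

    Hypothetical helper. Raises ValueError if the last path component
    does not match the assumed naming pattern.
    """
    name = Path(path).name
    match = _RESULT_DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected result directory name: {name!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    example = "datasets/AdvBenchM/outputs/2024_08_28_06_47_30_mixup"
    eval_datetime, aug = split_result_dir(example)
    print(eval_datetime, aug)
```

If the pattern holds, the two returned values are the ones to enter as `eval_datetime` and `aug` in the evaluation scripts.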
News
- 2025/02/27: Accepted to CVPR'25 :partying_face:
- 2025/06/11: Released the JOOD code.
Citation
If you find that this project helps your research, please consider citing as below:
@inproceedings{jeong2025playing,
title={Playing the Fool: Jailbreaking {LLMs} and Multimodal {LLMs} with Out-of-Distribution Strategy},
author={Jeong, Joonhyun and Bae, Seyun and Jung, Yeonsung and Hwang, Jaeryong and Yang, Eunho},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={29937--29946},
year={2025}
}
Acknowledgements
We gratefully acknowledge the following projects and datasets, which our work builds upon:
- AdvBench: for the design of harmful instruction scenarios.
- AdvBench-M: for the image-based multimodal jailbreak evaluation data.
License
JOOD
Copyright (c) 2025-present NAVER Cloud Corp.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.