RA-DIT: Retrieval-Augmented Dual Instruction Tuning
September 18, 2025
This is the official artifact release of the following paper:
Xi Victoria Lin*, Xilun Chen*, Mingda Chen*, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih. RA-DIT: Retrieval-Augmented Dual Instruction Tuning, ICLR 2024.

Overview
We are releasing the following artifacts from the paper.
- The scripts for processing the fine-tuning datasets into the format used in the paper.
- The DRAGON (Lin et al. 2023) embeddings of the Wiki 2021 index† used for fine-tuning dataset generation and evaluation.
†For the benchmark datasets used in our paper, using the Wiki 2021 index retains the majority of performance benefits, comparable to using the full retrieval index that combines both Wikipedia and Common Crawl data (Appendix E).
Conda Setup
We recommend cloning the Atlas codebase (Izacard et al. 2022) and starting from its conda building instructions outlined here, since we will be using its retriever implementation.
If you plan to use your own retriever and just need to run data generation, you may keep using an existing conda or create a vanilla Python environment:
conda create --name radit-env python=3.8
conda activate radit-env
Lastly, install the RA-DIT dependencies in your conda environment:
pip install -r requirements.txt
Retriever Setup (WIP)
We used the distributed index implementation by Izacard et al. 2022 for passage retrieval†.
†You can serve the passages and embeddings using any efficient vector search library, such as faiss.
Step 1: Clone and install Atlas (Izacard et al. 2022)
git clone https://github.com/facebookresearch/atlas.git
Install the tool following these instructions.
Step 2: Download Wikipedia 2021 passages and embeddings
We use the same Wikipedia 2021 passages as processed by Izacard et al. 2022. You can download the passages following these instructions (choose download key corpora/wiki/enwiki-dec2021).
Download the DRAGON embeddings using the following link.
| Source | Download URL |
|---|---|
| Wiki 2021 + info box | Link |
The passages and embedding vectors are 1:1 mapped.
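Because of this 1:1 mapping, row i of the embedding matrix scores passage i, so a retrieved row index can be used directly to look up its passage. The following is a minimal pure-Python sketch of that idea, not the Atlas index or the released file format: the toy passages, the 3-dimensional vectors, the `id`/`text` field names, and the helper names `check_alignment` and `search` are all hypothetical, and inner-product scoring is an assumption. A real setup would load the downloaded files and use a vector search library instead of this exhaustive loop.

```python
from typing import List, Sequence, Tuple


def check_alignment(passages: Sequence[dict], embeddings: Sequence[Sequence[float]]) -> None:
    """Raise if the passage list and the embedding rows disagree in length."""
    if len(passages) != len(embeddings):
        raise ValueError(
            f"{len(passages)} passages but {len(embeddings)} embedding rows"
        )


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def search(query_vec, passages, embeddings, k: int = 2) -> List[Tuple[float, dict]]:
    """Exhaustive inner-product search; row i of `embeddings` scores passage i."""
    check_alignment(passages, embeddings)
    scored = [(dot(query_vec, emb), idx) for idx, emb in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, passages[idx]) for score, idx in scored[:k]]


# Toy data standing in for the real corpus (hypothetical fields and values).
passages = [
    {"id": "p0", "text": "Paris is the capital of France."},
    {"id": "p1", "text": "The Nile flows through Egypt."},
    {"id": "p2", "text": "Mount Everest is in the Himalayas."},
]
embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.9],
]

top = search([1.0, 0.0, 0.0], passages, embeddings, k=2)
for score, passage in top:
    print(f"{score:.2f}  {passage['id']}  {passage['text']}")
```

Running the length check before searching is a cheap way to catch a truncated download or a mismatched passage file, since a misaligned index would silently return the wrong text.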
Step 3: Launch and test the retriever index
WIP
Dataset Preparation
We recommend creating a root directory on your disk to save all the intermediate and final processed datasets, e.g.
export DATA_ROOT=/tmp/radit_datasets/
mkdir -p "$DATA_ROOT"
Step 1: Download Hugging Face Datasets
All our training and evaluation task datasets can be downloaded from Hugging Face. To download them, run:
python3 src/data/download_datasets.py --root_dump_dir $DATA_ROOT/raw
Step 2: Prepare Language Model Fine-tuning Data
- Run offline retrieval
PYTHONPATH=. python3 src/data/lm_ft/prepare_lm_ft_datasets.py --input_dir $DATA_ROOT/raw --save_dir $DATA_ROOT/raw.ra --action retrieval
- Format and save retrieval-augmented (RA) instruction-tuning (IT) data
PYTHONPATH=. python3 src/data/lm_ft/prepare_lm_ft_datasets.py --input_dir $DATA_ROOT/raw --save_dir $DATA_ROOT/ft.ra --n_support 3 --action format_data
- Format and save instruction-tuning (IT) data with no retrieval
PYTHONPATH=. python3 src/data/lm_ft/prepare_lm_ft_datasets.py --input_dir $DATA_ROOT/raw --save_dir $DATA_ROOT/ft --action format_data --ignore_retrieved_passages
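To illustrate the difference between the two formatting modes above, here is a hypothetical sketch, not the logic of `prepare_lm_ft_datasets.py`. It assumes that retrieval-augmented formatting emits one example per retrieved passage, capped at `n_support`, with the passage prepended as a "Background:" prefix, and that the no-retrieval mode keeps the instruction unchanged. The template, the `prompt`/`target` field names, and both function names are assumptions; consult the script for the actual behavior.

```python
from typing import Dict, List, Optional


def format_example(instruction: str, answer: str, passage: Optional[str] = None) -> Dict[str, str]:
    """Build one prompt/target pair, optionally prefixed with a background passage."""
    if passage is None:
        prompt = instruction
    else:
        # Hypothetical template: passage first, then the original instruction.
        prompt = f"Background: {passage}\n\n{instruction}"
    return {"prompt": prompt, "target": answer}


def format_with_retrieval(
    instruction: str, answer: str, retrieved: List[str], n_support: int = 3
) -> List[Dict[str, str]]:
    """Emit one example per retrieved passage, using at most `n_support` passages."""
    return [format_example(instruction, answer, p) for p in retrieved[:n_support]]


# Toy inputs (hypothetical); a real run would read the offline retrieval output.
retrieved = ["passage A", "passage B", "passage C", "passage D"]
ra_examples = format_with_retrieval("Q: Who wrote Hamlet?", "Shakespeare", retrieved, n_support=3)
plain_example = format_example("Q: Who wrote Hamlet?", "Shakespeare")

print(len(ra_examples))
print(ra_examples[0]["prompt"])
print(plain_example["prompt"])
```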
Step 3: Prepare Retriever Fine-tuning Data
PYTHONPATH=. python3 src/data/retriever_ft/prepare_retriever_ft_datasets.py --input-dir $DATA_ROOT/raw --save-dir $DATA_ROOT/retriever_ft/lsr_score_saving_input/ --task-list retriever_train
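The commands in Steps 1 to 3 read from and write to several subdirectories of `$DATA_ROOT`. As a convenience, the sketch below collects those paths in one place and reports which ones exist, which can help confirm that each step has run. The helper name `radit_layout` and the dictionary keys are hypothetical; only the directory names come from the commands above, and the fallback root mirrors the example `export` line.

```python
import os
from pathlib import Path


def radit_layout(data_root: str) -> dict:
    """Map a short label to each directory used by the dataset preparation steps."""
    root = Path(data_root)
    return {
        "raw": root / "raw",            # Step 1: downloaded Hugging Face datasets
        "raw_ra": root / "raw.ra",      # Step 2: offline retrieval output
        "ft_ra": root / "ft.ra",        # Step 2: retrieval-augmented IT data
        "ft": root / "ft",              # Step 2: IT data with no retrieval
        "retriever_ft": root / "retriever_ft" / "lsr_score_saving_input",  # Step 3
    }


# Fall back to the example root if DATA_ROOT is not set in the environment.
data_root = os.environ.get("DATA_ROOT", "/tmp/radit_datasets/")
layout = radit_layout(data_root)
for name, path in layout.items():
    status = "exists" if path.exists() else "missing"
    print(f"{name:<13} {path}  [{status}]")
```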
Citation
If you find the resources in this repository helpful, please cite:
@inproceedings{DBLP:conf/iclr/Lin0CSL00KSLZY24,
author = {Xi Victoria Lin and
Xilun Chen and
Mingda Chen and
Weijia Shi and
Maria Lomeli and
Richard James and
Pedro Rodriguez and
Jacob Kahn and
Gergely Szilvasy and
Mike Lewis and
Luke Zettlemoyer and
Wen{-}tau Yih},
title = {{RA-DIT:} Retrieval-Augmented Dual Instruction Tuning},
booktitle = {The Twelfth International Conference on Learning Representations,
{ICLR} 2024, Vienna, Austria, May 7-11, 2024},
publisher = {OpenReview.net},
year = {2024},
url = {https://openreview.net/forum?id=22OTbutug9},
timestamp = {Wed, 07 Aug 2024 17:11:53 +0200},
biburl = {https://dblp.org/rec/conf/iclr/Lin0CSL00KSLZY24.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
License
The code in this repository is licensed under CC-BY-NC.