Logic-RAG
May 16, 2025
This is the implementation of the paper "Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding" (accepted to ICRA 2025). [Link]
Test Logic-RAG on Google Colab:
Overview
Logic-RAG is a novel Retrieval-Augmented Generation framework that enhances Large Multimodal Models' (LMMs) spatial reasoning capabilities in autonomous driving contexts. By constructing a dynamic knowledge base of object-object relationships in first-order logic (FOL), Logic-RAG significantly improves the accuracy of visual-spatial queries in driving scenarios.
This repository contains the complete implementation of the Logic-RAG framework, including the perception module, query-to-logic embedder, and logical inference engine. We provide inference scripts for our proposed framework along with baseline evaluation scripts for GPT-4V and Claude 3.5 models to facilitate comparison.
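To make the idea of a first-order-logic knowledge base concrete, the following minimal sketch stores object-object relationships as ground facts and answers a query with variables by pattern matching. The predicate names (`Car`, `Pedestrian`, `LeftOf`, `InFrontOf`), the object identifiers, and the helper functions are hypothetical illustrations; they are not taken from the Logic-RAG code base, which defines its own predicates and inference engine.

```python
# Hypothetical ground facts, each stored as a (predicate, arg1, ...) tuple.
FACTS = {
    ("Car", "obj1"),
    ("Pedestrian", "obj2"),
    ("Car", "obj3"),
    ("LeftOf", "obj2", "obj1"),
    ("InFrontOf", "obj3", "obj1"),
}


def is_variable(term):
    """Terms that start with '?' are treated as variables."""
    return term.startswith("?")


def match(atom, fact, binding):
    """Extend `binding` so that `atom` equals `fact`; return None on failure."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    new_binding = dict(binding)
    for term, value in zip(atom[1:], fact[1:]):
        if is_variable(term):
            # Bind the variable, or reject if it is already bound differently.
            if new_binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new_binding


def query(atoms, facts=FACTS):
    """Return every variable binding that satisfies a conjunction of atoms."""
    bindings = [{}]
    for atom in atoms:
        bindings = [
            extended
            for binding in bindings
            for fact in facts
            if (extended := match(atom, fact, binding)) is not None
        ]
    return bindings


# "Is there a pedestrian to the left of a car?"
answers = query([("Pedestrian", "?p"), ("Car", "?c"), ("LeftOf", "?p", "?c")])
print(answers)
```

A non-empty result means the query holds, and each returned dictionary names the objects that satisfy it.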
Logic-RAG framework for visual-spatial reasoning:

Installation
Requirements
- Python 3.9
- PyTorch 2.4.1
- CUDA-compatible GPU
Setup Environment
We provide a Conda environment file for easy setup:
# Clone the repository
git clone https://github.com/Imran2205/LogicRAG.git
cd LogicRAG
# Create and activate conda environment
conda env create -f environment.yaml
conda activate lrag
After activating the environment, compile the custom CUDA operators required by the Mask2Former pixel decoder:
cd perception_module/segmentation/Mask2Former-mod/modeling/pixel_decoder/ops
sh make.sh
Download Pre-trained Models and Precomputed Knowledge Base
Download the precomputed knowledge base for KITTI and the model weights from the following link: Pre-trained Models & KB (Google Drive)
Extract the downloaded file to the root of this repository.
Alternatively, you can download and extract the file using the following commands:
gdown 14vRnzCDD9rw63Os3w-Ii4ZF_352y__km
unzip LogicRAG_Data.zip
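After extraction, a quick check that the expected files are in place can save a failed run later. The sketch below only looks for the paths that the commands in this README refer to; the helper name `missing_paths` is an assumption, and the archive may contain additional files that are not listed here.

```python
from pathlib import Path

# Paths referenced by the commands in this README, relative to the repository root.
EXPECTED_PATHS = [
    "LogicRAG_Data/precomputed_knowledge_base/kb_out_kitti",
    "LogicRAG_Data/tracker_trajectories/track_out_kitti",
    "LogicRAG_Data/model_weights/Mask2Former/model_final_064788.pkl",
    "LogicRAG_Data/model_weights/PixelFormer/kitti.pth",
    "LogicRAG_Data/model_weights/CoTracker3/scaled_offline.pth",
    "LogicRAG_Data/model_weights/TypeDetector/best_type_model.pth",
    "LogicRAG_Data/model_weights/ColorDetector/best_color_model.pth",
]


def missing_paths(repo_root="."):
    """Return the expected paths that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing files or directories:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected files and directories were found.")
```

Run it from the repository root, or pass the root directory to `missing_paths`.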
Download KITTI Dataset
Download the left color images of the KITTI object tracking dataset from the KITTI website.
Test Logic-RAG Using Precomputed Data
To run inference using the precomputed knowledge base and tracker trajectories, run:
cd kb_framework/kb_inference
python inference_in_kb.py \
--csv ../kitti_questions/kitti_que.csv \
--fol_trans_csv ./translated_queries/question_query_kitti_llama33.csv \
--kb_dir ../../LogicRAG_Data/precomputed_knowledge_base/kb_out_kitti \
--tracker_dir ../../LogicRAG_Data/tracker_trajectories/track_out_kitti \
--output logic_rag_response.csv
Usage
The Logic-RAG pipeline consists of several components that need to be run sequentially to generate the knowledge base and perform inference.
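As an overview of the order in which the stages run, the sketch below lists each stage with its working directory and entry point, as given in the sections that follow. The `PIPELINE` list and the `describe_pipeline` helper are illustrative only and do not exist in the repository; the sketch prints the order and does not execute anything.

```python
# Stage order, working directories, and entry points as given in this README.
# The tuple layout and the helper name are illustrative assumptions.
PIPELINE = [
    ("1.1 Segmentation", "perception_module/segmentation/Mask2Former-mod", "demo/demo.py"),
    ("1.2 Depth Estimation", "perception_module/depth/PixelFormer-mod", "pixelformer/test_lrag.py"),
    ("1.3 Optical Flow Estimation", "perception_module/optical_flow/co-tracker", "demo_frame.py"),
    ("1.4 Key Point Extraction", "perception_module/tracking", "key_point_extractor.py"),
    ("1.5 Object Instance Tracking", "perception_module/tracking", "new_tracker.py"),
    ("2. Knowledge Base Generation", "kb_framework/kb_generator", "final_KB_gen.py"),
    ("3. Query-to-Logic Translation", "kb_framework/kb_inference", "fol_translate_llama.ipynb"),
    ("4. Inference", "kb_framework/kb_inference", "inference_in_kb.py"),
]


def describe_pipeline(stages=PIPELINE):
    """Return one human-readable line per stage, in execution order."""
    return [f"{name}: {workdir}/{entry}" for name, workdir, entry in stages]


for line in describe_pipeline():
    print(line)
```

In the commands below, stages 1.1 to 1.4 write into `track_in_kitti_test`, stage 1.5 writes trajectories to `track_out_kitti_test`, and stage 2 writes the knowledge base to `kb_out_kitti_test`, which stage 4 reads together with the trajectories.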
1. Perception Module
1.1 Segmentation (Mask2Former)
Generate semantic segmentation masks for scene understanding using the following commands:
cd perception_module/segmentation/Mask2Former-mod
mkdir ../../../track_in_kitti_test/
python demo/demo.py --config-file ./configs/cityscapes/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_90k.yaml \
--input /data/datasets/KITTI/MOT/testing/image_02/*/*.png \
--output ../../../track_in_kitti_test/ \
--opts MODEL.WEIGHTS ../../../LogicRAG_Data/model_weights/Mask2Former/model_final_064788.pkl
1.2 Depth Estimation (PixelFormer)
To generate depth maps for the scene, run:
cd perception_module/depth/PixelFormer-mod
python pixelformer/test_lrag.py configs/arguments_test_kitti_depth.txt
The configuration file should contain:
--model_name pixelformer_kittieigen
--encoder large07
--dataset kitti
--input_height 352
--input_width 1216
--max_depth 80
--do_kb_crop
--data_path /data/datasets/KITTI/MOT/testing
--filenames_file ./data_splits/kitti_lrag_test.txt
--checkpoint_path ../../../LogicRAG_Data/model_weights/PixelFormer/kitti.pth
--output ../../../track_in_kitti_test/
Replace the data_path value with the actual path to your KITTI MOT test dataset.
Note: You can either use ./data_splits/kitti_lrag_test.txt as filenames_file without modification, or ensure that your filenames_file contains lines in one of the following two formats:
image_02/0000/000000.png
image_02/0000/000001.png
or:
image_02/0000/000000.png image_02/0000/000000.png 1000.0
image_02/0000/000001.png image_02/0000/000001.png 1000.0
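If you need to build your own filenames_file, the single-column format can be generated from the dataset directory. The sketch below assumes the layout `<data_path>/image_02/<sequence>/<frame>.png`, which matches the paths used in the commands in this README; the function names are hypothetical and not part of the repository.

```python
from pathlib import Path


def build_filenames_list(data_path, camera="image_02"):
    """Return sorted '<camera>/<sequence>/<frame>.png' entries under data_path."""
    root = Path(data_path)
    return sorted(
        p.relative_to(root).as_posix()
        for p in (root / camera).glob("*/*.png")
    )


def write_filenames_file(data_path, output_file):
    """Write one relative image path per line and return the number of lines."""
    lines = build_filenames_list(data_path)
    Path(output_file).write_text("".join(line + "\n" for line in lines))
    return len(lines)
```

For example, `write_filenames_file("/data/datasets/KITTI/MOT/testing", "./data_splits/my_split.txt")` would list every frame of every sequence; `my_split.txt` is a placeholder name.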
1.3 Optical Flow Estimation (CoTracker3)
Run the following commands to generate optical flow data:
cd perception_module/optical_flow/co-tracker
python demo_frame.py \
--dataset_root /data/datasets/KITTI/MOT/testing/image_02 \
--output_dir ../../../track_in_kitti_test/ \
--checkpoint ../../../LogicRAG_Data/model_weights/CoTracker3/scaled_offline.pth \
--sample_interval 5 \
--dataset KITTI
1.4 Key Point Extraction
Extract static key points to serve as anchors for multilateration during tracking:
cd perception_module/tracking
python key_point_extractor.py \
--data_dir ../../track_in_kitti_test/ \
--output_dir ../../track_in_kitti_test/ \
--distance_threshold 0.8 \
--sample_interval 20 \
--max_depth 50 \
--dataset KITTI
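For readers unfamiliar with multilateration: it recovers an unknown position from its distances to several anchors whose positions are known. The sketch below shows the general technique in 2D using a linearized least-squares solve. It is not the repository's implementation; the function name and the example coordinates are assumptions chosen for illustration.

```python
import math


def multilaterate_2d(anchors, distances):
    """Estimate (x, y) from distances to at least 3 known, non-collinear anchors.

    Subtracting the first circle equation from the others gives a linear
    system in x and y, which is solved here via the 2x2 normal equations.
    """
    if len(anchors) < 3 or len(anchors) != len(distances):
        raise ValueError("need at least 3 anchors and one distance per anchor")

    (x0, y0), d0 = anchors[0], distances[0]
    s11 = s12 = s22 = t1 = t2 = 0.0
    for (xi, yi), di in zip(anchors[1:], distances[1:]):
        # Row of the linear system: a1 * x + a2 * y = b
        a1 = 2.0 * (xi - x0)
        a2 = 2.0 * (yi - y0)
        b = xi**2 - x0**2 + yi**2 - y0**2 - di**2 + d0**2
        s11 += a1 * a1
        s12 += a1 * a2
        s22 += a2 * a2
        t1 += a1 * b
        t2 += a2 * b

    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; the position is not unique")
    x = (s22 * t1 - s12 * t2) / det
    y = (s11 * t2 - s12 * t1) / det
    return x, y


# Three static anchors and the distances measured from an unknown point.
anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_point = (3.0, 4.0)
distances = [math.dist(a, true_point) for a in anchors]
print(multilaterate_2d(anchors, distances))
```

With noise-free distances the estimate matches the true point up to floating-point error; with noisy distances, additional anchors improve the least-squares estimate.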
1.5 Object Instance Tracking
Track objects across frames and save the trajectories to ../../track_out_kitti_test:
cd perception_module/tracking
python new_tracker.py \
--data_dir ../../track_in_kitti_test/ \
--output_dir ../../track_out_kitti_test \
--type_model_path ../../LogicRAG_Data/model_weights/TypeDetector/best_type_model.pth \
--color_model_path ../../LogicRAG_Data/model_weights/ColorDetector/best_color_model.pth \
--num_processes 30 \
--rgb_base_dir /data/datasets/KITTI/MOT/testing/image_02 \
--dataset KITTI
2. Knowledge Base Generation
Generate the FOL knowledge base from tracking data and save to ../../kb_out_kitti_test:
cd kb_framework/kb_generator
python final_KB_gen.py \
--tracker_dir ../../track_out_kitti_test \
--output_dir ../../kb_out_kitti_test \
--logic_pad ../LogicPad/LogicPad.yaml \
--window_size 10 \
--dataset KITTI \
--num_processes 30
3. Query-to-Logic Translation
Use the notebook kb_framework/kb_inference/fol_translate_llama.ipynb to translate natural-language questions into first-order logic. The translations are saved to a CSV file (e.g., ./kb_framework/kb_inference/translated_queries/question_query_kitti_llama33.csv).
Note: We provide a precomputed translation CSV file so that Llama 3 does not need to be loaded each time inference is run.
4. Inference
Run inference using the generated knowledge base:
cd kb_framework/kb_inference
python inference_in_kb.py \
--csv ../kitti_questions/kitti_que.csv \
--fol_trans_csv ./translated_queries/question_query_kitti_llama33.csv \
--kb_dir ../../kb_out_kitti_test \
--tracker_dir ../../track_out_kitti_test \
--output logic_rag_response.csv
To run inference using the precomputed knowledge base and tracker trajectories instead, run:
cd kb_framework/kb_inference
python inference_in_kb.py \
--csv ../kitti_questions/kitti_que.csv \
--fol_trans_csv ./translated_queries/question_query_kitti_llama33.csv \
--kb_dir ../../LogicRAG_Data/precomputed_knowledge_base/kb_out_kitti \
--tracker_dir ../../LogicRAG_Data/tracker_trajectories/track_out_kitti \
--output logic_rag_response.csv
Baseline Models (Optional)
GPT-4V
Generate responses using GPT-4V for comparison:
cd baselines
python get_gpt4_response.py \
--csv /path/to/kitti_questions/kitti_que.csv \
--frames_dir /path/to/kitti/MOT/testing/image_02 \
--output gpt4v_response.csv \
--log_dir ./logs
Claude 3.5
Generate responses using Claude 3.5 for comparison:
cd baselines
python get_claude_response.py \
--csv /path/to/kitti_questions/kitti_que.csv \
--frames_dir /path/to/kitti/MOT/testing/image_02 \
--output claude_response.csv \
--log_dir ./logs
Dataset
We evaluate Logic-RAG on both synthetic and real-world driving scenes:
- Synthetic Driving Scenes: Generated using a controlled environment to test specific spatial reasoning capabilities.
- Real-World Driving Scenes: Derived from the KITTI Multi-Object Tracking dataset, which provides challenging real-world scenarios.
Results
Logic-RAG significantly improves spatial reasoning accuracy:
- Baseline LMMs: GPT-4V and Claude 3.5 achieved only 55% accuracy on synthetic scenes and <75% on real-world scenes.
- Logic-RAG augmented: Increased accuracies to >80% on synthetic and >90% on real-world scenes.
- Ablation study: Even without logical inference, fact-based context from Logic-RAG improved accuracy by 15%.
Citation
If you find our work useful in your research, please consider citing:
@inproceedings{kabir2025logicrag,
title={Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding},
author={Kabir, Imran and Reza, Md Alimoor and Billah, Syed},
booktitle={2025 IEEE International Conference on Robotics and Automation (ICRA)},
year={2025},
organization={IEEE}
}
License
Acknowledgements
We thank the authors of the following projects whose code we have adapted or used: