August 28, 2025

ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

ComoRAG Overview

📖 Paper Information

This is the official implementation of the paper:

ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

Citation:

@article{wang2025comorag,
  title={ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning},
  author={Wang, Juyuan and Zhao, Rongchen and Wei, Wei and Wang, Yufeng and Yu, Mo and Zhou, Jie and Xu, Jin and Xu, Liyan},
  journal={arXiv preprint arXiv:2508.10419},
  year={2025}
}

Project Introduction

ComoRAG is a retrieval-augmented generation (RAG) framework designed for long-document and multi-document tasks, including question answering, information extraction, and knowledge graph construction. It integrates large language models, embedding techniques, graph-based reasoning, and evaluation methodologies, making it suitable for both academic research and real-world applications.

🔥 What makes ComoRAG different?

Narrative comprehension on long stories and novels is hard because of intricate plotlines and evolving relations among characters and entities. LLMs still struggle with very long contexts and the cost of processing them, so retrieval remains crucial. However, classic RAG is often stateless and single-step, so it misses the dynamic nature of long-range, interconnected reasoning.

ComoRAG takes a cognition-inspired approach: narrative reasoning is not one-shot, but a dynamic, evolving interplay between new evidence acquisition and consolidation of past knowledge, analogous to memory processes in the brain. 🧠

  • 🔁 Iterative Reasoning Cycles: When hitting an impasse, ComoRAG launches cycles that interact with a dynamic memory workspace.
  • 🕵️ Probing Queries: Each cycle generates targeted probes to explore new evidence paths.
  • 🧳 Global Memory Pool: Newly retrieved evidence is integrated into a shared memory pool to progressively build coherent context for the query.

🚀 Benchmarks & Gains: On four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines, with consistent relative gains of up to 11% over the strongest baseline. It particularly shines on complex queries requiring global comprehension, enabling principled, cognitively motivated, stateful retrieval-based reasoning. 📈

Key idea in one line: Reason → Probe → Retrieve → Consolidate → Resolve. 🧩
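The cycle can be pictured as a small control loop. The sketch below is a toy illustration, not the ComoRAG implementation: `MemoryPool`, `retrieve`, `solve`, and the two `toy_*` functions are hypothetical stand-ins (word overlap instead of an embedding index, hard-coded rules instead of an LLM). It only shows how evidence accumulates in a shared pool across cycles until the query can be resolved.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class MemoryPool:
    """Shared evidence store that persists across reasoning cycles."""
    evidence: list[str] = field(default_factory=list)

    def consolidate(self, passages: list[str]) -> None:
        # Add only passages that are not already in the pool.
        for passage in passages:
            if passage not in self.evidence:
                self.evidence.append(passage)


def content_words(text: str) -> set[str]:
    # Crude keyword extraction: lowercase words longer than four characters.
    cleaned = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in cleaned if len(w) > 4}


def retrieve(query: str, corpus: list[str]) -> list[str]:
    # Stand-in retriever: a passage matches if it shares a content word.
    query_words = content_words(query)
    return [p for p in corpus if query_words & content_words(p)]


def solve(
    question: str,
    corpus: list[str],
    try_answer: Callable[[str, list[str]], Optional[str]],
    make_probes: Callable[[str, list[str]], list[str]],
    max_cycles: int = 5,
):
    memory = MemoryPool()
    memory.consolidate(retrieve(question, corpus))        # initial single-step retrieval
    for cycle in range(max_cycles):
        answer = try_answer(question, memory.evidence)    # Reason
        if answer is not None:
            return answer, cycle, memory.evidence         # Resolve
        for probe in make_probes(question, memory.evidence):  # Probe
            memory.consolidate(retrieve(probe, corpus))       # Retrieve + Consolidate
    return None, max_cycles, memory.evidence


def toy_try_answer(question: str, evidence: list[str]) -> Optional[str]:
    # Pretend LLM: can answer only once a passage about the prince is in memory.
    if any("prince" in p.lower() for p in evidence):
        return "the prince"
    return None


def toy_make_probes(question: str, evidence: list[str]) -> list[str]:
    # Pretend LLM: follows up on an object mentioned in the evidence so far.
    return ["glass slipper"] if any("slipper" in p for p in evidence) else []


corpus = [
    "Cinderella lost a glass slipper at the ball.",
    "The prince has the glass slipper.",
]
question = "Who has what Cinderella lost?"
answer, cycles_used, evidence = solve(question, corpus, toy_try_answer, toy_make_probes)
print(answer, cycles_used, len(evidence))
```

In this toy run the first retrieval only finds the passage about Cinderella, so the first reasoning attempt fails; the probe about the slipper then pulls in the second passage, and the next attempt succeeds. The `max_cycles` parameter plays a role loosely similar to `max_meta_loop_max_iterations` in the configuration shown later, though the mapping is an assumption.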


Key Features ✨

  • 🧠 Support for various LLMs and local/remote embedding models
  • 🕸️ Graph-augmented retrieval and reasoning
  • 🔧 Flexible data preprocessing and chunking
  • 📊 Multiple evaluation metrics (F1, EM, etc.)
  • 🧱 Modular and extensible design

Directory Structure 📂

ComoRAG/
├── main_openai.py                       # Main program using OpenAI API
├── main_vllm.py                         # Main program using local vLLM server
├── script/                              # Data processing and evaluation scripts
│   ├── chunk_doc_corpus.py              # Document chunking script
│   └── eval_qa.py                       # QA evaluation script
├── dataset/                             # Dataset directory
│   └── ...
├── src/comorag/                         # Core code
│   ├── ComoRAG.py                       # Main class and core logic
│   ├── utils/                           # Utility modules
│   ├── embedding_model/                 # Embedding model related
│   ├── llm/                             # LLM related
│   ├── prompts/                         # Prompt templates
│   ├── information_extraction/          # Information extraction
│   └── rerank.py, embedding_store.py    # Other core modules
├── requirements.txt                     # Dependencies
└── README.md                            # Project documentation

Installation & Environment 🛠️

  1. 🐍 Python version: Python 3.10 or above recommended
  2. 📦 Install dependencies:
pip install -r requirements.txt
  3. 🔑 Environment variables: Set your OpenAI API key or local LLM/embedding model paths as needed
  4. ⚙️ GPU (optional but recommended): many dependencies in requirements.txt target CUDA 12.x
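
A quick preflight check can catch the most common setup problems before a long run. This is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and `OPENAI_API_KEY` is the variable name conventionally read by the OpenAI Python SDK, which ComoRAG is assumed (not confirmed) to honor.

```python
import os
import sys


def check_environment(env=os.environ, version=sys.version_info) -> list[str]:
    """Return a list of human-readable setup problems (empty if none)."""
    problems = []
    # The README recommends Python 3.10 or above.
    if tuple(version[:2]) < (3, 10):
        problems.append("Python 3.10 or above is recommended")
    # Assumed variable name; only relevant when using the OpenAI API route.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


for problem in check_environment():
    print("warning:", problem)
```

When only the local vLLM route is used, the API key warning can be ignored.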

Data Preparation & Format 📄

  • 📚 Corpus file corpus.jsonl: Each line is a document, with fields like id, doc_id, title, contents
  • ❓ QA file qas.jsonl: Each line is a question, with fields like id, question, golden_answers
Example:

corpus.jsonl:

{"id": 0, "doc_id": 1, "title": "...", "contents": "..."}

qas.jsonl:

{"id": "1", "question": "...", "golden_answers": ["..."]}
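
Malformed lines are easier to catch before a run than during one. The following is a small validation sketch based only on the field names listed above; `load_jsonl`, `CORPUS_FIELDS`, and `QA_FIELDS` are hypothetical helper names, and the repository may accept additional or optional fields.

```python
import json

# Field names taken from the format description above.
CORPUS_FIELDS = {"id", "doc_id", "title", "contents"}
QA_FIELDS = {"id", "question", "golden_answers"}


def load_jsonl(lines, required_fields):
    """Parse an iterable of JSONL lines and check that each record has the required fields."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = required_fields - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        records.append(record)
    return records


sample_corpus = ['{"id": 0, "doc_id": 1, "title": "...", "contents": "..."}']
sample_qas = ['{"id": "1", "question": "...", "golden_answers": ["..."]}']
print(len(load_jsonl(sample_corpus, CORPUS_FIELDS)), len(load_jsonl(sample_qas, QA_FIELDS)))
```

For real files, pass an open file handle instead of a list, for example `load_jsonl(open(path, encoding="utf-8"), CORPUS_FIELDS)`, where `path` points at your own corpus.jsonl.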

Quick Start ⚡

Method 1: Using OpenAI API (main_openai.py) 🚀

  1. Configure dataset path and model parameters in the script:
config = BaseConfig(
    llm_base_url='https://api.example.com/v1',  # OpenAI API
    llm_name='gpt-4o-mini',
    dataset='cinderella',
    embedding_model_name='/path/to/your/embedding/model',
    embedding_batch_size=32,
    need_cluster=True,  # Enable Semantic/Episodic enhancement
    output_dir='result/cinderella',
    save_dir='outputs/cinderella',
    max_meta_loop_max_iterations=5,
    is_mc=False,  # Multiple-choice?
    max_tokens_ver=2000,  # Veridical layer tokens
    max_tokens_sem=2000,  # Semantic layer tokens
    max_tokens_epi=2000   # Episodic layer tokens
)
  2. Run the main program ▶️:
python main_openai.py

Method 2: Using Local vLLM Server (main_vllm.py) ⚡

1. Start vLLM Server 🚀

First, start the vLLM OpenAI-compatible API server:

# Option A: Using the vllm serve command
vllm serve /path/to/your/model \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95

# Option B: Using python -m vllm.entrypoints.openai.api_server
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/your/model \
  --served-model-name your-model-name \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --dtype auto

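Parameter descriptions:

  • --model: Path or name of the model to serve (with vllm serve, the model is passed as the first positional argument instead)
  • --served-model-name: Name the API exposes for the model; clients must send this name, and it defaults to the model path when omitted
  • --tensor-parallel-size: Number of GPUs to shard the model across
  • --max-model-len: Maximum context length in tokens (prompt plus generated output)
  • --gpu-memory-utilization: Fraction of each GPU's memory that vLLM may use (between 0 and 1)
  • --dtype: Data type for model weights and activations; auto picks it from the model's configuration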

2. Configure main_vllm.py 📝

Modify the configuration in main_vllm.py:

# vLLM server configuration
vllm_base_url = 'http://localhost:8000/v1'  # vLLM server address
# Must match the model name the server exposes: the model path by default,
# or the value passed to --served-model-name
served_model_name = '/path/to/your/model'

config = BaseConfig(
    llm_base_url=vllm_base_url,
    llm_name=served_model_name,
    llm_api_key="your-api-key-here",  # Any value works; a local server needs no real API key
    dataset='cinderella',
    embedding_model_name='/path/to/your/embedding/model',
    embedding_batch_size=4,
    need_cluster=True,
    output_dir='result/cinderella_vllm',
    save_dir='outputs/cinderella_vllm',
    max_meta_loop_max_iterations=5,
    is_mc=False,
    max_tokens_ver=2000,
    max_tokens_sem=2000,
    max_tokens_epi=2000
)

3. Run the Program ▶️

python main_vllm.py

4. Check Server Status 🔍

Ensure the vLLM server is running properly:

# Check if port is occupied
netstat -tlnp | grep 8000

# Test API connection
curl http://localhost:8000/v1/models
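
The same check can be done from Python, which also makes it easy to confirm that the configured `llm_name` is actually served. This is a minimal sketch using only the standard library; the helper names `fetch_models`, `served_model_ids`, and `check_llm_name` are hypothetical, and the response shape assumed here is the OpenAI-style list returned by `/v1/models` (`{"data": [{"id": ...}, ...]}`).

```python
import json
import urllib.request


def fetch_models(base_url: str = "http://localhost:8000/v1", timeout: float = 5.0) -> dict:
    """Query the OpenAI-compatible /models endpoint of a running server."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return json.load(resp)


def served_model_ids(models_payload: dict) -> list[str]:
    """Extract the model ids from a /v1/models response."""
    return [model["id"] for model in models_payload.get("data", [])]


def check_llm_name(llm_name: str, models_payload: dict) -> None:
    """Raise if the configured model name is not among the served models."""
    ids = served_model_ids(models_payload)
    if llm_name not in ids:
        raise ValueError(f"llm_name {llm_name!r} is not served; available: {ids}")


# Offline example with a hand-written payload of the assumed shape.
example_payload = {"object": "list", "data": [{"id": "your-model-name", "object": "model"}]}
check_llm_name("your-model-name", example_payload)
```

With a server running, `check_llm_name(served_model_name, fetch_models())` performs the live check against the address used in main_vllm.py.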

Comparison of Two Methods 📊

| Feature   | OpenAI API (main_openai.py) | vLLM Local (main_vllm.py)           |
|-----------|-----------------------------|-------------------------------------|
| Cost      | Pay per token               | One-time model download             |
| Speed     | Network latency             | Local inference, no network latency |
| Privacy   | Data sent to cloud          | Completely local processing         |
| Setup     | Simple, just API key        | Requires GPU and model files        |
| Stability | Network dependent           | Local control                       |

Results are saved under result/<dataset>/<subset>/.

Main Modules

  • 🏛️ ComoRAG.py: The main class, responsible for retrieval, graph construction, reasoning, and QA
  • 🧰 utils/: Configuration, logging, embedding, clustering, summarization, memory, agents, and other utilities
  • 🧲 embedding_model/: Embedding model adaptation and loading
  • 🤖 llm/: LLM adaptation
  • 🗒️ prompts/: Prompt template management
  • 📦 embedding_store.py: Embedding vector storage and retrieval

Data Processing & Evaluation Scripts 🧪

  • ✂️ script/chunk_doc_corpus.py: Document chunking, supports token/word/sentence/recursive methods
  • 📈 script/eval_qa.py: Automatic QA result evaluation, supports EM, F1, and other metrics

Example usage:

Chunking documents ✂️:

python script/chunk_doc_corpus.py \
  --input_path dataset/<name>/<subset>/corpus.jsonl \
  --output_path dataset/<name>/<subset>/corpus_chunked.jsonl \
  --chunk_by token \
  --chunk_size 512 \
  --tokenizer_name_or_path /path/to/your/tokenizer
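
To make the chunk_by and chunk_size options concrete, here is a minimal sketch of fixed-size chunking by word, the simplest of the listed methods because it needs no tokenizer. It is not the code in script/chunk_doc_corpus.py: the function names and the running-integer scheme for chunk ids are assumptions chosen to match the corpus.jsonl fields shown earlier.

```python
def chunk_by_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def chunk_corpus(docs: list[dict], chunk_size: int) -> list[dict]:
    """Turn whole-document records into chunk records with the same fields."""
    chunks = []
    for doc in docs:
        for piece in chunk_by_words(doc["contents"], chunk_size):
            chunks.append({
                "id": len(chunks),        # running chunk id (assumed scheme)
                "doc_id": doc["doc_id"],  # keeps the link to the source document
                "title": doc["title"],
                "contents": piece,
            })
    return chunks


docs = [{"id": 0, "doc_id": 1, "title": "Cinderella", "contents": "one two three four five"}]
for chunk in chunk_corpus(docs, chunk_size=2):
    print(chunk)
```

Token-based chunking follows the same pattern, with the word list replaced by token ids from the tokenizer given in --tokenizer_name_or_path.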

Evaluate QA results 📊:

python script/eval_qa.py /path/to/result/<dataset>/<subset>

This produces files such as details and results.json.
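
For reference, EM and token-level F1 are commonly computed as in the sketch below, which follows the widely used SQuAD-style normalization (lowercasing, stripping punctuation and English articles) and takes the best score over all golden answers. This is an illustration under that assumption; script/eval_qa.py may normalize or aggregate differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golden_answers: list[str]) -> float:
    """1.0 if the normalized prediction equals any normalized golden answer."""
    return float(any(normalize(prediction) == normalize(g) for g in golden_answers))


def f1_score(prediction: str, golden_answers: list[str]) -> float:
    """Best token-level F1 between the prediction and any golden answer."""
    def single(pred: str, gold: str) -> float:
        pred_tokens = normalize(pred).split()
        gold_tokens = normalize(gold).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    return max((single(prediction, g) for g in golden_answers), default=0.0)


print(exact_match("The Prince!", ["prince"]), f1_score("the young prince", ["prince"]))
```

In the second call the prediction has two tokens after normalization and one of them matches, so precision is 0.5, recall is 1.0, and F1 is 2/3.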


Known Issues & TODO 📝

🔧 Current Issues

  • Remote Embedding Model Tokenizer Issue: With an embedding model deployed remotely through vLLM, the run fails because the model files are missing locally. Support for using a local tokenizer with remote embedding models is still needed.

🚀 Planned Features

  • Support for more embedding model providers (Azure OpenAI, etc.)

Contact & Contribution 🤝

For questions or suggestions, feel free to submit an Issue or PR.


Acknowledgement 🙏

This codebase builds on the HippoRAG repository, which served as its skeleton.

