Paper Chatbot

December 23, 2025 · View on GitHub

This project builds a local RAG chatbot over parsed research paper chunks. It:

  • Generates OpenAI embeddings for paper chunks and stores them in a local ChromaDB.
  • Serves a Gradio web UI to chat with the papers.

The project's code is 100% generated by prompting GPT-5.1-Codex. It was originally designed to build a chatbot for the papers listed in Awesome Graph/Transformer Fraud Detection, but you can use it to build a chatbot over any pre-chunked corpus.

Setup

Use Python 3.11 or 3.12 when creating your virtual environment.

  1. Create a virtual environment and install dependencies.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
  2. Set your OpenAI API key (and optional overrides).
export OPENAI_API_KEY="..."

Optional overrides are listed in .env.example. If you create a .env file in the project root, it will be loaded automatically.
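The project likely uses a library such as python-dotenv for the automatic .env loading; as an illustration of what that loading does, here is a minimal stdlib-only sketch (the helper name `load_env_file` is hypothetical, not part of this repo):

```python
import os

def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file into os.environ.

    Existing environment variables win over file values, matching the
    usual python-dotenv default. Hypothetical helper for illustration.
    """
    loaded = {}
    if not os.path.exists(path):
        return loaded
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip().strip('"')
            loaded[key] = value
            os.environ.setdefault(key, value)  # do not clobber real env vars
    return loaded
```

With this behavior, `export OPENAI_API_KEY=...` in your shell always takes precedence over the value in `.env`.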

The chunked corpus file

Ensure data/papers.json exists before building the Chroma index. This repository does not include a corpus builder script, so generate the file externally or download a corpus of ~250 fraud-detection-related papers parsed by MinerU.
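The repository does not document the papers.json schema, so the record below is a hypothetical sketch of what a pre-chunked corpus entry might look like (the field names `id`, `title`, and `text` are assumptions; check scripts/build_chroma.py for the fields it actually expects):

```python
import json
import os

# Hypothetical schema -- the real field names expected by
# scripts/build_chroma.py may differ.
chunks = [
    {
        "id": "diffgad-0001",  # unique chunk identifier
        "title": "DiffGAD: A Diffusion-based Unsupervised Graph Anomaly Detector",
        "text": "We study diffusion models for unsupervised graph anomaly detection ...",
    },
]

# Write the corpus where the build script looks for it by default.
os.makedirs("data", exist_ok=True)
with open("data/papers.json", "w", encoding="utf-8") as fh:
    json.dump(chunks, fh, ensure_ascii=False, indent=2)
```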

Build the Chroma index

You can either build the Chroma index with the command below, using any embedding model you choose, or download a prebuilt Chroma DB based on OpenAI's text-embedding-3-large. As with papers.json, place the unzipped chroma directory under data/ (i.e., data/chroma).

python scripts/build_chroma.py --corpus data/papers.json --chroma-dir data/chroma

Useful flags:

  • --reset to rebuild the collection from scratch.
  • --skip-existing to avoid re-embedding chunks already in Chroma.
  • --batch-size 100 to tune embedding batches.
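The `--skip-existing` and `--batch-size` flags suggest control flow roughly like the sketch below. The function names are hypothetical (the real logic lives in scripts/build_chroma.py); `embed` and `upsert` stand in for an OpenAI embeddings call and a Chroma collection's upsert, injected so the loop is testable offline:

```python
def batched(items, size):
    """Yield consecutive slices of `items` of length at most `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def build_index(chunks, embed, upsert, existing_ids=frozenset(),
                skip_existing=False, batch_size=100):
    """Embed chunks in batches and upsert them into a vector store.

    Hypothetical sketch: `embed(texts) -> vectors` and `upsert(...)` are
    injected callables, not this repo's actual functions.
    """
    if skip_existing:
        # Mirrors --skip-existing: drop chunks already present in Chroma.
        chunks = [c for c in chunks if c["id"] not in existing_ids]
    total = 0
    for batch in batched(chunks, batch_size):  # mirrors --batch-size
        vectors = embed([c["text"] for c in batch])
        upsert(ids=[c["id"] for c in batch],
               embeddings=vectors,
               documents=[c["text"] for c in batch])
        total += len(batch)
    return total
```

Batching matters because the embeddings API accepts many inputs per request, so fewer, larger requests are cheaper and faster than one call per chunk.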

Run the chatbot

python app.py

Open the local Gradio URL shown in your terminal.
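Under the hood, each chat turn is a retrieve-then-generate step. The sketch below shows that shape with hypothetical names (`answer`, `retrieve`, `complete` are assumptions for illustration; in app.py the retrieval would be a Chroma query and the completion an OpenAI chat call, wired into a Gradio interface):

```python
def answer(question, retrieve, complete, k=4):
    """One minimal RAG turn: fetch top-k passages, stuff them into a
    prompt, and ask the chat model. Hypothetical sketch, not app.py."""
    passages = retrieve(question, k)
    # Number the passages so the model can cite them as [n].
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using only the context below. Cite passages as [n].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return complete(prompt)
```

Grounding the model in numbered retrieved passages is what lets the chatbot answer paper-comparison questions like the example below without hallucinating beyond the corpus.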

Demo

Here is a screenshot of the chatbot using GPT-5.2-Chat:

demo

Below is another QA example on comparing two papers.


user: Can you summarize the key difference between the two papers?

  1. DiffGAD: A Diffusion-based Unsupervised Graph Anomaly Detector
  2. Conditional Diffusion Anomaly Modeling on Graphs

assistant: Based on the provided context, the key difference between the two papers lies in how they use diffusion models for graph anomaly detection—specifically, whether diffusion is used as a data augmentation and reconstruction tool (DiffGAD) or as a direct generative modeling framework for anomaly inference (CGADM).


1. DiffGAD: A Diffusion-based Unsupervised Graph Anomaly Detector

Core idea:

  • Purpose: Transfers diffusion models from generative tasks to unsupervised anomaly detection.
  • Mechanism:
    • Operates in the latent space of the graph.
    • Uses two diffusion models: one for general content and one for common content.
    • The difference between the two captures discriminative content that helps identify anomalies.
    • Employs a content-preservation strategy to maintain information consistency.
  • Detection principle:
    • Nodes that are harder to reconstruct (i.e., with higher reconstruction error) are considered anomalous.
  • Role of diffusion:
    • Diffusion acts as a reconstruction and feature refinement process to improve discriminative representation.
  • Setting:
    • Unsupervised—no labeled anomalies are required.
  • Datasets:
    • Evaluated on six real-world, large-scale datasets.

2. Conditional Diffusion Anomaly Modeling on Graphs (CGADM)

Core idea:

  • Purpose: Introduces a conditional generative diffusion model that directly models the joint distribution of anomalies on graphs.
  • Mechanism:
    • Incorporates a prior-guided diffusion process, where a pre-trained conditional anomaly estimator guides both forward and reverse diffusion.
    • Uses a prior confidence-aware mechanism to adaptively control the number of reverse denoising steps for efficiency.
  • Detection principle:
    • Rather than reconstructing and comparing errors, CGADM directly generates anomaly judgments through joint distribution modeling.
  • Role of diffusion:
    • Diffusion is used as a model-centric generative process, not just for data augmentation or reconstruction.
  • Setting:
    • Focuses on conditional modeling—leveraging prior anomaly estimators for guided diffusion.
  • Datasets:
    • Evaluated on Elliptic, Tolokers, and YelpChi datasets.

3. Key Conceptual Difference

| Aspect | DiffGAD | CGADM |
| --- | --- | --- |
| Paradigm | Reconstruction-based, unsupervised detector | Generative, model-centric conditional detector |
| Diffusion Role | Used to reconstruct and distill discriminative latent features | Used to model the joint distribution of anomalies directly |
| Guidance Mechanism | Dual diffusion models (general vs. common content) | Prior-guided conditional diffusion with adaptive denoising |
| Detection Basis | Reconstruction error and discriminative content | Direct anomaly probability estimation from generative model |
| Supervision | Unsupervised | Conditional (uses a pre-trained anomaly estimator as prior) |
| Goal | Enhance discriminative representation for anomaly detection | Directly generate anomaly judgments through conditional modeling |

In summary:

  • DiffGAD treats diffusion as a latent-space reconstruction and feature discrimination tool for unsupervised anomaly detection.
  • CGADM redefines diffusion as a conditional generative modeling framework that directly infers anomalies by learning their joint distribution with graph structure and features.

Thus, DiffGAD = reconstruction-based detector, while CGADM = generative, model-centric anomaly modeler.