CellAgent: LLM-Driven Multi-Agent Framework for Automated scRNA-Seq Data Analysis

June 11, 2026

CellAgent is a multi-agent framework powered by Large Language Models (LLMs) that automates single-cell RNA sequencing (scRNA-seq) data analysis. It decomposes complex analysis workflows into manageable steps through three primary agent roles: Planner, Executor, and Evaluator.

Overview

CellAgent leverages the capabilities of LLMs to understand biological analysis requirements and automatically orchestrate the execution of appropriate data processing and analysis steps. It provides an interactive, iterative approach to scRNA-seq analysis with built-in quality evaluation and self-optimization mechanisms.

Architecture

Core Components

The framework consists of three main agent roles:

1. Planner (src/planner.py)

  • Decomposes user-provided analysis tasks into structured, executable steps
  • Understands the biological context and data characteristics
  • Generates detailed task plans in JSON format
  • Maps high-level biological requirements to specific analytical procedures
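
Because the Planner emits its plan as JSON, the reply has to be parsed before any step can run. The sketch below shows one way this could look. The plan schema (a top-level `steps` list with a `name` per step) and the `parse_plan` helper are assumptions for illustration, not the format CellAgent actually uses.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable


def parse_plan(reply: str) -> list:
    """Extract the list of steps from an LLM reply that contains a JSON plan."""
    # Prefer the content of a fenced block if the model wrapped its JSON in one.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, re.DOTALL)
    payload = fenced.group(1) if fenced else reply
    # Fall back to the outermost {...} span to skip any surrounding chatter.
    start, end = payload.find("{"), payload.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    plan = json.loads(payload[start:end + 1])
    steps = plan.get("steps")  # "steps" is a hypothetical key
    if not isinstance(steps, list) or not steps:
        raise ValueError("plan must contain a non-empty 'steps' list")
    return steps


# Hypothetical LLM reply with the plan wrapped in a Markdown fence.
reply = (
    "Here is the plan:\n" + FENCE + "json\n"
    '{"steps": [{"name": "quality_control"}, {"name": "clustering"}]}\n' + FENCE
)
step_names = [step["name"] for step in parse_plan(reply)]
print(step_names)
```

Rejecting an empty or malformed plan early keeps a bad LLM reply from reaching the Executor.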

2. Executor (src/executor.py)

  • Executes each planned step with precision
  • Comprises two sub-components:
    • Tool Selector (src/tool_selector.py): Intelligently selects appropriate analysis tools based on task requirements
    • Code Programmer (src/code_programmer.py): Generates executable Python code for bioinformatics tasks
  • Manages iterative optimization with automatic retry mechanisms

3. Evaluator (src/evaluator.py)

  • Assesses the quality and correctness of generated code execution results
  • Provides expert-level evaluation based on biological principles
  • Generates improvement suggestions for failed or suboptimal analyses
  • Determines whether results meet user requirements

Supporting Components

  • Global Memory (src/memory.py): Maintains analysis context and code history across all steps
  • Code Sandbox (src/code_sandbox.py): Executes generated code in a Jupyter Notebook environment with safety isolation
  • Tool Registry (src/tools/tool_registry.py): Manages available bioinformatics tools and their documentation

Features

Key Capabilities

  • Automated Task Planning: Decomposes complex scRNA-seq analysis into logical steps
  • Intelligent Tool Selection: Automatically chooses appropriate tools for each analytical task
  • Automatic Code Generation: Generates Python code tailored to specific analysis requirements
  • Iterative Self-Optimization: Retries a failed or suboptimal step with revised code (default: 2 attempts per step, 3 for batch effect correction)
  • Quality Evaluation: Expert-level assessment of results with improvement recommendations
  • Jupyter Integration: All analysis code and results are organized in Jupyter Notebooks
  • Context Memory: Maintains global analysis context to ensure coherent multi-step workflows
  • Error Handling: Graceful error management with automatic recovery mechanisms

Installation

Prerequisites

  • Python 3.8+
  • Ollama (for local LLM) or OpenAI API key (for GPT-4)
  • Jupyter Notebook
  • Bioinformatics analysis libraries (scanpy, etc.)

Dependencies

pip install langchain langchain-community langchain-openai
pip install scanpy pandas numpy scipy scikit-learn
pip install jupyter notebook nbconvert

Setup

  1. Clone the repository:
git clone https://github.com/liu-shiqiang/CellAgent.git
cd CellAgent
  2. Install required packages:
pip install -r requirements.txt
  3. Set up LLM configuration:

Option A: Local LLM (Ollama)

# Install Ollama from https://ollama.ai
# Pull the required model
ollama pull llama3.1
# Start Ollama server
ollama serve

Option B: OpenAI API

Update the LLM initialization in main.py with your API key.
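
Rather than hard-coding the key in main.py, one option is to read it from the environment. This is a minimal sketch, not the project's own setup code: `require_openai_key` and `build_openai_llm` are hypothetical helpers, and it relies on `ChatOpenAI` picking up `OPENAI_API_KEY` from the environment.

```python
import os


def require_openai_key(env=None) -> str:
    """Return the OpenAI API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running main.py")
    return key


def build_openai_llm():
    """Create the chat model (hypothetical helper; langchain-openai is imported lazily)."""
    from langchain_openai import ChatOpenAI  # installed via `pip install langchain-openai`

    require_openai_key()  # fail early with a readable message
    # ChatOpenAI reads OPENAI_API_KEY from the environment when no key is passed.
    return ChatOpenAI(model="gpt-4", temperature=0)
```

With this in place, `export OPENAI_API_KEY=...` in the shell keeps the key out of version control.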

Usage

Quick Start

python main.py

The program will:

  1. Prompt you to enter your scRNA-seq analysis task
  2. Request the path to your scRNA-seq data file (H5AD format)
  3. Automatically:
    • Load and analyze your data
    • Generate an analysis plan
    • Execute each step with automatic optimization
    • Save results to a Jupyter Notebook

Example Workflow

# Step 1: User Input
# Task: "Perform quality control and cell type annotation on scRNA-seq data"
# Data path: "/path/to/data.h5ad"

# Step 2: System automatically:
# - Plans: [QC step, Normalization, Dimensionality reduction, Clustering, Annotation, ...]
# - Executes each step with evaluation and optimization
# - Saves all code and visualizations to analysis.ipynb
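
For orientation, the kind of code such a plan could translate into is sketched below with standard scanpy calls. This is an illustrative hand-written pipeline, not output generated by CellAgent. The thresholds (`min_genes=200`, `min_cells=3`, 2000 highly variable genes) are common tutorial defaults chosen as assumptions, and the `MT-` prefix assumes human gene symbols.

```python
def run_basic_pipeline(h5ad_path: str):
    """QC, normalization, PCA, neighbors, UMAP, and Leiden clustering with scanpy."""
    import scanpy as sc  # imported lazily; requires `pip install scanpy`

    adata = sc.read_h5ad(h5ad_path)

    # Quality control: flag mitochondrial genes (human "MT-" prefix assumed).
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(
        adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
    )
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)

    # Normalization: scale counts per cell, then log-transform.
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Dimensionality reduction and clustering.
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.tl.pca(adata)
    sc.pp.neighbors(adata)
    sc.tl.umap(adata)
    sc.tl.leiden(adata, key_added="leiden")  # needs the optional leidenalg package
    return adata
```

Cell type annotation is left out because it depends on marker genes or reference data specific to the dataset.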

Data Format

Supported Input Format

  • H5AD files (.h5ad): AnnData objects compatible with scanpy
  • Data should contain gene expression matrix and relevant metadata
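
Checking the input path before any analysis starts avoids a late failure. The helper below is a hypothetical sketch using only the standard library. It checks the extension and that the file exists, and does not inspect the file contents.

```python
from pathlib import Path


def resolve_h5ad_path(raw_path: str) -> Path:
    """Return an absolute path to an existing .h5ad file, or raise a clear error."""
    path = Path(raw_path).expanduser().resolve()
    if path.suffix.lower() != ".h5ad":
        raise ValueError(f"expected an .h5ad file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"data file not found: {path}")
    return path


try:
    # Placeholder path for illustration only.
    print(resolve_h5ad_path("~/data/example.h5ad"))
except (ValueError, FileNotFoundError) as exc:
    print(f"Cannot use input: {exc}")
```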

Output

  • Jupyter Notebook (examples/notebooks/analysis.ipynb): Complete analysis workflow with code, visualizations, and results
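
A notebook file is plain JSON, so appending generated code to it can be sketched with the standard library alone. The example below only writes cells and does not execute them. `append_code_cell` is a hypothetical helper, and how CellAgent's own sandbox writes and runs cells is not shown here.

```python
import json
import tempfile
import uuid
from pathlib import Path


def append_code_cell(notebook_path: str, source: str) -> int:
    """Append a code cell to a notebook file, creating it if needed.

    Returns the number of cells in the notebook after the append.
    """
    path = Path(notebook_path)
    if path.exists():
        notebook = json.loads(path.read_text(encoding="utf-8"))
    else:
        # Minimal nbformat 4 skeleton.
        notebook = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
    cell = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }
    # nbformat 4.5 and later expect every cell to carry an id.
    if notebook.get("nbformat_minor", 0) >= 5:
        cell["id"] = uuid.uuid4().hex[:8]
    notebook["cells"].append(cell)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return len(notebook["cells"])


# Demo in a temporary directory so nothing in the project is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "analysis.ipynb")
    append_code_cell(target, "import scanpy as sc")
    print(append_code_cell(target, "adata = sc.read_h5ad('data.h5ad')"))
```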

Configuration

LLM Settings

In main.py, modify LLM configuration:

from langchain_community.llms import Ollama
# from langchain_openai import ChatOpenAI

# Local LLM (default)
llm = Ollama(model='llama3.1', base_url='http://localhost:11434')

# Or use OpenAI
# llm = ChatOpenAI(model_name='gpt-4', temperature=0)

Code Sandbox Configuration

Update the notebook path in main.py:

code_sandbox = CodeSandbox(notebook_path='/your/path/to/analysis.ipynb')

Retry Attempts

Configure maximum retry attempts per step:

max_attempts = 2  # Default
# For batch effect correction: automatically set to 3
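
The per-step limit could be chosen along these lines. How CellAgent recognizes a batch effect correction step is not described here, so the keyword match in `max_attempts_for` is an assumption made for illustration.

```python
DEFAULT_MAX_ATTEMPTS = 2
BATCH_CORRECTION_MAX_ATTEMPTS = 3


def max_attempts_for(step_description: str) -> int:
    """Pick the retry limit for a step from its description (hypothetical rule)."""
    text = step_description.lower()
    # Assumed heuristic: treat steps mentioning batch correction or integration
    # as batch effect correction.
    if "batch" in text and ("correct" in text or "integrat" in text):
        return BATCH_CORRECTION_MAX_ATTEMPTS
    return DEFAULT_MAX_ATTEMPTS


print(max_attempts_for("Batch effect correction"))
print(max_attempts_for("Quality control and filtering"))
```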

Project Structure

CellAgent/
├── main.py                      # Entry point
├── src/
│   ├── __init__.py
│   ├── planner.py               # Task decomposition
│   ├── executor.py              # Step execution orchestration
│   ├── evaluator.py             # Result quality assessment
│   ├── tool_selector.py         # Tool selection logic
│   ├── code_programmer.py       # Code generation
│   ├── code_sandbox.py          # Jupyter execution environment
│   ├── memory.py                # Context management
│   ├── tools/
│   │   └── tool_registry.py     # Available tools registry
│   └── utils/
│       └── json_utils.py        # JSON parsing utilities
└── examples/
    └── notebooks/
        └── analysis.ipynb       # Generated analysis results

Workflow Diagram

User Input (Task + Data)
    ↓
Planner (Decompose into steps)
    ↓
For Each Step:
    ├─ Tool Selector (Choose tools)
    ├─ Code Programmer (Generate code)
    ├─ Code Sandbox (Execute code)
    ├─ Evaluator (Assess quality)
    └─ Self-Optimize if needed ↻
    ↓
Output: Jupyter Notebook with Complete Analysis
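
The per-step loop in the diagram can be sketched as follows. All names here (`run_step`, `Evaluation`, `StepResult`, and the stub callables) are hypothetical stand-ins, not the classes in src/. Returning a failed result instead of raising once attempts run out is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    passed: bool
    feedback: str = ""


@dataclass
class StepResult:
    step: str
    code: str
    attempts: int
    passed: bool


def run_step(step, select_tools, generate_code, execute, evaluate, max_attempts=2):
    """Select tools once, then generate, execute, and evaluate until the step passes."""
    tools = select_tools(step)
    feedback, code = "", ""
    for attempt in range(1, max_attempts + 1):
        code = generate_code(step, tools, feedback)
        output = execute(code)
        verdict = evaluate(step, output)
        if verdict.passed:
            return StepResult(step, code, attempt, True)
        feedback = verdict.feedback  # feed the critique into the next attempt
    return StepResult(step, code, max_attempts, False)


# Stub agents standing in for the LLM-backed components.
def select_tools(step):
    return ["scanpy"]


def generate_code(step, tools, feedback):
    return "fixed code" if feedback else "buggy code"


def execute(code):
    return "ok" if code == "fixed code" else "error"


def evaluate(step, output):
    return Evaluation(output == "ok", "" if output == "ok" else "execution failed")


results = [
    run_step(step, select_tools, generate_code, execute, evaluate)
    for step in ["quality control", "clustering"]
]
for result in results:
    print(result.step, result.attempts, result.passed)
```

In this stub run each step fails once and succeeds on the second attempt, which is the path the self-optimization arrow describes.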

Performance Considerations

  • First attempt success rate: Depends on LLM quality and task complexity
  • Typical execution time: 5-30 minutes per analysis (depends on data size and steps)
  • Memory requirements: 8GB+ RAM recommended for large datasets
  • GPU acceleration: Optional but recommended for faster execution

Supported Analysis Tasks

  • Quality control and filtering
  • Normalization and batch effect correction
  • Dimensionality reduction (PCA, UMAP, t-SNE)
  • Clustering and cell type annotation
  • Differential expression analysis
  • Gene ontology enrichment
  • Trajectory inference
  • Cell-cell interaction analysis

Troubleshooting

Issue: Data file not found

Solution: Verify the exact file path. Use absolute paths or ensure the relative path is correct.

Issue: LLM connection timeout

Solution:

  • For Ollama: Ensure the service is running (ollama serve)
  • Check that the base URL is correct (the default in main.py is http://localhost:11434)
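
A quick way to tell a stopped server from a missing model is to query Ollama's `/api/tags` endpoint, which lists the locally installed models. The helper name `list_ollama_models` is hypothetical, and the sketch uses only the standard library.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return installed model names, or None if the server cannot be reached."""
    url = base_url.rstrip("/") + "/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; start it with `ollama serve`.")
elif not any(name.startswith("llama3.1") for name in models):
    print("Server is up but llama3.1 is missing; run `ollama pull llama3.1`.")
else:
    print("Ollama is running with llama3.1 available.")
```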

Issue: Code execution fails in sandbox

Solution:

  • Check the generated code in the Notebook
  • Verify data format and compatibility
  • Increase max_attempts for problematic steps

Advanced Usage

Custom Tool Registry

Add custom analysis tools to src/tools/tool_registry.py:

class ToolRegistry:
    def get_available_tools(self):
        return {
            "custom_tool": {
                "name": "Custom Analysis Tool",
                "documentation": "Detailed documentation..."
            }
        }
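
If editing the registry file directly is undesirable, tools could instead be added at runtime by subclassing. The sketch below assumes the dictionary shape shown above. `ExtendedToolRegistry`, `register`, `format_tools_for_prompt`, and the `doublet_filter` entry are hypothetical, and the base class here is a stand-in for the real one in src/tools/tool_registry.py.

```python
class ToolRegistry:
    """Stand-in mirroring the shape shown above."""

    def get_available_tools(self):
        return {
            "custom_tool": {
                "name": "Custom Analysis Tool",
                "documentation": "Detailed documentation...",
            }
        }


class ExtendedToolRegistry(ToolRegistry):
    """Adds tools registered at runtime on top of the built-in ones."""

    def __init__(self):
        super().__init__()
        self._extra = {}

    def register(self, key: str, name: str, documentation: str) -> None:
        if not key or not documentation.strip():
            raise ValueError("a tool needs a key and non-empty documentation")
        self._extra[key] = {"name": name, "documentation": documentation}

    def get_available_tools(self):
        tools = dict(super().get_available_tools())
        tools.update(self._extra)  # runtime entries override built-ins with the same key
        return tools


def format_tools_for_prompt(tools: dict) -> str:
    """Render tool documentation as plain text for inclusion in an LLM prompt."""
    return "\n".join(
        f"- {key}: {info['name']}. {info['documentation']}"
        for key, info in sorted(tools.items())
    )


registry = ExtendedToolRegistry()
registry.register("doublet_filter", "Doublet Filter", "Flags likely doublets before clustering.")
print(format_tools_for_prompt(registry.get_available_tools()))
```

Requiring documentation at registration time matters because the documentation is what the Tool Selector has to work from.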

Memory Management

Access global memory during execution:

global_memory.add_code(code)
previous_codes = global_memory.get_all_codes()
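
A minimal class consistent with those two calls might look like the following. It is a sketch, not the implementation in src/memory.py. The `as_context` helper and its character budget are assumptions, added to show how history could be trimmed before it is placed in a prompt.

```python
class GlobalMemory:
    """Keeps the code of completed steps in execution order."""

    def __init__(self):
        self._codes = []

    def add_code(self, code: str) -> None:
        # Ignore empty snippets so they do not pad later prompts.
        if code and code.strip():
            self._codes.append(code)

    def get_all_codes(self) -> list:
        # Return a copy so callers cannot mutate the stored history.
        return list(self._codes)

    def as_context(self, max_chars: int = 4000) -> str:
        """Join the most recent snippets that fit in max_chars (hypothetical helper)."""
        kept, used = [], 0
        for code in reversed(self._codes):
            if used + len(code) > max_chars and kept:
                break
            kept.append(code)
            used += len(code)
        return "\n\n".join(reversed(kept))


global_memory = GlobalMemory()
global_memory.add_code("a=1")
global_memory.add_code("b=2")
previous_codes = global_memory.get_all_codes()
print(previous_codes)
```

Keeping the newest snippets when the budget is tight reflects the idea that later steps usually depend most on the code that ran just before them.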

Contributing

Contributions are welcome! Areas for improvement:

  • Additional bioinformatics tools integration
  • Enhanced evaluation metrics
  • Performance optimization
  • Better error handling and recovery

Citation

If you use CellAgent in your research, please cite:

@software{cellagent2024,
  author = {Liu, Shiqiang},
  title = {CellAgent: LLM-Driven Multi-Agent Framework for Automated scRNA-Seq Data Analysis},
  year = {2024},
  url = {https://github.com/liu-shiqiang/CellAgent}
}

License

This project is open-source and available under the MIT License.

Roadmap

  • Web UI for easier task input
  • Support for additional data formats (Zarr, Parquet)
  • Cloud deployment templates
  • Enhanced visualization library
  • Multi-dataset analysis support
  • Real-time progress tracking
  • Result export to multiple formats (HTML, PDF, etc.)

Last Updated: October 2024
Version: 1.0.0