PresentAgent-2: Towards Generalist Multimodal Presentation Agents
June 5, 2026
This is the code repository for the paper:
PresentAgent-2: Towards Generalist Multimodal Presentation Agents
Wei Wu*, Ziyang Xu*, Zeyu Zhang*†, Yang Zhao, Hao Tang‡
*Equal contribution. †Project lead. ‡Corresponding author.
Paper | Website | PresentEval | HF Paper
Note
To learn more about PresentAgent-2, please see the following presentation video, which was generated by PresentAgent-2 without any manual curation.
https://github.com/user-attachments/assets/2d780896-ffac-4f13-a928-0e6b313b2717
Citation
If you use any content of this repo for your work, please cite the following paper:
@article{wu2026presentagent2,
title={PresentAgent-2: Towards Generalist Multimodal Presentation Agents},
author={Wu, Wei and Xu, Ziyang and Zhang, Zeyu and Zhao, Yang and Tang, Hao},
journal={arXiv preprint arXiv:2605.11363},
year={2026}
}
News
Todo List
- code release
- paper release
- project page refinement
- demo video release
- benchmark release
- local deployment guide
Introduction
We present PresentAgent-2, an agentic framework that transforms open-ended user queries into narrated presentation videos. Unlike prior document-to-presentation systems that assume a complete source document as input, PresentAgent-2 begins from a short natural-language query and actively determines what should be explained, retrieves reliable textual and multimodal resources, and constructs a coherent presentation-style video. To achieve this, PresentAgent-2 employs a modular pipeline that first refines the query into a focused topic, conducts deep research over candidate sources, extracts multimodal materials including text, images, GIFs, and videos, and then plans the presentation structure, generates slides and scripts, synthesizes narration, and composes the final video with aligned audio-visual content. Importantly, dynamic media such as GIFs and videos are preserved during composition rather than converted into static screenshots, enabling richer and more expressive presentation pages. PresentAgent-2 further supports three complementary delivery modes within a unified framework: single-speaker presentation, multi-speaker discussion, and grounded interactive Q&A. To evaluate this new query-to-presentation video setting, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, assessing general presentation quality, multimodal media use, discussion effectiveness, and interaction grounding. These contributions demonstrate the potential of query-driven multimodal agents to transform brief user intents into structured, dynamic, and audience-oriented presentation videos.

Run Your PresentAgent-2
1. Install & Requirements
conda create -n presentagent2 python=3.11
conda activate presentagent2
pip install -r requirements.txt
cd presentagent/MegaTTS3
Model Download
The pretrained TTS checkpoints can be found at Hugging Face. Please download them and put them under:
presentagent/MegaTTS3/checkpoints/xxx
Requirements (for Linux)
# [Optional] Set GPU
export CUDA_VISIBLE_DEVICES=0
# Set Python path for PresentAgent-2
export PYTHONPATH="/path/to/PresentAgent-2:/path/to/PresentAgent-2/presentagent/MegaTTS3:$PYTHONPATH"
# Make sure ffmpeg and libreoffice/soffice are available in PATH
which ffmpeg
which libreoffice || which soffice
Requirements (for Windows)
conda install -y -c conda-forge pynini==2.1.5
pip install WeTextProcessing==1.0.3
# [Optional] Install GPU-specific PyTorch if needed
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# Install ffmpeg if needed
conda install -y -c conda-forge ffmpeg
# Make sure LibreOffice is installed for PPT-to-video conversion
# and soffice is available in PATH, or set SOFFICE_CMD manually.
# Set Python path for PresentAgent-2 (use the line that matches your shell)
# Command Prompt (cmd.exe):
set "PYTHONPATH=C:\path\to\PresentAgent-2;C:\path\to\PresentAgent-2\presentagent\MegaTTS3;%PYTHONPATH%"
# PowerShell:
$env:PYTHONPATH="C:\path\to\PresentAgent-2;C:\path\to\PresentAgent-2\presentagent\MegaTTS3;$env:PYTHONPATH"
# [Optional] Set GPU
# Command Prompt (cmd.exe):
set CUDA_VISIBLE_DEVICES=0
# PowerShell:
$env:CUDA_VISIBLE_DEVICES="0"
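The external-tool requirements above (ffmpeg and LibreOffice on PATH, with `SOFFICE_CMD` as a manual override) can be checked on either platform before running the pipeline. The following is a minimal sketch, not part of the repository: the helper names `find_tool` and `check_prerequisites` are hypothetical, and treating `SOFFICE_CMD` as an executable name or path is an assumption about how the pipeline reads it.

```python
import os
import shutil


def find_tool(candidates, env_override=None):
    """Return the path of the first resolvable executable, or None.

    If env_override names a non-empty environment variable, its value is
    tried first (assumption: it holds an executable name or full path).
    """
    names = list(candidates)
    if env_override and os.environ.get(env_override):
        names.insert(0, os.environ[env_override])
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def check_prerequisites():
    """Map each required external tool to its resolved path (None if missing)."""
    return {
        "ffmpeg": find_tool(["ffmpeg"]),
        "libreoffice": find_tool(["soffice", "libreoffice"], env_override="SOFFICE_CMD"),
    }


if __name__ == "__main__":
    for tool, path in check_prerequisites().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

A `NOT FOUND` line means the corresponding tool still needs to be installed or added to PATH.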
2. Run End-to-End Query-to-Video Pipeline
The main public entrypoint in this repository is:
scripts/run_url_to_video_pipeline.py
This script supports both:
- query -> DeepResearch -> top-3 HTML selection -> url -> source.md -> refined_doc.json -> pptx -> video
- url -> source.md -> refined_doc.json -> pptx -> video
Query to Video
python scripts/run_url_to_video_pipeline.py \
--question "xxx" \
--deepresearch-root /path/to/DeepResearch \
--output-root /path/to/output/presentagent2_demo \
--template-pptx /path/to/build_effective_agents.pptx \
--notes-mode discussion \
--num-slides 8 \
--deepresearch-conda-env deepresearch \
--max-wait-seconds 900 \
--poll-interval-seconds 30
URL to Video
python scripts/run_url_to_video_pipeline.py \
--url "xxx" \
--output-root /path/to/output/presentagent2_from_url \
--template-pptx /path/to/build_effective_agents.pptx \
--notes-mode discussion \
--num-slides 8
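When the pipeline is launched from another Python program, the two invocations above can be assembled as an argument list. This is a sketch under stated assumptions: `build_pipeline_command` is a hypothetical helper, it uses only the flags shown above, and its rule that exactly one of `question` or `url` is given is a choice of this wrapper, not confirmed behaviour of the script.

```python
import sys


def build_pipeline_command(output_root, template_pptx, *, question=None, url=None,
                           deepresearch_root=None, notes_mode="discussion",
                           num_slides=8):
    """Build the argv list for scripts/run_url_to_video_pipeline.py."""
    # Wrapper-level assumption: the two entry modes are mutually exclusive.
    if (question is None) == (url is None):
        raise ValueError("pass exactly one of question or url")

    cmd = [sys.executable, "scripts/run_url_to_video_pipeline.py"]
    if question is not None:
        cmd += ["--question", question]
        if deepresearch_root is not None:
            cmd += ["--deepresearch-root", str(deepresearch_root)]
    else:
        cmd += ["--url", url]

    cmd += [
        "--output-root", str(output_root),
        "--template-pptx", str(template_pptx),
        "--notes-mode", notes_mode,
        "--num-slides", str(num_slides),
    ]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Passing a list rather than a shell string avoids quoting problems in the query text.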
3. Output Structure
After running the pipeline, the output directory will contain:
output_root/
├── url_to_source/
│ ├── source.md
│ ├── report.log
│ └── candidates/
├── source_to_document/
│ ├── refined_doc.json
│ ├── document_overview.txt
│ └── media_summary.json
├── document_to_ppt/
│ └── <notes_mode>/
│ └── final_<notes_mode>.pptx
├── ppt_to_video/
│ └── <notes_mode>/
│ └── output.mp4
└── pipeline_summary.json
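A finished run can be checked against the layout above with a few lines of Python. This is an illustrative sketch: the helper names are hypothetical, only a subset of the listed files is checked, and it assumes `<notes_mode>` is the value passed to `--notes-mode` (for example `discussion`).

```python
from pathlib import Path


def expected_outputs(output_root, notes_mode):
    """Key files the output tree above is expected to contain."""
    root = Path(output_root)
    return [
        root / "url_to_source" / "source.md",
        root / "source_to_document" / "refined_doc.json",
        root / "document_to_ppt" / notes_mode / f"final_{notes_mode}.pptx",
        root / "ppt_to_video" / notes_mode / "output.mp4",
        root / "pipeline_summary.json",
    ]


def missing_outputs(output_root, notes_mode):
    """Return the expected files that do not exist yet."""
    return [p for p in expected_outputs(output_root, notes_mode) if not p.is_file()]


if __name__ == "__main__":
    # Placeholder path taken from the example commands; replace with a real run.
    for path in missing_outputs("/path/to/output/presentagent2_demo", "discussion"):
        print(f"missing: {path}")
```

An empty result suggests every stage wrote its main artifact. The first missing path indicates which stage to inspect.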
4. InteractionGUI
PresentAgent-2 supports a third delivery mode — interactive Q&A — via the InteractionGUI module. Given a generated presentation video and its accompanying document, audiences can ask natural-language questions and receive text + synthesized speech answers, while the video automatically seeks to the relevant section.
Setup
Install backend dependencies
cd presentagent/InteractionGUI
pip install -r requirements.txt
Configure environment variables
cp .env.example .env
Edit .env:
# LLM API configuration (OpenAI-compatible)
ANTHROPIC_AUTH_TOKEN=your_api_key_here
ANTHROPIC_PROVIDER=your_provider_here
ANTHROPIC_BASE_URL=https://api.example.com/v1
ANTHROPIC_MODEL=your_model_name
ANTHROPIC_TEMPERATURE=0.7
# Knowledge base — document used by the agent to answer questions
SOURCE_MD_PATH=./source/xxx/source.md
# Audio output directory
TTS_OUTPUT_DIR=./tts_output
Install frontend dependencies
cd presentagent/InteractionGUI/frontend
npm install
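Before starting the backend, the `.env` file can be sanity-checked for empty or missing entries. The sketch below uses a small hand-written parser instead of any specific dotenv library. Which keys are mandatory is an assumption (`REQUIRED_KEYS` is hypothetical, not taken from the backend code).

```python
from pathlib import Path

# Assumption: these keys must be non-empty for the backend to work.
REQUIRED_KEYS = (
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_MODEL",
    "SOURCE_MD_PATH",
    "TTS_OUTPUT_DIR",
)


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return required keys that are absent or empty, in declaration order."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.is_file():
        print("missing:", missing_keys(parse_env(env_file.read_text(encoding="utf-8"))))
    else:
        print(".env not found; copy .env.example first")
```

Run it from `presentagent/InteractionGUI`, where the `.env` file is created.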
Running
Backend (FastAPI)
cd presentagent/InteractionGUI
python main_api.py
Frontend (Next.js)
cd presentagent/InteractionGUI/frontend
npm run dev
Usage

- Input topic & generate video: enter a topic to generate a presentation video (single or discussion mode) via the pipeline
- Select a video: click "Select video" to load the generated .mp4/.webm/.mov file
- Upload a document: click "Select document" to upload the corresponding .md, .txt, or .json knowledge-base file
- Ask questions: type in the Q&A panel; the agent replies with text + synthesized audio, and the video seeks to the relevant section
- Play audio: click the floating audio player to hear the AI's spoken answer
Presentation Benchmark
Query-to-Presentation Benchmark
To support the evaluation of query-driven presentation generation, we curate a benchmark that measures whether an agent can transform an open-ended user request into a grounded, understandable, and multimodal presentation video. The benchmark covers the three presentation modes of PresentAgent-2:
- single presentation
- discussion
- interaction
It emphasizes source discovery, multimodal content grounding, presentation quality, and final video delivery.
PresentEval-2
To assess the quality of generated presentation videos, we adopt two complementary evaluation strategies: Objective Quiz Evaluation and Subjective Scoring.

For automatic evaluation, we follow the protocol described in our benchmark setting. Each generated video is evaluated from two perspectives: objective knowledge delivery and subjective mode-specific quality.
- In Objective Quiz Evaluation, the VLM acts as an audience member and answers five multiple-choice questions based on the generated video and a transcript of its audio, resulting in a quiz score from 0 to 5.
- In Subjective Scoring, the VLM judge assigns independent 1–5 scores to each generated result according to the three metrics defined for the corresponding presentation mode.
- We report average quiz scores, the mean subjective score computed from the three mode-specific metrics, and the individual metric scores over examples for each mode and model.
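The score arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the PresentEval-2 implementation: the function names and the input layout (a list of dicts with `quiz` and `metrics` keys) are hypothetical, and the reported subjective mean is assumed to be the average of per-example three-metric means.

```python
from statistics import mean


def quiz_score(predicted, answer_key):
    """Count correct answers; with five questions the score lies in 0..5."""
    if len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must have the same length")
    return sum(p == a for p, a in zip(predicted, answer_key))


def subjective_mean(metric_scores):
    """Average the three mode-specific 1-5 scores of one example."""
    if len(metric_scores) != 3:
        raise ValueError("expected exactly three mode-specific metrics")
    for name, score in metric_scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} score {score} is outside 1-5")
    return mean(metric_scores.values())


def aggregate(examples):
    """Average quiz and subjective scores over the examples of one mode and model."""
    return {
        "quiz": mean(e["quiz"] for e in examples),
        "subjective_mean": mean(subjective_mean(e["metrics"]) for e in examples),
    }
```

For instance, an example with metric scores 5, 4, and 3 has a subjective mean of 4. Four matching answers out of five give a quiz score of 4.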
For Objective Quiz Evaluation, we construct five multiple-choice questions for each example based on the corresponding reference presentation video and user query. A representative example across the three presentation modes is shown below, where the correct options are highlighted in bold.
| Mode | Question | Options |
|---|---|---|
| Single Presentation | What is the main idea of flow matching in generative modeling? | A. Learning a fixed dataset classifier **B. Matching a continuous transformation path** C. Compressing images into discrete tokens D. Training without any learned dynamics |
| Discussion Presentation | What key contrast distinguishes diffusion models from flow matching? | **A. Diffusion removes noise; flow matching learns a transformation path** B. Both methods only classify images C. Flow matching requires no training objective D. Diffusion models cannot generate samples |
| Interaction Presentation | When an audience member asks why flow matching can be more efficient than diffusion models, what is the best answer? | A. Flow matching avoids modeling data transformations. B. Flow matching replaces generation with classification. **C. Flow matching learns a continuous path and often needs fewer sampling steps.** D. Flow matching only works for Gaussian distributions. |
For Subjective Scoring, the mode-specific metrics are:
| Mode | Metric | Criterion |
|---|---|---|
| Single Presentation | Query Answering | Directly answers the query and covers key topic concepts. |
| | Deep Research Effectiveness | Uses relevant textual and multimodal resources to support the explanation. |
| | Video Delivery Quality | Delivers coherent content through slides, narration, and visuals. |
| Discussion Presentation | Discussion Effectiveness | Uses dialogue to clarify, compare, and extend the presented ideas. |
| | Speaker Role Complementarity | Maintains complementary speaker roles for questioning, explaining, and summarizing. |
| | Conversational Delivery | Provides natural, coherent, and easy-to-follow conversation. |
| Interaction Presentation | Answer Effectiveness | Answers audience questions correctly and directly. |
| | Content Comprehensibility | Provides clear, understandable, and unambiguous answers. |
| | Interaction Helpfulness | Offers useful clarification that supports audience understanding. |
🧪 Experiment
✳️ Comparative Study
| Method | Model | Single Quiz | Single Mean | Discussion Quiz | Discussion Mean | Interaction Quiz | Interaction Mean |
|---|---|---|---|---|---|---|---|
| Human Reference | Human-created | 4.82 | 4.46 | 4.83 | 4.40 | -- | -- |
| PresentAgent‑2 | Qwen3.5‑VL‑Plus | 4.84 | 4.47 | 4.85 | 4.37 | 4.85 | 4.52 |
| PresentAgent‑2 | Claude Opus 4.7 | 4.80 | 4.43 | 4.82 | 4.38 | 4.80 | 4.52 |
| PresentAgent‑2 | Gemini 3.1 Pro | 4.78 | 4.35 | 4.80 | 4.25 | 4.75 | 4.45 |
| PresentAgent‑2 | GPT‑5.5 | 4.83 | 4.25 | 4.77 | 4.17 | 4.75 | 4.46 |
| PresentAgent‑2 | GLM‑4.7V | 4.75 | 4.18 | 4.67 | 4.11 | 4.60 | 4.42 |