PresentAgent-2: Towards Generalist Multimodal Presentation Agents

June 5, 2026

This is the code repository for the paper:

PresentAgent-2: Towards Generalist Multimodal Presentation Agents

Wei Wu*, Ziyang Xu*, Zeyu Zhang*†, Yang Zhao, Hao Tang‡

*Equal contribution. †Project lead. ‡Corresponding author.

Paper | Website | PresentEval | HF Paper

Note

To learn more about PresentAgent-2, please see the following presentation video, which was generated by PresentAgent-2 without any manual curation.

https://github.com/user-attachments/assets/2d780896-ffac-4f13-a928-0e6b313b2717

Citation

If you use any content of this repo for your work, please cite the following paper:

@article{wu2026presentagent2,
  title={PresentAgent-2: Towards Generalist Multimodal Presentation Agents},
  author={Wu, Wei and Xu, Ziyang and Zhang, Zeyu and Zhao, Yang and Tang, Hao},
  journal={arXiv preprint arXiv:2605.11363},
  year={2026}
}

We present PresentAgent-2, an agentic framework that transforms open-ended user queries into narrated presentation videos. Unlike prior document-to-presentation systems that assume a complete source document as input, PresentAgent-2 begins from a short natural-language query and actively determines what should be explained, retrieves reliable textual and multimodal resources, and constructs a coherent presentation-style video. To achieve this, PresentAgent-2 employs a modular pipeline that first refines the query into a focused topic, conducts deep research over candidate sources, extracts multimodal materials including text, images, GIFs, and videos, and then plans the presentation structure, generates slides and scripts, synthesizes narration, and composes the final video with aligned audio-visual content. Importantly, dynamic media such as GIFs and videos are preserved during composition rather than converted into static screenshots, enabling richer and more expressive presentation pages. PresentAgent-2 further supports three complementary delivery modes within a unified framework: single-speaker presentation, multi-speaker discussion, and grounded interactive Q&A. To evaluate this new query-to-presentation video setting, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, assessing general presentation quality, multimodal media use, discussion effectiveness, and interaction grounding. These contributions demonstrate the potential of query-driven multimodal agents to transform brief user intents into structured, dynamic, and audience-oriented presentation videos.

Run Your PresentAgent-2

1. Install & Requirements

conda create -n presentagent2 python=3.11
conda activate presentagent2
pip install -r requirements.txt
cd presentagent/MegaTTS3

Model Download

The pretrained TTS checkpoints are available on Hugging Face. Download them and place them under:

presentagent/MegaTTS3/checkpoints/xxx

Requirements (for Linux)

# [Optional] Set GPU
export CUDA_VISIBLE_DEVICES=0

# Set Python path for PresentAgent-2
export PYTHONPATH="/path/to/PresentAgent-2:/path/to/PresentAgent-2/presentagent/MegaTTS3:$PYTHONPATH"

# Make sure ffmpeg and libreoffice/soffice are available in PATH
which ffmpeg
which libreoffice || which soffice
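
The PATH check above can also be scripted, which is convenient before a long pipeline run. The following is a minimal sketch using only the Python standard library; the helper name `find_missing_tools` is hypothetical and not part of this repository.

```python
import shutil


def find_missing_tools(required, alternatives=None):
    """Return the names of tools that cannot be found on PATH.

    `required` is a list of executable names. `alternatives` maps a tool
    name to other executable names that are acceptable substitutes.
    """
    alternatives = alternatives or {}
    missing = []
    for tool in required:
        candidates = [tool, *alternatives.get(tool, [])]
        if not any(shutil.which(name) for name in candidates):
            missing.append(tool)
    return missing


if __name__ == "__main__":
    # LibreOffice may be installed under the executable name `soffice`.
    missing = find_missing_tools(
        ["ffmpeg", "libreoffice"],
        alternatives={"libreoffice": ["soffice"]},
    )
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("ffmpeg and LibreOffice found.")
```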

Requirements (for Windows)

conda install -y -c conda-forge pynini==2.1.5
pip install WeTextProcessing==1.0.3

# [Optional] Install GPU-specific PyTorch if needed
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# Install ffmpeg if needed
conda install -y -c conda-forge ffmpeg

# Make sure LibreOffice is installed for PPT-to-video conversion
# and soffice is available in PATH, or set SOFFICE_CMD manually.

# Set Python path for PresentAgent-2
# cmd.exe:
set "PYTHONPATH=C:\path\to\PresentAgent-2;C:\path\to\PresentAgent-2\presentagent\MegaTTS3;%PYTHONPATH%"
# PowerShell:
$env:PYTHONPATH="C:\path\to\PresentAgent-2;C:\path\to\PresentAgent-2\presentagent\MegaTTS3;$env:PYTHONPATH"

# [Optional] Set GPU
# cmd.exe:
set CUDA_VISIBLE_DEVICES=0
# PowerShell:
$env:CUDA_VISIBLE_DEVICES=0

2. Run End-to-End Query-to-Video Pipeline

The main public entrypoint in this repository is:

scripts/run_url_to_video_pipeline.py

This script supports both:

query -> DeepResearch -> top-3 HTML selection -> url -> source.md -> refined_doc.json -> pptx -> video
url -> source.md -> refined_doc.json -> pptx -> video

Query to Video

python scripts/run_url_to_video_pipeline.py \
  --question "xxx" \
  --deepresearch-root /path/to/DeepResearch \
  --output-root /path/to/output/presentagent2_demo \
  --template-pptx /path/to/build_effective_agents.pptx \
  --notes-mode discussion \
  --num-slides 8 \
  --deepresearch-conda-env deepresearch \
  --max-wait-seconds 900 \
  --poll-interval-seconds 30

URL to Video

python scripts/run_url_to_video_pipeline.py \
  --url "xxx" \
  --output-root /path/to/output/presentagent2_from_url \
  --template-pptx /path/to/build_effective_agents.pptx \
  --notes-mode discussion \
  --num-slides 8
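
To launch the URL-to-video command from another Python program, the same arguments can be assembled as a list and passed to `subprocess.run`. This is a minimal sketch: the function names are hypothetical, and the defaults for `notes_mode` and `num_slides` are simply the values from the example command above, not documented defaults of the script.

```python
import subprocess
import sys


def build_url_to_video_command(url, output_root, template_pptx,
                               notes_mode="discussion", num_slides=8):
    """Assemble the argv list for the URL-to-video invocation shown above."""
    return [
        sys.executable,
        "scripts/run_url_to_video_pipeline.py",
        "--url", url,
        "--output-root", str(output_root),
        "--template-pptx", str(template_pptx),
        "--notes-mode", notes_mode,
        "--num-slides", str(num_slides),
    ]


def run_pipeline(command, repo_root="."):
    """Run the command from the repository root; raise on a non-zero exit."""
    subprocess.run(command, cwd=repo_root, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the URL or the paths contain spaces or special characters.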

3. Output Structure

After running the pipeline, the output directory will contain:

output_root/
├── url_to_source/
│   ├── source.md
│   ├── report.log
│   └── candidates/
├── source_to_document/
│   ├── refined_doc.json
│   ├── document_overview.txt
│   └── media_summary.json
├── document_to_ppt/
│   └── <notes_mode>/
│       └── final_<notes_mode>.pptx
├── ppt_to_video/
│   └── <notes_mode>/
│       └── output.mp4
└── pipeline_summary.json
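
A quick way to confirm that a run completed is to check for the key files in the tree above. The sketch below derives the paths from `output_root` and the notes mode; the function names and dictionary keys are hypothetical, and only file existence is checked because the contents of `pipeline_summary.json` are not described here.

```python
from pathlib import Path


def expected_outputs(output_root, notes_mode):
    """Map a short label to each key file in the output tree shown above."""
    root = Path(output_root)
    return {
        "source_md": root / "url_to_source" / "source.md",
        "refined_doc": root / "source_to_document" / "refined_doc.json",
        "pptx": root / "document_to_ppt" / notes_mode / f"final_{notes_mode}.pptx",
        "video": root / "ppt_to_video" / notes_mode / "output.mp4",
        "summary": root / "pipeline_summary.json",
    }


def missing_outputs(output_root, notes_mode):
    """Return the labels of expected files that do not exist yet."""
    paths = expected_outputs(output_root, notes_mode)
    return [label for label, path in paths.items() if not path.is_file()]
```

For example, `missing_outputs("/path/to/output/presentagent2_demo", "discussion")` returns an empty list once every stage has written its file.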

4. InteractionGUI

PresentAgent-2 supports a third delivery mode, interactive Q&A, via the InteractionGUI module. Given a generated presentation video and its accompanying document, audiences can ask natural-language questions and receive answers as text and synthesized speech, while the video automatically seeks to the relevant section.

Setup

Install backend dependencies

cd presentagent/InteractionGUI
pip install -r requirements.txt

Configure environment variables

cp .env.example .env

Edit .env:

# LLM API configuration (OpenAI-compatible)
ANTHROPIC_AUTH_TOKEN=your_api_key_here
ANTHROPIC_PROVIDER=your_provider_here
ANTHROPIC_BASE_URL=https://api.example.com/v1
ANTHROPIC_MODEL=your_model_name
ANTHROPIC_TEMPERATURE=0.7

# Knowledge base — document used by the agent to answer questions
SOURCE_MD_PATH=./source/xxx/source.md

# Audio output directory
TTS_OUTPUT_DIR=./tts_output
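
Before starting the backend, it can help to confirm that `.env` is filled in. The following is a minimal sketch that parses `KEY=VALUE` lines without third-party packages. Which keys are treated as required is an assumption made for illustration; the backend may accept fewer or need more.

```python
# Assumption: these keys must be non-empty for the backend to work.
REQUIRED_KEYS = (
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_MODEL",
    "SOURCE_MD_PATH",
    "TTS_OUTPUT_DIR",
)


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    with open(".env", encoding="utf-8") as handle:
        print("Missing keys:", missing_keys(parse_env_text(handle.read())))
```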

Install frontend dependencies

cd presentagent/InteractionGUI/frontend
npm install

Running

Backend (FastAPI)

cd presentagent/InteractionGUI
python main_api.py

Frontend (Next.js)

cd presentagent/InteractionGUI/frontend
npm run dev

Usage

Input topic & generate video — enter a topic to generate a presentation video (single or discussion mode) via the pipeline
Select a video — click "Select video" to load the generated .mp4/.webm/.mov file
Upload a document — click "Select document" to upload the corresponding .md, .txt, or .json knowledge-base file
Ask questions — type in the Q&A panel; the agent replies with text + synthesized audio, and the video seeks to the relevant section
Play audio — click the floating audio player to hear the AI's spoken answer

Presentation Benchmark

Query-to-Presentation Benchmark

To support the evaluation of query-driven presentation generation, we curate a benchmark that measures whether an agent can transform an open-ended user request into a grounded, understandable, and multimodal presentation video. The benchmark covers the three presentation modes of PresentAgent-2:

single presentation
discussion
interaction

It emphasizes source discovery, multimodal content grounding, presentation quality, and final video delivery.

PresentEval-2

To assess the quality of generated presentation videos, we adopt two complementary evaluation strategies: Objective Quiz Evaluation and Subjective Scoring.

For automatic evaluation, we follow the protocol described in our benchmark setting. Each generated video is evaluated from two perspectives: objective knowledge delivery and subjective mode-specific quality.

In Objective Quiz Evaluation, the VLM acts as an audience member and answers five multiple-choice questions after watching the generated video and reading a transcript of its audio, yielding a quiz score from 0 to 5.
In Subjective Scoring, the VLM judge assigns independent 1–5 scores to each generated result according to the three metrics defined for the corresponding presentation mode.
We report average quiz scores, the mean subjective score computed from the three mode-specific metrics, and the individual metric scores over examples for each mode and model.
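
The reported numbers can be computed along the following lines. This is a minimal sketch of the aggregation only, not the evaluation code of this repository: the function names and the per-example result format (a dict with `"quiz"` and `"metrics"`) are assumptions.

```python
from statistics import mean


def quiz_score(predicted, answer_key):
    """Count correct answers; with five questions the score is 0 to 5."""
    if len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must have the same length")
    return sum(
        p.strip().upper() == k.strip().upper()
        for p, k in zip(predicted, answer_key)
    )


def summarize(results):
    """Average quiz and subjective scores over per-example results.

    Each result is a dict with "quiz" (int) and "metrics", a dict holding
    the three 1-5 judge scores for the mode, keyed by metric name.
    """
    metric_names = list(results[0]["metrics"])
    per_metric = {
        name: mean(r["metrics"][name] for r in results)
        for name in metric_names
    }
    return {
        "quiz": mean(r["quiz"] for r in results),
        "subjective_mean": mean(per_metric.values()),
        "per_metric": per_metric,
    }
```

`summarize` would be called once per mode and model, since each mode has its own three metrics.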

For Objective Quiz Evaluation, we construct five multiple-choice questions for each example based on the corresponding reference presentation video and user query. A representative example across the three presentation modes is shown below, where the correct options are highlighted in bold.

Mode	Question	Options
Single Presentation	What is the main idea of flow matching in generative modeling?	A. Learning a fixed dataset classifier **B. Matching a continuous transformation path** C. Compressing images into discrete tokens D. Training without any learned dynamics
Discussion Presentation	What key contrast distinguishes diffusion models from flow matching?	**A. Diffusion removes noise; flow matching learns a transformation path** B. Both methods only classify images C. Flow matching requires no training objective D. Diffusion models cannot generate samples
Interaction Presentation	When an audience member asks why flow matching can be more efficient than diffusion models, what is the best answer?	A. Flow matching avoids modeling data transformations. B. Flow matching replaces generation with classification. **C. Flow matching learns a continuous path and often needs fewer sampling steps.** D. Flow matching only works for Gaussian distributions.

For Subjective Scoring, the mode-specific metrics are:

Mode	Metric	Criterion
Single Presentation	Query Answering	Directly answers the query and covers key topic concepts.
	Deep Research Effectiveness	Uses relevant textual and multimodal resources to support the explanation.
	Video Delivery Quality	Delivers coherent content through slides, narration, and visuals.
Discussion Presentation	Discussion Effectiveness	Uses dialogue to clarify, compare, and extend the presented ideas.
	Speaker Role Complementarity	Maintains complementary speaker roles for questioning, explaining, and summarizing.
	Conversational Delivery	Provides natural, coherent, and easy-to-follow conversation.
Interaction Presentation	Answer Effectiveness	Answers audience questions correctly and directly.
	Content Comprehensibility	Provides clear, understandable, and unambiguous answers.
	Interaction Helpfulness	Offers useful clarification that supports audience understanding.
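
When collecting judge outputs, the table above can be encoded as a lookup so that malformed scores are caught early. The sketch below assumes the judge returns a dict of metric name to score; `RUBRIC` and `validate_judge_scores` are hypothetical names, while the metric names are copied from the table.

```python
RUBRIC = {
    "Single Presentation": (
        "Query Answering",
        "Deep Research Effectiveness",
        "Video Delivery Quality",
    ),
    "Discussion Presentation": (
        "Discussion Effectiveness",
        "Speaker Role Complementarity",
        "Conversational Delivery",
    ),
    "Interaction Presentation": (
        "Answer Effectiveness",
        "Content Comprehensibility",
        "Interaction Helpfulness",
    ),
}


def validate_judge_scores(mode, scores):
    """Return a list of problems; an empty list means the scores are valid."""
    expected = RUBRIC[mode]
    problems = []
    for metric in expected:
        if metric not in scores:
            problems.append(f"missing: {metric}")
        elif not 1 <= scores[metric] <= 5:
            problems.append(f"out of range: {metric}")
    for metric in scores:
        if metric not in expected:
            problems.append(f"unexpected: {metric}")
    return problems
```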

🧪 Experiment

✳️ Comparative Study

Method	Model	Single Quiz	Single Mean	Discussion Quiz	Discussion Mean	Interaction Quiz	Interaction Mean
Human Reference	Human-created	4.82	4.46	4.83	4.40	--	--
PresentAgent-2	Qwen3.5-VL-Plus	4.84	4.47	4.85	4.37	4.85	4.52
PresentAgent-2	Claude Opus 4.7	4.80	4.43	4.82	4.38	4.80	4.52
PresentAgent-2	Gemini 3.1 Pro	4.78	4.35	4.80	4.25	4.75	4.45
PresentAgent-2	GPT-5.5	4.83	4.25	4.77	4.17	4.75	4.46
PresentAgent-2	GLM-4.7V	4.75	4.18	4.67	4.11	4.60	4.42