Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

May 8, 2026

[Figure: Five-level taxonomy of visual generation]

This repository hosts a living roadmap on modern visual generation. The project organizes recent progress in image generation and editing around a capability-oriented view of visual intelligence: moving from one-shot appearance synthesis toward controllable composition, persistent context, agentic interaction, and causal world modeling.

A companion Visual Generation Roadmap website offers a richer visualization of the taxonomy, the modern research landscape, and the full gallery of stress-test cases. The roadmap is intended to grow with the community: if a paper should be included, or you notice a missing reference or a misclassification, please open a pull request or an issue, and we will update both the survey and the website. If you find this work useful, please consider citing it.

Core Thesis

Recent visual generation models have improved photorealism and instruction following, but stronger images do not automatically imply stronger visual intelligence. The next bottlenecks are structural, temporal, and causal: models must preserve identity, obey spatial constraints, render exact symbols, reason over external data, interact through closed loops, and verify that generated artifacts satisfy the intended constraints.

We frame this evolution as a five-level progression:

| Level | Capability | Short Description |
|---|---|---|
| L1 | Atomic Generation | One-shot probabilistic rendering from prompts or latent codes. |
| L2 | Conditional Generation | Faithful generation under explicit controls, layouts, references, or constraints. |
| L3 | In-Context Generation | Multi-reference, multi-condition, and long-context generation with persistent state. |
| L4 | Agentic Generation | Multi-call planning, generation, verification, rollback, and tool use. |
| L5 | World-Modeling Generation | Causal, physical, and action-conditioned simulation of visual worlds. |
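
To make the L4 row concrete, here is a minimal sketch of a plan, generate, verify, and rollback loop. It is an illustration only, not code from this repository: `run_plan`, `toy_apply`, and `toy_verify` are hypothetical names, and a tuple of object names stands in for an image so that the control flow can run without a real model.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[str, ...]  # toy stand-in for an image: the objects it contains


def run_plan(
    initial: State,
    steps: Sequence[str],
    apply_step: Callable[[State, str, int], State],
    verify: Callable[[State, State, str], bool],
    max_retries: int = 2,
) -> List[State]:
    """Execute steps in order, keeping only verified states as checkpoints."""
    checkpoints: List[State] = [initial]
    for step in steps:
        for attempt in range(max_retries + 1):
            candidate = apply_step(checkpoints[-1], step, attempt)
            if verify(checkpoints[-1], candidate, step):
                checkpoints.append(candidate)  # commit the verified state
                break
            # Rollback: the failed candidate is discarded, so the next
            # attempt restarts from the last verified checkpoint.
        else:
            raise RuntimeError(f"step {step!r} failed after {max_retries + 1} attempts")
    return checkpoints


def toy_apply(state: State, step: str, attempt: int) -> State:
    """Hypothetical generator: the first try at 'add lamp' drops earlier content."""
    target = step.split(" ", 1)[1]
    if step == "add lamp" and attempt == 0:
        return (target,)  # simulates an edit that silently undoes earlier ones
    return state + (target,)


def toy_verify(previous: State, candidate: State, step: str) -> bool:
    """Accept only if the new object is present and nothing earlier was lost."""
    target = step.split(" ", 1)[1]
    return target in candidate and all(obj in candidate for obj in previous)


checkpoints = run_plan((), ["add table", "add lamp", "add cat"], toy_apply, toy_verify)
print(checkpoints[-1])
```

In a real system the verifier would be an external tool such as a detector, an OCR engine, or a VLM judge. The point of the sketch is only that a failed candidate never becomes the base for later steps.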

What Is in This Repo

Roadmap at a Glance

The roadmap argues that progress is no longer a single axis of image fidelity. It is a nested expansion of capability:

  1. Modeling moves from GANs to diffusion, flow matching, autoregressive modeling, and hybrid AR-diffusion systems.
  2. Architecture converges toward tokenizers/VAEs, transformer backbones, condition modules, and multimodal fusion mechanisms.
  3. Training shifts from scale alone to data density, VLM relabeling, continued training, SFT, preference optimization, and deployment acceleration.
  4. Applications increasingly demand verifiable constraints: exact text, layout, identity, domain rules, external data, and physical interaction.
  5. Evaluation must move from perceptual similarity toward parsers, OCR, graph validators, simulators, theorem checkers, and red-team agents.
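
As one illustration of the verifier-style evaluation in item 5, exact text rendering can be scored by comparing the string an OCR engine reads back from the image with the string the prompt requested. The sketch below assumes the OCR output is already available as plain text; `edit_distance` and `char_error_rate` are illustrative helper names, not part of this repository.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def char_error_rate(target: str, recognized: str) -> float:
    """Edit distance normalized by the length of the requested text."""
    return edit_distance(target, recognized) / max(len(target), 1)


# Hypothetical case: the rendered sign swaps two digits.
target = "Platform 12 departs 09:45"
recognized = "Platform 21 departs 09:45"
print(edit_distance(target, recognized), char_error_rate(target, recognized))
```

A perceptual metric would likely rate the two signs as near-identical, while the character-level check flags the swapped digits directly.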

Selected Figures

| Topic | Figure |
|---|---|
| Research landscape | [Figure: Research landscape] |
| Modeling paradigms | [Figure: Modeling paradigms] |
| Closed-source agentic systems | [Figure: Closed-source agentic systems] |
| Training pipeline | [Figure: Training pipeline] |
| Data pipeline | [Figure: Data pipeline] |

Stress-Test Examples

Standard metrics can miss failures that matter. This repo includes selected qualitative cases where outputs are visually polished but violate geometric, topological, physical, or procedural constraints.

| Test | Target Capability | Typical Failure |
|---|---|---|
| Jigsaw reconstruction | Spatial structuring | Hallucinates plausible content instead of rigidly reassembling pieces. |
| Metro map | Graph/topology following | Produces a convincing map but violates transfer and crossing constraints. |
| Isometric tile map | Coordinate grounding | Places objects in nearby but incorrect grid cells. |
| Fluid dynamics | Causal state transition | Must distinguish plausible appearance from physically faithful intervention. |
| Multi-turn editing | Persistent identity and constraint memory across turns | Drifts in identity, layout, or previously satisfied constraints as edits accumulate; later turns silently undo earlier ones. |
| Long-form text rendering | Exact symbolic rendering and typography | Generates near-correct glyphs with character-level errors, swapped digits, or inconsistent fonts in long strings. |
| Counting and quantity | Numerical grounding | Produces a visually plausible scene with the wrong number of instances when the prompt specifies an exact count. |
| Occlusion and depth ordering | 3D-consistent compositional reasoning | Renders objects with mutually inconsistent occlusion or depth cues that violate a single 3D layout. |
| Compositional binding | Attribute-to-entity binding | Swaps or merges colors, materials, and parts across multiple bound entities in the same scene. |
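
The isometric tile map row suggests a simple automated check: compare the grid cell each object was requested in with the cell it was detected in. The sketch below assumes a detector has already mapped each object to a `(row, col)` cell. That detector, the `grid_errors` helper, and the example scene are all hypothetical.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (row, col) on the tile grid


def grid_errors(expected: Dict[str, Cell], detected: Dict[str, Cell]) -> Dict[str, str]:
    """Describe every requested object that is missing or in the wrong cell."""
    errors: Dict[str, str] = {}
    for name, cell in expected.items():
        found = detected.get(name)
        if found is None:
            errors[name] = "missing"
        elif found != cell:
            offset = (found[0] - cell[0], found[1] - cell[1])
            errors[name] = f"expected {cell}, found {found}, offset {offset}"
    return errors


# Hypothetical scene: the tree lands one cell to the right, the house is absent.
expected = {"tree": (2, 3), "well": (0, 0), "house": (4, 1)}
detected = {"tree": (2, 4), "well": (0, 0)}
errors = grid_errors(expected, detected)
print(errors)
```

Reporting the offset rather than a bare pass or fail makes the "nearby but incorrect" failure mode visible, since small nonzero offsets are exactly what that row describes.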

The full gallery, including more multi-turn editing cases, is hosted on the project page; see docs/stress_tests.md for additional details.
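
For the metro map case, a graph validator of the following shape is one way to check topology. It is a sketch under stated assumptions: each line is given as an ordered list of stations, and the rendered map has already been parsed back into the same form by some upstream parser that is not shown. The function names are illustrative. Only segments and transfer stations are checked, so geometric crossing constraints are out of scope here.

```python
from typing import Dict, FrozenSet, List, Set

Lines = Dict[str, List[str]]  # line name -> ordered station names


def segments(lines: Lines) -> Dict[str, Set[FrozenSet[str]]]:
    """Undirected track segments between consecutive stations, per line."""
    return {
        name: {frozenset(pair) for pair in zip(stops, stops[1:])}
        for name, stops in lines.items()
    }


def transfers(lines: Lines) -> Dict[str, Set[str]]:
    """Stations served by more than one line, with the lines that meet there."""
    served: Dict[str, Set[str]] = {}
    for name, stops in lines.items():
        for station in stops:
            served.setdefault(station, set()).add(name)
    return {s: names for s, names in served.items() if len(names) > 1}


def topology_violations(spec: Lines, rendered: Lines) -> List[str]:
    """List segment and transfer mismatches between the spec and the rendered map."""
    problems: List[str] = []
    want_seg, got_seg = segments(spec), segments(rendered)
    for name in sorted(set(spec) | set(rendered)):
        want, got = want_seg.get(name, set()), got_seg.get(name, set())
        for seg in sorted("-".join(sorted(s)) for s in want - got):
            problems.append(f"{name}: missing segment {seg}")
        for seg in sorted("-".join(sorted(s)) for s in got - want):
            problems.append(f"{name}: unexpected segment {seg}")
    want_tr, got_tr = transfers(spec), transfers(rendered)
    for station in sorted(set(want_tr) | set(got_tr)):
        if want_tr.get(station) != got_tr.get(station):
            problems.append(f"transfer mismatch at {station}")
    return problems


# Hypothetical case: the blue line is drawn through C instead of B.
spec = {"red": ["A", "B", "C"], "blue": ["D", "B", "E"]}
rendered = {"red": ["A", "B", "C"], "blue": ["D", "C", "E"]}
problems = topology_violations(spec, rendered)
for problem in problems:
    print(problem)
```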

Reference Organization

The full bibliography is maintained in references/citation.bib. The list below follows the roadmap sections and uses an awesome-list style: each entry gives the concrete paper name, a paper link when available (preferably arXiv), venue/year, and a short role in the roadmap.

Sec. 1: Motivation and New-Era Visual Generation

Sec. 2: Five-Level Taxonomy of Visual Intelligence

Sec. 3: Modeling Paradigms and Architectures

Sec. 4: Training, Alignment, and Acceleration

Sec. 5: Data, Benchmarks, and Infrastructure

Sec. 6: Applications and Evolving Frontiers

Sec. 7: In-the-Wild Stress Tests

  • Low-complexity single-image super-resolution based on nonnegative neighbor embedding (BMVC, 2012) — Classical super-resolution benchmark used for low-level restoration.
  • Deep retinex decomposition for low-light enhancement (arXiv, 2018) — Low-light enhancement benchmark.
  • A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics (ICCV, 2001) — Berkeley segmentation dataset used in denoising/restoration evaluation.
  • Deep joint rain detection and removal from a single image (CVPR, 2017) — Rain removal benchmark for deraining stress tests.
  • Deep multi-scale convolutional neural network for dynamic scene deblurring (CVPR, 2017) — Dynamic-scene deblurring benchmark.

Sec. 8: Future Directions

  • Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing (arXiv, 2025) — Planning-based complex instruction image editing.
  • MIRA: Multimodal Iterative Reasoning Agent for Image Editing (arXiv, 2025) — Multimodal iterative reasoning agent for editing.
  • Image Editing As Programs with Diffusion Models (arXiv, 2025) — Programmatic view of image editing with diffusion models.
  • AI-Generated Images as Data Source: The Dawn of Synthetic Era (arXiv, 2023) — Position paper on AI-generated images as synthetic training data.
  • Recurrent world models facilitate policy evolution (NeurIPS, 2018) — Early recurrent world model for policy evolution.
  • Mastering diverse control tasks through world models (Nature, 2025) — Generalist world-model RL across diverse control tasks.
  • A path towards autonomous machine intelligence (Open Review, 2022) — Predictive world-modeling agenda for autonomous intelligence.
  • Genie: Generative Interactive Environments (ICML, 2024) — Generative interactive environments from unlabeled videos.
  • Diffusion for world modeling: Visual details matter in Atari (NeurIPS, 2024) — Diffusion-based Atari world modeling.
  • Oasis: A universe in a transformer (Technical Report, 2024) — Transformer-based interactive Minecraft-like world model.
  • GameGen-X: Interactive open-world game video generation (ICLR, 2025) — Interactive open-world game video generation.
  • World simulation with video foundation models for physical AI (arXiv, 2025) — Video foundation model for physical-AI world simulation.
  • ST-Raptor: LLM-Powered Semi-Structured Table Question Answering (SIGMOD, 2026) — Semi-structured table question answering with hierarchical trees.
  • MoDora: Tree-Based Semi-Structured Document Analysis System (SIGMOD, 2026) — Tree-based semi-structured document analysis.
  • FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data (arXiv, 2025) — Data-agent benchmark over heterogeneous analytical queries.
  • Synthesizing Natural Language to Visualization (NL2VIS) Benchmarks from NL2SQL Benchmarks (SIGMOD, 2021) — Natural-language-to-visualization benchmark synthesis.
  • DataVisT5: A Pre-Trained Language Model for Jointly Understanding Text and Data Visualization (ICDE, 2025) — Unified model for text and data visualization understanding.

Community suggestions are welcome — please open a pull request or an issue with the paper you would like to see added, and we will keep folding new entries into the roadmap.

Citation

If you find this roadmap useful, please cite the project. A formal arXiv citation will be added once available.

@article{wu2026visual,
  title={Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling},
  author={Wu, Keming and Yang, Zuhao and Zhang, Kaichen and Wang, Shizun and Zhu, Haowei and Leng, Sicong and Yang, Zhongyu and Wang, Qijie and Wang, Sudong and Wang, Ziting and others},
  journal={arXiv preprint arXiv:2604.28185},
  year={2026}
}
