CVPR-2025-Papers
June 30, 2025
Conference dates: June 11-15, 2025
Conference website: https://cvpr.thecvf.com/
❣❣❣ CVPR 2025 paper categorization in progress
For 2025 survey papers, see ↘️2025-CV-Surveys
2025 categorized paper collections:
↘️WACV-2025-Papers ↘️CVPR-2025-Papers ↘️ICCV-2025-Papers
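2024 categorized paper collection: click here
2023 categorized paper collection: click here
2022 categorized paper collection: click here
2021 categorized paper collection: click here
2020 categorized paper collection: click here
❣❣❣ CVPR 2025 paper categorization is complete
:loudspeaker::loudspeaker::loudspeaker:Award-Winning Papers
:trophy:Best Paper
:trophy:Best Student Paper
:trophy:Best Paper Honorable Mention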
- Navigation World Models
- 3D Student Splatting and Scooping
- MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
:trophy:Best Student Paper Honorable Mention
57.Computational Imaging
- AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos
:star:code - Dynamic Camera Poses and Where to Find Them
:house:project - EquiPose: Exploiting Permutation Equivariance for Relative Camera Pose Estimation
- HyperPose: Hypernetwork-Infused Camera Pose Localization and an Extended Cambridge Landmarks Dataset
- FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
- Camera Relocalization
56.Multi-view Clustering
- AdaptCMVC: Robust Adaption to Incremental Views in Continual Multi-view Clustering
- Deep Fair Multi-View Clustering with Attention KAN
- Imputation-free and Alignment-free: Incomplete Multi-view Clustering Driven by Consensus Semantic Learning
- Medusa: A Multi-Scale High-order Contrastive Dual-Diffusion Approach for Multi-View Clustering
- A Hubness Perspective on Representation Learning for Graph-Based Multi-View Clustering
- EASEMVC: Efficient Dual Selection Mechanism for Deep Multi-View Clustering
- ROLL: Robust Noisy Pseudo-label Learning for Multi-View Clustering with Noisy Correspondence
- Enhanced then Progressive Fusion with View Graph for Multi-View Clustering
55.Retrieval-Augmented Generation
54.Animation
- AniDoc: Animation Creation Made Easier
- X-Dyna: Expressive Dynamic Human Image Animation
- EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
- StableAnimator: High-Quality Identity-Preserving Human Image Animation
- Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters
- PhysAnimator: Physics-Guided Generative Cartoon Animation
- Free-viewpoint Human Animation with Pose-correlated Reference Selection
- Consistent and Controllable Image Animation with Motion Diffusion Models
- Let's Chorus: Partner-aware Hybrid Song-Driven 3D Head Animation
- MotiF: Making Text Count in Image Animation with Motion Focal Loss
- FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
- Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling
- Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach
- Portrait Animation
- Sonic: Shifting Focus to Global Audio Perception in Portrait Animation
- Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer
- High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model
- KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
- HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation
:star:code - Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation
- Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion
- Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation
- MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation
- MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation
53.Sketch
- Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch
- Image Referenced Sketch Colorization Based on Animation Creation Workflow
- SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models
:star:code - SketchAgent: Language-Driven Sequential Sketch Generation
- 3D Sketch
52.Animal
- Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation
:star:code - Probabilistic Prompt Distribution Learning for Animal Pose Estimation
:star:code - AniMo: Species-Aware Model for Text-Driven Animal Motion Generation
- AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer
- Reconstructing Animals and the Wild
51.Protecting Copyright
- CDI: Copyrighted Data Identification in Diffusion Models
- Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models
- Vision-Language Model IP Protection via Prompt-based Learning
- Watermarking
- 3D-GSW: 3D Gaussian Splatting for Robust Watermarking
- GuardSplat: Efficient and Robust Watermarking for 3D Gaussian Splatting
- OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking
- Watermarking One for All: A Robust Watermarking Scheme Against Partial Image Theft
- EntropyMark: Towards More Harmless Backdoor Watermark via Entropy-based Constraint for Open-source Dataset Copyright Protection
- SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models
- Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models
50.Dense Prediction
- Unified Dense Prediction of Video Diffusion
- Frequency Dynamic Convolution for Dense Image Prediction
:star:code - A Unified Image-Dense Annotation Generation Model for Underwater Scenes
49.Image Fusion
- DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion
:star:code - Task-driven Image Fusion with Learnable Fusion Loss
- Binarized Neural Network for Multi-spectral Image Fusion
- Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond
- Self-Learning Hyperspectral and Multispectral Image Fusion via Adaptive Residual Guided Subspace Diffusion Model
- One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion
:star:code - A Selective Re-learning Mechanism for Hyperspectral Fusion Imaging
48.Feature Matching
- CoMatcher: Multi-View Collaborative Feature Matching
- JamMa: Ultra-lightweight Local Feature Matching with Joint Mamba
:star:code - FG^2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching
- EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching
47.Industrial Anomaly Detection (Industrial Defect Detection)
- DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection
- Towards Training-free Anomaly Detection with Vision and Language Foundation Models
:star:code - The MVTec AD 2 Dataset: Advanced Scenarios for Unsupervised Anomaly Detection
:house:project
:house:project
:house:project - Wavelet and Prototype Augmented Query-based Transformer for Pixel-level Surface Defect Detection
- Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties
- AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios
- Anomaly Detection
- One-for-More: Continual Diffusion Model for Anomaly Detection
:star:code - AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP
:star:code - Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection
:star:code - Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection
- Distribution Prototype Diffusion Learning for Open-set Supervised Anomaly Detection
- UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection
- Unseen Visual Anomaly Generation
- PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies
- Odd-One-Out: Anomaly Detection by Comparing with Neighbors
- Beyond Single-Modal Boundary: Cross-Modal Anomaly Detection through Visual Prototype and Harmonization
- PIAD: Pose and Illumination agnostic Anomaly Detection
- DFM: Differentiable Feature Matching for Anomaly Detection
- A Unified Latent Schrodinger Bridge Diffusion Model for Unsupervised Anomaly Detection and Localization
- TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection
- Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection
- Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection
46.Neural Radiance Fields
- LookCloser: Frequency-aware Radiance Field for Tiny-Detail Scene
- RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings
- LIRM: Large Inverse Rendering Model for Progressive Reconstruction of Shape, Materials and View-dependent Radiance Fields
- PBR-NeRF: Inverse Rendering with Physics-Based Neural Fields
- NeISF++: Neural Incident Stokes Field for Polarized Inverse Rendering of Conductors and Dielectrics
- Time of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields
- Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video
- RelationField: Relate Anything in Radiance Fields
- Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction
- View Synthesis
- EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis
- NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting
:star:code - SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs
:star:code - CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis
- Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views
:star:code - LITA-GS: Illumination-Agnostic Novel View Synthesis via Reference-Free 3D Gaussian Splatting and Physical Priors
:star:code - NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images
- MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
- Novel View Synthesis with Pixel-Space Diffusion Models
- FrugalNeRF: Fast Convergence for Extreme Few-shot Novel View Synthesis without Learned Priors
- Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion
- AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
- GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis
- SimVS: Simulating World Inconsistencies for Robust View Synthesis
- EVPGS: Enhanced View Prior Guidance for Splatting-based Extrapolated View Synthesis
- StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models
- Rendering
- Differentiable Inverse Rendering with Interpretable Basis BRDFs
- Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes
- TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering
:star:code - MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction
:star:code - BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation
:star:code - Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models
- 3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes
- Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering
- AMO Sampler: Enhancing Text Rendering with Overshooting
- 4D
- 4Deform: Neural Surface Deformation for Robust Shape Interpolation
- Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis
:star:code - Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
- DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
- Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
- MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds
- DIO: Decomposable Implicit 4D Occupancy-Flow World Model
- DNF: Unconditional 4D Generation with Dictionary-based Neural Fields
- 4D-Fly: Fast 4D Reconstruction from a Single Monocular Video
- CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
- Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
- GIFStream: 4D Gaussian-based Immersive Video with Feature Stream
- FIction: 4D Future Interaction Prediction from Video
- NTR-Gaussian: Nighttime Dynamic Thermal Reconstruction with 4D Gaussian Splatting Based on Thermodynamics
- DrivingSphere: Building a High-fidelity 4D World for Closed-loop Simulation
- 4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians
- Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation
- Robust Multi-Object 4D Generation for In-the-wild Videos
- 4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion
45.Anomaly Detection
- OOD
- CADRef: Robust Out-of-Distribution Detection via Class-Aware Decoupled Relative Feature Leveraging
- Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations
- ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth Networks
:star:code - DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection
- Dual Energy-Based Model with Open-World Uncertainty Estimation for Out-of-distribution Detection
- OODD: Test-time Out-of-Distribution Detection with Dynamic Dictionary
- Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection
- H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection
- Beyond Clean Training Data: A Versatile and Model-Agnostic Framework for Out-of-Distribution Detection with Contaminated Training Data
- Leveraging Perturbation Robustness to Enhance Out-of-Distribution Detection
- Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
- On the Out-Of-Distribution Generalization of Large Multimodal Models
- Detecting Out-of-Distribution Through the Lens of Neural Collapse
- Open Set Label Shift with Test Time Out-of-Distribution Reference
- Simplification Is All You Need against Out-of-Distribution Overconfidence
- Image Anomaly Detection
44.Object Pose Estimation
- Co-op: Correspondence-based Novel Object Pose Estimation
- GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation
:star:code - GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation
- UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image
- Rethinking Correspondence-based Category-Level Object Pose Estimation
- CRISP: Object Pose and Shape Estimation with Test-Time Adaptation
- 6D
- Any6D: Model-free 6D Pose Estimation of Novel Objects
:house:project - RefPose: Leveraging Reference Geometric Correspondences for Accurate 6D Pose Estimation of Unseen Objects
- UA-Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References
- ONDA-Pose: Occlusion-Aware Neural Domain Adaptation for Self-Supervised 6D Object Pose Estimation
- iG-6DoF: Model-free 6DoF Pose Estimation for Unseen Object via Iterative 3D Gaussian Splatting
- Leveraging Global Stereo Consistency for Category-Level Shape and 6D Pose Estimation from Stereo Images
- One2Any: One-Reference 6D Pose Estimation for Any Object
- Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision
- Pos3R: 6D Pose Estimation for Unseen Objects Made Easy
- CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image
43.Object Re-Id/Counting
- T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting
:star:code - AirRoom: Objects Matter in Room Reidentification
- Single Domain Generalization for Few-Shot Counting via Universal Representation Matching
- Object Re-identification
42.Graph Neural Network(GNN/GCN)
- Graph Neural Network Combining Event Stream and Periodic Aggregation for Low-Latency Event-based Vision
- Deterministic Certification of Graph Neural Networks against Graph Poisoning Attacks with Arbitrary Perturbations
41.Few/Zero-Shot Learning/DG/DA (Few-/Zero-Shot Learning, Domain Generalization, Domain Adaptation)
- FSL
- ZSL
- DG
- Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection
:star:code - Unlocking the Potential of Unlabeled Data in Semi-Supervised Domain Generalization
:star:code - OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP
- When Domain Generalization meets Generalized Category Discovery: An Adaptive Task-Arithmetic Driven Approach
- Domain Generalization in CLIP via Learning with Diverse Text Prompts
- SoMA: Singular Value Decomposed Minor Components Adaptation for Domain Generalizable Representation Learning
- PEER Pressure: Model-to-Model Regularization for Single Source Domain Generalization
- Seeking Consistent Flat Minima for Better Domain Generalization via Refining Loss Landscapes
- Gradient-Guided Annealing for Domain Generalization
- Adversarial Domain Prompt Tuning and Generation for Single Domain Generalization
- Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization
- TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction
- DA
- Distinguish Then Exploit: Source-free Open Set Domain Adaptation via Weight Barcode Estimation and Sparse Label Assignment
- Link-based Contrastive Learning for One-Shot Unsupervised Domain Adaptation
- Revisiting Source-Free Domain Adaptation: Insights into Representativeness, Generalization, and Variety
- ADU: Adaptive Detection of Unknown Categories in Black-Box Domain Adaptation
- MODfinity: Unsupervised Domain Adaptation with Multimodal Information Flow Intertwining
- Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation
- Generalized Category Discovery
- GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
- Adaptive Part Learning for Fine-Grained Generalized Category Discovery: A Plug-and-Play Enhancement
- Less Attention is More: Prompt Transformer for Generalized Category Discovery
- MOS: Modeling Object-Scene Associations in Generalized Category Discovery
40.Deepfake Detection/AI-Generated Image Detection
- FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing
- D^3: Scaling Up Deepfake Detection by Learning from Discrepancy
- SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
- Where the Devil Hides: Deepfake Detectors Can No Longer Be Trusted
- AI-Generated Image Detection
- Towards Universal AI-Generated Image Detection by Variational Information Bottleneck Network
- A Bias-Free Training Paradigm for More General AI-generated Image Detection
- Any-Resolution AI-Generated Image Detection by Spectral Learning
- Beyond Generation: A Diffusion-based Low-level Feature Extractor for Detecting AI-generated Images
- Where's the Liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content
- Secret Lies in Color: Enhancing AI-Generated Images Detection with Color Distribution Analysis
- Forgery Detection
- Forged Video Detection
39.Vision Transformers
- Split Adaptation for Pre-trained Vision Transformers
:star:code - BHViT: Binarized Hybrid Vision Transformer
- VGGT: Visual Geometry Grounded Transformer
:star:code
:star:code - ERUPT: Efficient Rendering with Unposed Patch Transformer
- Spiking Transformer: Introducing Accurate Addition-Only Spiking Self-Attention for Transformer
- Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement
:star:code - Hypergraph Vision Transformers: Images are More than Nodes, More than Edges
- LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions
- Your Scale Factors are My Weapon: Targeted Bit-Flip Attacks on Vision Transformers via Scale Factor Manipulation
- Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers
- Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis
- SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers
- DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction
38.Dataset/Benchmark
- Benchmark
- MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
- Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
:star:code - Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion
:star:code - Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
:star:code - VinaBench: Benchmark for Faithful and Consistent Visual Narratives
- OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
- CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation
- OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
- EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark
- ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems
- Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation
- Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs
- FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding
- Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
- Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
- Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
- SMTPD: A New Benchmark for Temporal Prediction of Social Media Popularity
- PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction
- NSD-Imagery: A Benchmark Dataset for Extending fMRI Vision Decoding Methods to Mental Imagery
- From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing
- Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning
- RUBIK: A Structured Benchmark for Image Matching across Geometric Challenges
- HuPerFlow: A Comprehensive Benchmark for Human vs. Machine Motion Estimation Comparison
- OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
- Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding
- LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
- SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding
- Quad-Pixel Image Defocus Deblurring: A New Benchmark and Model
- MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval
- VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
- Dataset
- LiSu: A Dataset and Method for LiDAR Surface Normal Estimation
- HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
:star:code - MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps
:star:code - MultimodalStudio: A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities
- RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives
:star:code - ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate
:house:project - OpticalNet: An Optical Imaging Dataset and Benchmark Beyond the Diffraction Limit
- HD-EPIC: A Highly-Detailed Egocentric Video Dataset
- MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
- EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision
- BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
- RealEdit: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations
- CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
- GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities
- Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving
- Fish-Vista: A Multi-Purpose Dataset for Understanding & Identification of Traits from Images
- Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
- Automatic Spectral Calibration of Hyperspectral Images: Method, Dataset and Benchmark
- The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition
- CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools
- Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset
- SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
- M3GYM: A Large-Scale Multimodal Multi-view Multi-person Pose Dataset for Fitness Activity Understanding in Real-world Settings
- 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
- Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback
- Face
- Autonomous Driving
- HOI
- Visual-Text Anomaly Detection
- Dataset Distillation
- Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation
:star:code - Dataset Distillation with Neural Characteristic Function: A Minmax Perspective
- Enhancing Dataset Distillation via Non-Critical Region Refinement
:star:code - Hierarchical Features Matter: A Deep Exploration of Progressive Parameterization Method for Dataset Distillation
- OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation
- DELT: A Simple Diversity-driven EarlyLate Training for Dataset Distillation
- Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios
- Towards Universal Dataset Distillation via Task-Driven Diffusion
- Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory
- Distilling Long-tailed Datasets
- Data Augmentation
37.Sound
- SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding
- Learning to Highlight Audio by Watching Movies
:star:code - Seeing Speech and Sound: Distinguishing and Locating Audios in Visual Scenes
- Circumventing Shortcuts in Audio-visual Deepfake Detection Datasets with Unsupervised Learning
- UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing
- Supervising Sound Localization by In-the-wild Egomotion
- EchoTraffic: Enhancing Traffic Anomaly Understanding with Audio-Visual Insights
- Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
- Language-Guided Audio-Visual Learning for Long-Term Sports Assessment
- CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
- TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation
- Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds
- Animate and Sound an Image
- Sound Bridge: Associating Egocentric and Exocentric Videos via Audio Cues
- Video-Guided Foley Sound Generation with Multimodal Controls
- Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes
- VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation
- DistinctAD: Distinctive Audio Description Generation in Contexts
- Audio-Visual Segmentation
- SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes
- Revisiting Audio-Visual Segmentation with Vision-Centric Transformer
- Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
- Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
- Audio-Visual Localization
- Video-to-Audio
- Speech Transcription
- Music Production
- Video-Music
36.Vision-Language
- Synthetic Data is an Elegant GIFT for Continual Vision-Language Models
- Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
- Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
:star:code - GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks
- MMRL: Multi-Modal Representation Learning for Vision-Language Models
:star:code - DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models
:star:code - From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
- Hyperbolic Safety-Aware Vision-Language Models
:star:code - O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models
- MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation
:star:code - Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
- EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models
- Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
:star:code - Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks
- Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
:star:code - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
:star:code - It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data
:star:code - Taxonomy-Aware Evaluation of Vision-Language Models
- SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
- Assessing and Learning Alignment of Unimodal Vision and Language Models
- Dynamic Updates for Language Adaptation in Visual-Language Tracking
- Yo'Chameleon: Personalized Vision and Language Generation
- R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning
- LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
- Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
- F^3OCUS - Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics
- ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models
- SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments
- Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?
- SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
- TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models
- Conical Visual Concentration for Efficient Large Vision-Language Models
- DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
- Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves
- Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents
- Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
- Once-Tuning-Multiple-Variants: Tuning Once and Expanded as Multiple Vision-Language Model Variants
- Post-pre-training for Modality Alignment in Vision-Language Foundation Models
- Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
- Joint Vision-Language Social Bias Removal for CLIP
- SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
- Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
- SLADE: Shielding against Dual Exploits in Large Vision-Language Models
- HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
- DH-Set: Improving Vision-Language Alignment with Diverse and Hybrid Set-Embeddings Learning
- Task-Aware Clustering for Prompting Vision-Language Models
- MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking
- Adaptive Parameter Selection for Tuning Vision-Language Models
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent
- ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
- ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models
- Vision-Language Models Do Not Understand Negation
- CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
- HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
- Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
- Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
- MEET: Towards Memory-Efficient Temporal Sparse Deep Neural Networks
- Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
- Mamba-Reg: Vision Mamba Also Needs Registers
- Reproducible Vision-Language Models Meet Concepts Out of Pre-Training
- Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
- Bayesian Test-Time Adaptation for Vision-Language Models
- Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
- NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
- Towards Understanding How Knowledge Evolves in Large Vision-Language Models
- Evaluating Vision-Language Models as Evaluators in Path Planning
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
- Self-Evolving Visual Concept Library using Vision-Language Critics
- Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift
- On the Zero-shot Adversarial Robustness of Vision-Language Models: A Truly Zero-shot and Training-free Approach
- ResCLIP: Residual Attention for Training-free Dense Vision-language Inference
- Realistic Test-Time Adaptation of Vision-Language Models
- PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models
- What's in the Image? A Deep-Dive into the Vision of Vision Language Models
- Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
- VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
- Seeing the Abstract: Translating the Abstract Language for Vision Language Models
- VisionZip: Longer is Better but Not Necessary in Vision Language Models
- FastVLM: Efficient Vision Encoding for Vision Language Models
- COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
- HalLoc: Token-level Localization of Hallucinations for Vision Language Models
- Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
- Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
- Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts
- Flexible Frame Selection for Efficient Video Reasoning
- Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
- Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events
- The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
- LLM
- PAVE: Patching and Adapting Video Large Language Models
:star:code - Empowering Large Language Models with 3D Situation Awareness
- Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
- Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects
- Empowering LLMs to Understand and Generate Complex Vector Graphics
- All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
- 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
- FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training
- Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding
- StoryGPT-V: Large Language Models as Consistent Story Visualizers
- Font-Agent: Enhancing Font Understanding with Large Language Models
- ChatGarment: Garment Estimation, Generation and Editing via Large Language Models
- CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation
- STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training
- MLLM
- Efficient Motion-Aware Video MLLM
- MP-GUI: Modality Perception with MLLMs for GUI Understanding
- AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models
- Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
:star:code - 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
:star:code - UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
- LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models
- Scaling Vision Pre-Training to 4K Resolution
:star:code - Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization
- AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
- SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
:star:code - Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
- HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator
- MLLM-as-a-Judge for Image Safety without Human Labeling
- BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
- Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?
- ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
- Cross-modal Information Flow in Multimodal Large Language Models
- S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation
- SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
- RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models
- Bridging Modalities: Improving Universal Multimodal Retrieval by Multimodal Large Language Models
- LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
- FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
- Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning
- EventGPT: Event Stream Understanding with Multimodal Large Language Models
- BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
- ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models
- Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
- The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique Like Photographers
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
- Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models
- PEACE: Empowering Geologic Map Holistic Understanding with MLLMs
- Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
- VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
- Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
- From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
- AdaDARE-gamma: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation
- Distraction is All You Need for Multimodal Large Language Model Jailbreaking
- Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
- Referring Expression Comprehension
- VLM
- Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels
- A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
- VladVA: Discriminative Fine-tuning of LVLMs
- Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding
- DocVLM: Make Your VLM an Efficient Reader
- Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
35.Self-Supervised (Supervision)
- Self-supervised
- When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning
:star:code - Sonata: Self-Supervised Learning of Reliable Point Representations
:house:project - Invisible Backdoor Attack against Self-supervised Learning
- Probing the Mid-level Vision Capabilities of Self-Supervised Learning
- Generalized Recorrupted-to-Recorrupted: Self-Supervised Learning Beyond Gaussian Noise
- TAGA: Self-supervised Learning for Template-free Animatable Gaussian Articulated Model
- Semi-supervised
- Mind the Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning Across Pseudo-Mismatch
:star:code - Language-Assisted Debiasing and Smoothing for Foundation Model-Based Semi-Supervised Learning
- CGMatch: A Different Perspective of Semi-supervised Learning
- CLIP-driven Coarse-to-fine Semantic Guidance for Fine-grained Open-set Semi-supervised Learning
- A Unified Framework for Heterogeneous Semi-supervised Learning
- Learning Textual Prompts for Open-World Semi-Supervised Learning
34.Neural Architecture Search
- Subnet-Aware Dynamic Supernet Training for Neural Architecture Search
- Training-free Neural Architecture Search through Variance of Knowledge of Deep Network Weights
- L-SWAG: Layer-Sample Wise Activation with Gradients Information for Zero-Shot NAS on Vision Transformers
33.MC/KD/Pruning (Model Compression/Knowledge Distillation/Pruning)
- KD
- Temporal Separation with Entropy Regularization for Knowledge Distillation in Spiking Neural Networks
- CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation
- MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection
- U-Know-DiffPAN: An Uncertainty-aware Knowledge Distillation Diffusion Framework with Details Enhancement for PAN-Sharpening
- MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
- VL2Lite: Task-Specific Knowledge Distillation from Large Vision-Language Models to Lightweight Networks
- DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture
- What Makes a Good Dataset for Knowledge Distillation?
- Pruning
- TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
- PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
- ATP: Adaptive Threshold Pruning for Efficient Data Encoding in Quantum Neural Networks
- Libra-Merging: Importance-redundancy and Pruning-merging Trade-off for Acceleration Plug-in in Large Vision-Language Model
- ICP: Immediate Compensation Pruning for Mid-to-high Sparsity
- ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
- DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
- Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
- Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression
- Flexible Group Count Enables Hassle-Free Structured Pruning
- MDP: Multidimensional Vision Model Pruning with Latency Constraint
- Quantization
- MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
:star:code - APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers
- Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers
- Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning
- MBQ: Modality-Balanced Quantization for Large Vision-Language Models
- Style Quantization for Data-Efficient GAN Training
- Quantization without Tears
- FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation
- Enhancing Diversity for Data-free Quantization
- PillarHist: A Quantization-aware Pillar Feature Encoder based on Height-aware Histogram
- MC
- DeepCompress-ViT: Rethinking Model Compression to Enhance Efficiency of Vision Transformers at the Edge
- Random Conditioning for Diffusion Model Compression with Distillation
- Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
:star:code - CASP: Compression of Large Multimodal Models Based on Attention Sparsity
- DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
- VoCo-LLaMA: Towards Vision Compression with Large Language Models
- 4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video
- Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers
- Zero-shot 3D Question Answering via Voxel-based Dynamic Token Compression
- Model enhancement
32.Machine Learning
- Machine unlearning
- Continual learning
- Do Your Best and Get Enough Rest for Continual Learning
:star:code - KAC: Kolmogorov-Arnold Classifier for Continual Learning
:star:code - Language Guided Concept Bottleneck Models for Interpretable Continual Learning
:star:code - Advancing Multiple Instance Learning with Continual Learning for Whole Slide Imaging
- Ferret: An Efficient Online Continual Learning Framework under Varying Memory Constraints
- Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning
- BiLoRA: Almost-Orthogonal Parameter Spaces for Continual Learning
- Online Task-Free Continual Learning via Dynamic Expansionable Memory Distribution
- Enhancing Online Continual Learning with Plug-and-Play State Space Model and Class-Conditional Mixture of Discretization
- LoRA Subtraction for Drift-Resistant Space in Exemplar-Free Continual Learning
- Handling Spatial-Temporal Data Heterogeneity for Federated Continual Learning via Tail Anchor
- Reinforcement learning
- Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
- Automated Proof of Polynomial Inequalities via Reinforcement Learning
- VLMs-Guided Representation Distillation for Efficient Vision-Based Reinforcement Learning
- Stabilizing and Accelerating Autofocus with Expert Trajectory Regularized Deep Reinforcement Learning
- Neural Motion Simulator Pushing the Limit of World Models in Reinforcement Learning
- Federated learning
- Federated Learning with Domain Shift Eraser
- Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning
:star:code - FedCALM: Conflict-aware Layer-wise Mitigation for Selective Aggregation in Deeper Personalized Federated Learning
- FedMIA: An Effective Membership Inference Attack Exploiting "All for One" Principle in Federated Learning
- FedAWA: Adaptive Optimization of Aggregation Weights in Federated Learning Using Client Vectors
- Subspace Constraint and Contribution Estimation for Heterogeneous Federated Learning
- Beyond Local Sharpness: Communication-Efficient Global Sharpness-aware Minimization for Federated Learning
- Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection
- FedCS: Coreset Selection for Federated Learning
- Population Normalization for Federated Learning
- Model Poisoning Attacks to Federated Learning via Multi-Round Consistency
- AFL: A Single-Round Analytic Approach for Federated Learning with Pre-trained Models
- A Simple Data Augmentation for Feature Distribution Skewed Federated Learning
- Fortifying Federated Learning Towards Trustworthiness via Auditable Data Valuation and Verifiable Client Contribution
- FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models
- Infighting in the Dark: Multi-Label Backdoor Attack in Federated Learning
- Active learning
- Class-incremental learning
- Task-Agnostic Guided Feature Expansion for Class-Incremental Learning
:star:code - Tripartite Weight-Space Ensemble for Few-Shot Class-Incremental Learning
- SEC-Prompt: SEmantic Complementary Prompting for Few-Shot Class-Incremental Learning
- Enhancing Few-Shot Class-Incremental Learning via Training-Free Bi-Level Modality Calibration
- Multi-Granularity Class Prototype Topology Distillation for Class-Incremental Source-Free Unsupervised Domain Adaptation
- T-CIL: Temperature Scaling using Adversarial Perturbation for Calibration in Class-Incremental Learning
- CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning
- Knowledge Memorization and Rumination for Pre-trained Model-based Class-Incremental Learning
- Adapter Merging with Centroid Prototype Mapping for Scalable Class-Incremental Learning
- pFedMxF: Personalized Federated Class-Incremental Learning with Mixture of Frequency Aggregation
- Learning Conditional Space-Time Prompt Distributions for Video Class-Incremental Learning
- Attraction Diminishing and Distributing for Few-Shot Class-Incremental Learning
- Order-Robust Class Incremental Learning: Graph-Driven Dynamic Similarity Grouping
- Dynamic Integration of Task-Specific Adapters for Class Incremental Learning
- Activating Sparse Part Concepts for 3D Class Incremental Learning
- Adversarial
- GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
- CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
:star:code - Towards Effective and Sparse Adversarial Attack on Spiking Neural Networks via Breaking Invisible Surrogate Gradients
:star:code - Weakly Supervised Contrastive Adversarial Training for Learning Robust Features from Semi-supervised Data
- Anyattack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models
- Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attack on Breast Ultrasound Images
- MOS-Attack: A Scalable Multi-objective Adversarial Attack Framework
- Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis
- Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
- DEAL: Data-Efficient Adversarial Learning for High-Quality Infrared Imaging
- Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks
- Instant Adversarial Purification with Adversarial Consistency Distillation
- Seeing is Not Believing: Adversarial Natural Object Optimization for Hard-Label 3D Scene Attacks
- Saliuitl: Ensemble Salience Guided Recovery of Adversarial Patches against CNNs
- RAEncoder: A Label-Free Reversible Adversarial Examples Encoder for Dataset Intellectual Property Protection
- NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
- Rethinking the Adversarial Robustness of Multi-Exit Neural Networks in an Attack-Defense Game
- PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches
- OCL
- Multi-task learning
- Decouple-Then-Merge: Finetune Diffusion Models as Multi-Task Learning
- Joint Scheduling of Causal Prompts and Tasks for Multi-Task Learning
- MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception
- Identifying and Mitigating Spurious Correlation in Multi-Task Learning
- Towards Consistent Multi-Task Learning: Unlocking the Potential of Task-Specific Parameters
- TADFormer: Task-Adaptive Dynamic TransFormer for Efficient Multi-Task Learning
- Multi-label learning
- Incremental learning
- Low-Rank Adaptation in Multilinear Operator Networks for Security-Preserving Incremental Learning
- Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning
- Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need
- Reducing Class-wise Confusion for Incremental Learning with Disentangled Manifolds
- Metric learning
- Affordance learning
- Prompt learning
- Contrastive learning
31.Robot Navigation/SLAM
- VR
- Virtual try-on
- VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction
:star:code - Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction
- ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On
:star:code - Robust-MVTON: Learning Cross-Pose Feature Alignment and Fusion for Robust Multi-View Virtual Try-On
- Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling
- BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training
- Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model
- VTON-HandFit: Virtual Try-on for Arbitrary Hand Pose Guided by Hand Priors Embedding
- AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
- Robotics
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
- Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction
:star:code - A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning
:star:code - Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation
- Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation
- DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness
- DynScene: Scalable Generation of Dynamic Robotic Manipulation Scenes for Embodied AI
- Prof. Robot: Differentiable Robot Rendering Without Static and Self-Collisions
- RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training
- RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
- MobileH2R: Learning Generalizable Human to Mobile Robot Handover Exclusively from Scalable and Diverse Synthetic Data
- RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments
- Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision
- Lift3D Policy: Lifting 2D Foundation Models for Robust 3D Robotic Manipulation
- RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
- PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
- RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
- Tartan IMU: A Light Foundation Model for Inertial Positioning in Robotics
- Robotic Visual Instruction
- OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
- Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation
- Spatial-Temporal Graph Diffusion Policy with Kinematic Modeling for Bimanual Robotic Manipulation
- RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
- UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping
- Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
- FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation
- ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping
- TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation
- Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
- PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation
- Visual localization
- Scene-agnostic Pose Regression for Visual Localization
:star:code - Gaussian Splatting Feature Fields for (Privacy-Preserving) Visual Localization
- GPVK-VL: Geometry-Preserving Virtual Keyframes for Visual Localization under Large Viewpoint Changes
- Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization
- R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization
- Place recognition
- Hand-object interaction/grasping
- EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild
- UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation
- ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping
- HOIGPT: Learning Long-Sequence Hand-Object Interaction with Language Models
- Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model
- LatentHOI: On the Generalizable Hand Object Motion Generation with Latent Hand Diffusion
- Hand-held Object Reconstruction from RGB Video with Dynamic Interaction
- Avatars
- AvatarArtist: Open-Domain 4D Avatarization
- FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video
:house:project - FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images
:star:code - RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars
- GPAvatar: High-fidelity Head Avatars by Learning Efficient Gaussian Projections
- CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models
- Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs
- HERA: Hybrid Explicit Representation for Ultra-Realistic Head Avatars
- SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing
- WildAvatar: Learning In-the-wild 3D Avatars from the Web
- Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via ID Guidance
- 3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations
- LUCAS: Layered Universal Codec Avatars
- MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices
- GASP: Gaussian Avatars with Synthetic Priors
- Synthetic Prior for Few-Shot Drivable Head Avatar Inversion
- HRAvatar: High-Quality and Relightable Gaussian Head Avatar
- FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation
- Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-constrained Gaussian Splatting
- Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior
- FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video
- MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing
- GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion
- Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture
- VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
- DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh
- AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction
- GeoAvatar: Geometrically-Consistent Multi-Person Avatar Reconstruction from Sparse Multi-View Videos
- EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting
- SLAM
- VLN
30.Gaze Estimation
- GA3CE: Unconstrained 3D Gaze Estimation with Gaze-Aware 3D Context Encoding
:star:code - FIFA: Fine-grained Inter-frame Attention for Driver's Video Gaze Estimation
- 3D Prior Is All You Need: Cross-Task Few-shot 2D Gaze Estimation
- De^2Gaze: Deformable and Decoupled Representation Learning for 3D Gaze Estimation
- Enhancing 3D Gaze Estimation in the Wild using Weak Supervision with Gaze Following Labels
- Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities
- GazeGene: Large-scale Synthetic Gaze Dataset with 3D Eyeball Annotations
- Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual Foraging
- Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
29.Scene Flow Estimation
- Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation
- VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow
:star:code - STCOcc: Sparse Spatial-Temporal Cascade Renovation for 3D Occupancy and Scene Flow Prediction
- Zero-Shot Monocular Scene Flow Estimation in the Wild
- SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow
28.Optical Flow Estimation
- DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework
:star:code - EDCFlow: Exploring Temporally Dense Difference Maps for Event-based Optical Flow Estimation
- Shape and Texture: What Influences Reliable Optical Flow Estimation?
- Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow
- Multi-Modal Synergistic Implicit Image Enhancement for Efficient Optical Flow Estimation
27.Scene Graph Generation
- Universal Scene Graph Generation
- Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
- Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing
- DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation
:star:code - Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation
- Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces
:star:code - Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation
- HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
- Navigating the Unseen: Zero-shot Scene Graph Generation via Capsule-Based Equivariant Features
- Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation
26.Style Transfer
- OmniStyle: Filtering High Quality Style Transfer Data at Scale
- SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer
- Geometry in Style: 3D Stylization via Surface Normal Deformation
:star:code - StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer
- HSI: A Holistic Style Injector for Arbitrary Style Transfer
- SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer
- SGSST: Scaling Gaussian Splatting Style Transfer
- StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
- PTDiffusion: Free Lunch for Generating Optical Illusion Hidden Pictures with Phase-Transferred Diffusion Model
- Attention Distillation: A Unified Approach to Visual Characteristics Transfer
:star:code - Efficient Transfer Learning for Video-language Foundation Models
- Motion Transfer
25.GAN/Image Synthesis(Image Generation)
- Z-Magic: Zero-shot Multiple Attributes Guided Image Creator
- TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing
:star:code - AnyMoLe: Any Character Motion In-betweening Leveraging Video Diffusion Models
- ODA-GAN: Orthogonal Decoupling Alignment GAN Assisted by Weakly-supervised Learning for Virtual Immunohistochemistry Staining
- Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models
- Scaling Mesh Generation via Compressive Tokenization
- CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner
- Mimir: Improving Video Diffusion Models for Precise Text Understanding
- AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide
- Towards Precise Scaling Laws for Video Diffusion Transformers
- WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
- Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
- Articulated Kinematics Distillation from Video Diffusion Models
- Improved Video VAE for Latent Video Diffusion Model
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
- InterDyn: Controllable Interactive Dynamics with Video Diffusion Models
- Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation
- GAN
- Diffusion Models
- ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models
:star:code - Probability Density Geodesics in Image Diffusion Latent Space
- PCM : Picard Consistency Model for Fast Parallel Sampling of Diffusion Models
- MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World
:star:code - Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability
:star:code - CacheQuant: Comprehensively Accelerated Diffusion Models
:star:code - Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models
- Personalized Preference Fine-tuning of Diffusion Models
- APT: Adaptive Personalized Training for Diffusion Models with Limited Data
- Erasing Undesirable Influence in Diffusion Models
- LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
- SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model
- TKG-DM: Training-free Chroma Key Content Generation Diffusion Model
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
- Scaling Properties of Diffusion Models For Perceptual Tasks
- Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability
- Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization
- SemanticDraw: Towards Real-Time Interactive Content Creation from Image Diffusion Models
- Efficient Fine-Tuning and Concept Suppression for Pruned Diffusion Models
- Hyperspectral Pansharpening via Diffusion Models with Iteratively Zero-Shot Guidance
- Scaling Inference Time Compute for Diffusion Models
- Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models
- Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models
- InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment
- Efficient Personalization of Quantized Diffusion Model without Backpropagation
- Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
- Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training
- Image Editing
- FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model
:star:code - Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy
- Reference-Based 3D-Aware Image Editing with Triplanes
- MoEdit: On Learning Quantity Perception for Multi-object Image Editing
- Text-Driven Fashion Image Editing with Compositional Concept Learning and Counterfactual Abduction
- Dragin3D: Image Editing by Dragging in 3D Space
- Towards Scalable Human-aligned Benchmark for Text-guided Image Editing
- AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
- Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing
- MagicQuill: An Intelligent Interactive Image Editing System
- InsightEdit: Towards Better Instruction Following for Image Editing
- Concept Lancet: Image Editing with Compositional Representation Transplant
- Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing
- Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
- PS-Diffusion: Photorealistic Subject-Driven Image Editing with Disentangled Control and Attention
- FeedEdit: Text-Based Image Editing with Dynamic Feedback Regulation
- Visual Representation Learning through Causal Intervention for Controllable Image Editing
- Stable Flow: Vital Layers for Training-Free Image Editing
- PhyS-EdiT: Physics-aware Semantic Image Editing with Text Description
- FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing
- SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
- Rashomon Sets for Prototypical-Part Networks: Editing Interpretable Models in Real-Time
- Poster Generation
- Image Synthesis
- Multi-focal Conditioned Latent Diffusion for Person Image Synthesis
:star:code - Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models
- Science-T2I: Addressing Scientific Illusions in Image Synthesis
- Consistency Posterior Sampling for Diverse Image Synthesis
- Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
- Training-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis
- Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis
- Anatomical Consistency and Adaptive Prior-informed Transformation for Multi-contrast MR Image Synthesis via Diffusion Model
- 3D Generation
- DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation
- MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation
- 3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
- SKDream: Controllable Multi-view and 3D Generation with Arbitrary Skeletons
- PERSE: Personalized 3D Generative Avatars from A Single Portrait
- Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
- Hash3D: Training-free Acceleration for 3D Generation
- Hierarchical Gaussian Mixture Model Splatting for Efficient and Part Controllable 3D Generation
- Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation
- Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation
- Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene
- ARM: Appearance Reconstruction Model for Relightable 3D Generation
- SeaLion: Semantic Part-Aware Latent Point Diffusion Models for 3D Generation
- Structured 3D Latents for Scalable and Versatile 3D Generation
- PartGen: Part-level 3D Generation and Reconstruction with Multi-view Diffusion Models
- Image Generation
- DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models
- Zero-Shot Styled Text Image Generation, but Make It Autoregressive
- FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion
- DreamOmni: Unified Image Generation and Editing
- UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
- Controllable Human Image Generation with Personalized Multi-Garments
- IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation
- ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
- GPS as a Control Signal for Image Generation
- Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation
- Devil is in the Detail: Towards Injecting Fine Details of Image Prompt in Image Generation via Conflict-free Guidance and Stratified Attention
- TFCustom: Customized Image Generation with Time-Aware Frequency Feature Guidance
- HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation
- Image Generation Diversity Issues and How to Tame Them
- T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
- DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching
- Dual Diffusion for Unified Image Generation and Understanding
- Learning Flow Fields in Attention for Controllable Person Image Generation
- Improving Editability in Image Generation with Layer-wise Memory
- ZoomLDM: Latent Diffusion Model for Multi-scale Image Generation
- D^2iT: Dynamic Diffusion Transformer for Accurate Image Generation
- Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation
- Let's Verify and Reinforce Image Generation Step by Step
- PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation
- Diffusion Self-Distillation for Zero-Shot Customized Image Generation
- SerialGen: Personalized Image Generation by First Standardization Then Personalization
- Boost Your Human Image Generation Model via Direct Preference Optimization
- FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation
- Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation
- UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation
- Not Just Text: Uncovering Vision Modality Typographic Threats in Image Generation Models
- OmniGen: Unified Image Generation
- Spherical Manifold Guided Diffusion Model for Panoramic Image Generation
- Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
- Image-to-Video
- Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think
- I2VGuard: Safeguarding Images against Misuse in Diffusion-based Image-to-Video Models
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
- LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
- MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation
- MotionPro: A Precise Motion Controller for Image-to-Video Generation
- Text-to-Image
- Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models
:star:code - Compass Control: Multi Object Orientation Control for Text-to-Image Generation
:star:code - ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation
- DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation
:star:code - Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation
- Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization
- Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models
- Scaling Down Text Encoders of Text-to-Image Diffusion Models
- Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis
- Implicit Bias Injection Attacks against Text-to-Image Diffusion Models
:star:code - VODiff: Controlling Object Visibility Order in Text-to-Image Generation
- PreciseCam: Precise Camera Control for Text-to-Image Generation
- CoSER: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation
- Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
- Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
- Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects
- Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
- ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
- MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation
- Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation
- A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
- Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control
- Rethinking Training for De-biasing Text-to-Image Generation: Unlocking the Potential of Stable Diffusion
- Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation
- Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation
- SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation
- Type-R: Automatically Retouching Typos for Text-to-Image Generation
- Make It Count: Text-to-Image Generation with an Accurate Number of Objects
- Minority-Focused Text-to-Image Generation via Prompt Optimization
- STEPS: Sequential Probability Tensor Estimation for Text-to-Image Hard Prompt Search
- STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models
- Multi-Group Proportional Representations for Text-to-Image Models
- ACE: Anti-Editing Concept Erasure in Text-to-Image Models
- SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
- The Illusion of Unlearning: The Unstable Nature of Machine Unlearning in Text-to-Image Diffusion Models
- Six-CD: Benchmarking Concept Removals for Text-to-image Diffusion Models
- One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models
- Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps
- Text-to-Video
- Can Text-to-Video Generation help Video-Language Alignment?
:star:code - VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models
:star:code - EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation
- LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
- AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM
- ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
- PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
- TransPixeler: Advancing Text-to-Video Generation with Transparency
- OSV: One Step is Enough for High-Quality Image to Video Generation
- InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
- The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
- BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations
- Identity-Preserving Text-to-Video Generation by Frequency Decomposition
- Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis
- Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
- Video Synthesis
- MaskDiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation
:star:code - SketchVideo: Sketch-based Video Generation and Editing
- One-Minute Video Generation with Test-Time Training
:star:code - Video-Bench: Human-Aligned Video Generation Benchmark
- GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
:house:project - AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion
- Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
- VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
- VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
- Tora: Trajectory-oriented Diffusion Transformer for Video Generation
- Pathways on the Image Manifold: Image Editing via Video Generation
- STDD: Spatio-Temporal Dual Diffusion for Video Generation
- TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation
- Mind the Time: Temporally-Controlled Multi-Event Video Generation
- FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis
- Motion Prompting: Controlling Video Generation with Motion Trajectories
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
- DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
- IM-Zero: Instance-level Motion Controllable Video Generation in a Zero-shot Manner
- ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models
- DriveScape: High-Resolution Driving Video Generation by Multi-View Feature Fusion
- LongDiff: Training-Free Long Video Generation in One Go
- GS-DiT: Advancing Video Generation with Dynamic 3D Gaussian Fields through Efficient Dense 3D Point Tracking
- SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input
- MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling
- FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis
- Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency
- Co-Speech Gesture Video Generation with Implicit Motion-Audio Entanglement
- AKiRa: Augmentation Kit on Rays for Optical Video Generation
- Taming Teacher Forcing for Masked Autoregressive Video Generation
- VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors
- Optical-Flow Guided Prompt Optimization for Coherent Video Generation
- Goku: Flow Based Video Generative Foundation Models
- Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
- Multi-subject Open-set Personalization in Video Generation
- DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
- MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
- OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation
- Generative Inbetweening through Frame-wise Conditions-Driven Video Generation
- AnimateAnything: Consistent and Controllable Animation for Video Generation
- Audio-Driven Human Video Synthesis
- Video Stylization
- Text-to-Mesh
- Video Editing
- Visual Prompting for One-shot Controllable Video Editing without Inversion
- VEU-Bench: Towards Comprehensive Understanding of Video Editing
- FADE: Frequency-Aware Diffusion Model Factorization for Video Editing
- Unity in Diversity: Video Editing via Gradient-Latent Purification
- VideoDirector: Precise Video Editing via Text-to-Video Models
- Align-A-Video: Deterministic Reward Tuning of Image Diffusion Models for Consistent Video Editing
- VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing
- Image-to-Image Translation
- Text-to-3D
- Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation
- Turbo3D: Ultra-fast Text-to-3D Generation
- CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians
- Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation
- MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation
- Text-to-Motion
- Any-to-Any
- Editing
- Image Cropping
- Layout Generation
- Video-to-Text
- Image Vectorization
24.Video
- VITED: Video Temporal Evidence Distillation
- LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better
- Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
:star:code - Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations
:star:code - VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
- LATTE-MV: Learning to Anticipate Table Tennis Hits from Monocular Videos
- Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
:house:project - Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
- VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
- On the Consistency of Video Large Language Models in Temporal Comprehension
- Augmented Deep Contexts for Spatially Embedded Video Coding
- SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction
- Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
- Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
- Learning Temporally Consistent Video Depth from Video Diffusion Priors
- Video Surveillance
- Video Understanding
- Adaptive Keyframe Sampling for Long Video Understanding
:star:code - HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
- VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
:star:code - BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
:star:code - SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
- ReWind: Understanding Long Videos with Instructed Learnable Memory
- DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos
- MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
- Understanding Multi-Task Activities from Single-Task Videos
- ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
- VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
- VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
- DrVideo: Document Retrieval Based Long Video Understanding
- Re-thinking Temporal Search for Long-Form Video Understanding
- DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding
- Towards Universal Soccer Video Understanding
- Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
- Apollo: An Exploration of Video Understanding in Large Multimodal Models
- MLVU: Benchmarking Multi-task Long Video Understanding
- MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
- M-LLM Based Video Frame Selection for Efficient Video Understanding
- Online Video Understanding: OVBench and VideoChat-Online
- STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
- OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
- VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding
- Video Frame Interpolation
- EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
- Hierarchical Flow Diffusion for Efficient Frame Interpolation
:star:code - Explicit Depth-Aware Blurry Video Frame Interpolation Guided by Differential Curves
- TimeTracker: Event-based Continuous Point Tracking for Video Frame Interpolation with Non-linear Motion
- BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions
- Video Decomposition
- VAD
- VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models
- Just Dance with pi! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection
- Noise-Resistant Video Anomaly Detection via RGB Error-Guided Multiscale Predictive Coding and Dynamic Memory
- Anomize: Better Open Vocabulary Video Anomaly Detection
- Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity
- Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline
- Video Analysis
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
- Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
- VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
- Video Summarization
- Video Recognition
- Video Classification
23.OCR
- CLIP is Almost All You Need: Towards Parameter-Efficient Scene Text Retrieval without OCR
- SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
- DreamText: High Fidelity Scene Text Synthesis
- RoomPainter: View-Integrated Diffusion for Consistent Indoor Scene Texturing
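- Scene Text Recognition
- Scene Text Editing
- Handwritten Text Recognition
- Document Understanding
- Formula Recognition
22.3D(3D Reconstruction/3D Vision)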
- CADDreamer: CAD object Generation from Single-view Images
- Leveraging 3D Geometric Priors in 2D Rotation Symmetry Detection
- HoGS: Unified Near and Far Object Reconstruction via Homogeneous Gaussian Splatting
:star:code - PhysGen3D: Crafting a Miniature Interactive World from a Single Image
:star:code - Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence
:star:code - On-Device Self-Supervised Learning of Low-Latency Monocular Depth from Only Events
- Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space
- FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts
- GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
- FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
- High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model
- Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects
- SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
- Material Anything: Generating Materials for Any 3D Object via Diffusion
- One-shot 3D Object Canonicalization based on Geometric and Semantic Consistency
- SemAlign3D: Semantic Correspondence between RGB-Images through Aligning 3D Object-Class Representations
:house:project - Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
:house:project - UrbanCAD: Towards Highly Controllable and Photorealistic 3D Vehicles for Urban Scene Simulation
- SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
- HiMoR: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation
:star:code - UniK3D: Universal Camera Monocular 3D Estimation
- Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation
- DepthCues: Evaluating Monocular Depth Perception in Large Vision Models
- Mono3DVLT: Monocular-Video-Based 3D Visual Language Tracking
- Neuro-3D: Towards 3D Visual Decoding from EEG Signals
- SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding
- ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
- ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts
- PrEditor3D: Fast and Precise 3D Shape Editing
- Robust 3D Shape Reconstruction in Zero-Shot from a Single Image in the Wild
- Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders
- DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction
- PyTorchGeoNodes: Enabling Differentiable Shape Programs for 3D Shape Reconstruction
- CrossOver: 3D Scene Cross-Modal Alignment
- Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation
- StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
- Dense Dispersed Structured Light for Hyperspectral 3D Imaging of Dynamic Scenes
- 3DGS
- SOGS: Second-Order Anchor for Advanced 3D Gaussian Splatting
- NexusSplats: Efficient 3D Gaussian Splatting in the Wild
- S2Gaussian: Sparse-View Super-Resolution 3D Gaussian Splatting
- DoF-Gaussian: Controllable Depth-of-Field for 3D Gaussian Splatting
:star:code - DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds
:star:code - Mitigating Ambiguities in 3D Classification with Gaussian Splatting
- Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs
:star:code - BARD-GS: Blur-Aware Reconstruction of Dynamic Scenes via Gaussian Splatting
:star:code - GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping
:star:code - GaussianLSS -- Toward Real-world BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting
- Luminance-GS: Adapting 3D Gaussian Splatting to Challenging Lighting Conditions with View-Adaptive Curve Adjustment
:star:code - EnliveningGS: Active Locomotion of 3DGS
- POp-GS: Next Best View in 3D-Gaussian Splatting with P-Optimality
- FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting
- iSegMan: Interactive Segment-and-Manipulate 3D Gaussians
:star:code - 3D-HGS: 3D Half-Gaussian Splatting
- Generative Gaussian Splatting for Unbounded 3D City Generation
- FlexGS: Train Once, Deploy Everywhere with Many-in-One Flexible 3D Gaussian Splatting
- MaskGaussian: Adaptive 3D Gaussian Representation from Probabilistic Masks
- SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video
- Towards Realistic Example-based Modeling via 3D Gaussian Stitching
- 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency
- RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos
- Volumetrically Consistent 3D Gaussian Rasterization
- Morpheus: Text-Driven 3D Gaussian Splat Shape and Color Stylization
- SpecTRe-GS: Modeling Highly Specular Surfaces with Reflected Nearby Objects by Tracing Rays in 3D Gaussian Splatting
- FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering
- HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting
- ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting
- GaussianSpa: An "Optimizing-Sparsifying" Simplification Framework for Compact and High-Quality 3D Gaussian Splatting
- Steepest Descent Density Control for Compact 3D Gaussian Splatting
- GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting
- PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting
- SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis
- Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes
- Efficient Decoupled Feature 3D Gaussian Splatting via Hierarchical Compression
- EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting
- CTRL-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion
- HyperGS: Hyperspectral 3D Gaussian Splatting
- OmniSplat: Taming Feed-Forward 3D Gaussian Splatting for Omnidirectional Images with Editable Capabilities
- BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting
- SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting
- Chain of Semantics Programming in 3D Gaussian Splatting Representation for 3D Vision Grounding
- EventSplat: 3D Gaussian Splatting from Moving Event Cameras for Real-time Rendering
- SfM-Free 3D Gaussian Splatting via Hierarchical Training
- MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models
- Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration
- UVGS: Reimagining Unstructured 3D Gaussian Splatting using UV Mapping
- MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
- TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting
- EAP-GS: Efficient Augmentation of Pointcloud for 3D Gaussian Splatting in Few-shot Scene Reconstruction
- DOF-GS: Adjustable Depth-of-Field 3D Gaussian Splatting for Post-Capture Refocusing, Defocus Rendering and Blur Removal
- Speedy-Splat: Fast 3D Gaussian Splatting with Sparse Pixels and Sparse Primitives
- Stereo Matching
- 3D Reconstruction
- M3D: Dual-Stream Selective State Spaces and Depth-Driven Framework for High-Fidelity Single-View 3D Reconstruction
- MESC-3D: Mining Effective Semantic Cues for 3D Reconstruction from a Single Image
:star:code - MUSt3R: Multi-view Network for Stereo 3D Reconstruction
- Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
- FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video
:house:project - Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors
:house:project - Thin-Shell-SfT: Fine-Grained Monocular Non-rigid 3D Surface Tracking with Neural Deformation Fields
:house:project - Glossy Object Reconstruction with Cost-effective Polarized Acquisition
- CrossSDF: 3D Reconstruction of Thin Structures From Cross-Sections
- Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction
- Towards In-the-wild 3D Plane Reconstruction from a Single Image
- MAC-Ego3D: Multi-Agent Gaussian Consensus for Real-Time Collaborative Ego-Motion and Photorealistic 3D Reconstruction
- Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
- ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos
- Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
- DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery
- SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction
- Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation
- V2V3D: View-to-View Denoised 3D Reconstruction for Light Field Microscopy
- MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors
- GaPT-DAR: Category-level Garments Pose Tracking via Integrated 2D Deformation and 3D Reconstruction
- A Lightweight UDF Learning Framework for 3D Reconstruction Based on Local Shape Functions
- MVBoost: Boost 3D Reconstruction with Multi-View Refinement
- SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes
- Shading Meets Motion: Self-supervised Indoor 3D Reconstruction Via Simultaneous Shape-from-Shading and Structure-from-Motion
- Learning Partonomic 3D Reconstruction from Image Collections
- AniGrad: Anisotropic Gradient-Adaptive Sampling for 3D Reconstruction From Monocular Video
- Depth Completion
- ProtoDepth: Unsupervised Continual Depth Completion with Prototypes
- SVDC: Consistent Direct Time-of-Flight Video Depth Completion with Frequency Selective Fusion
:star:code - Completion as Enhancement: A Degradation-Aware Selective Image Guided Network for Depth Completion
- Distilling Monocular Foundation Model for Fine-grained Depth Completion
- Depth Estimation
- Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
:star:code - QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge
:star:code - Blurry-Edges: Photon-Limited Depth Estimation from Defocused Boundaries
:house:project - Scalable Autoregressive Monocular Depth Estimation
- Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera
- Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors
- OmniStereo: Real-time Omnidirectional Depth Estimation with Multiview Fisheye Cameras
- Align3R: Aligned Monocular Depth Estimation for Dynamic Videos
- Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses
- Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
- GeoDepth: From Point-to-Depth to Plane-to-Depth Modeling for Self-Supervised Monocular Depth Estimation
- BLADE: Single-view Body Mesh Estimation through Accurate Depth Estimation
- Vision-Language Embodiment for Monocular Depth Estimation
- TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion
- Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
- HELVIPAD: A Real-World Dataset for Omnidirectional Stereo Depth Estimation
- Scene Understanding
- Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning
:star:code - Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
:star:code - Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding
:star:code - FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding
- LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
- Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding
- Embodied Scene Understanding for Vision Language Models via MetaVQA
- HUSH: Holistic Panoramic 3D Scene Understanding using Spherical Harmonics
- ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding
- Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
- Scene Reconstruction
- Decompositional Neural Scene Reconstruction with Generative Diffusion Prior
:star:code - Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration
:star:code - Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction
- NeRFPrior: Learning Neural Radiance Field as a Prior for Indoor Scene Reconstruction
:star:code - SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos
- Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting
- ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration
- IndoorGS: Geometric Cues Guided Gaussian Splatting for Indoor Scene Reconstruction
- MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
- FreeTimeGS: Free Gaussian Primitives at Anytime Anywhere for Dynamic Scene Reconstruction
- MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data
- Surface Reconstruction
- OffsetOPT: Explicit Surface Reconstruction without Normals
- ViiNeuS: Volumetric Initialization for Implicit Neural Surface Reconstruction of Urban Scenes with Limited Image Overlap
- PlanarSplatting: Accurate Planar Surface Reconstruction in 3 Minutes
- ProbeSDF: Light Field Probes For Neural Surface Reconstruction
- DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes
- PMNI: Pose-free Multi-view Normal Integration for Reflective and Textureless Surface Reconstruction
- Sparse2DGS: Geometry-Prioritized Gaussian Splatting for Surface Reconstruction from Sparse Views
- 3D Scene Synthesis
- Global-Local Tree Search for Language Guided 3D Scene Generation
:star:code - SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation
- MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
- Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model
- WonderWorld: Interactive 3D Scene Generation from a Single Image
- ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary
- Global-Local Tree Search in VLMs for 3D Indoor Scene Generation
- StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
- UniScene: Unified Occupancy-centric Driving Scene Generation
- 3D Hair
- Semantic Scene Completion
- 3D Scene Recovery
- Relative Pose Estimation
- Structure from Motion
- 3D Scenes
21.Point Cloud
- STAR-Edge: Structure-aware Local Spherical Curve Representation for Thin-walled Edge Extraction from Unstructured Point Clouds
- Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis
:star:code - Learning Bijective Surface Parameterization for Inferring Signed Distance Functions from Sparse Point Clouds with Grid Deformation
:star:code - PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning
:star:code - Cross-Modal 3D Representation with Multi-View Images and Point Clouds
- High-quality Point Cloud Oriented Normal Estimation via Hybrid Angular and Euclidean Distance Encoding
- DeepLA-Net: Very Deep Local Aggregation Networks for Point Cloud Analysis
- BWFormer: Building Wireframe Reconstruction from Airborne LiDAR Point Cloud with Transformer
- High-Fidelity Lightweight Mesh Reconstruction from Point Clouds
- WeatherGen: A Unified Diverse Weather Generator for LiDAR Point Clouds via Spider Mamba Diffusion
- TopNet: Transformer-Efficient Occupancy Prediction Network for Octree-Structured Point Cloud Geometry Compression
- Spectral Informed Mamba for Robust Point Cloud Processing
- SAMBLE: Shape-Specific Point Cloud Sampling for an Optimal Trade-Off Between Local Detail and Global Uniformity
- LeanGaussian: Breaking Pixel or Point Cloud Correspondence in Modeling 3D Gaussians
- SASep: Saliency-Aware Structured Separation of Geometry and Feature for Open Set Learning on Point Clouds
- Generalized Gaussian Entropy Model for Point Cloud Attribute Compression with Dynamic Likelihood Intervals
- Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds
- Sparse Point Cloud Patches Rendering via Splitting 2D Gaussians
- EdgeDiff: Edge-aware Diffusion Network for Building Reconstruction from Point Clouds
- NoPain: No-box Point Cloud Attack via Optimal Transport Singular Boundary
- DV-Matcher: Deformation-based Non-rigid Point Cloud Matching Guided by Pre-trained Visual Features
- PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds
- Point Cloud Upsampling Using Conditional Diffusion Module with Adaptive Noise Suppression
- Point Cloud Segmentation
- Point Cloud Registration
- Unlocking Generalization Power in LiDAR Point Cloud Registration
:star:code - AutoURDF: Unsupervised Robot Modeling from Point Cloud Frames Using Cluster Registration
- ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration
:star:code - Dual Focus-Attention Transformer for Robust Point Cloud Registration
- GraphI2P: Image-to-Point Cloud Registration with Exploring Pattern of Correspondence via Graph Learning
- Zero-shot RGB-D Point Cloud Registration with Pre-trained Large Vision Model
- HeMoRa: Unsupervised Heuristic Consensus Sampling for Robust Point Cloud Registration
- Implicit Correspondence Learning for Image-to-Point Cloud Registration
- Point Cloud Completion
- GenPC: Zero-shot Point Cloud Completion via 3D Generative Priors
- Self-Supervised Large Scale Point Cloud Completion for Archaeological Site Restoration
- Parametric Point Cloud Completion for Polygonal Surface Reconstruction
:star:code - PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors
- SuperPC: A Single Diffusion Model for Point Cloud Completion, Upsampling, Denoising, and Colorization
- 3D Point Clouds
- MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing
- Consistent Normal Orientation for 3D Point Clouds via Least Squares on Delaunay Graph
- RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds
- A Unified Approach to Interpreting Self-supervised Pre-training Methods for 3D Point Clouds via Interactions
- ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding
- UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting
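- Point Cloud Understanding
- Point Cloud + OD
- Point Cloud + Video Understanding
- Point Cloud + GR
- Point Cloud Anomaly Detection
- Point Cloud Reconstruction
20.Visual Question Answering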
- DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering
- CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
- Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding
:star:code - Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering
- Separation of Powers: On Segregating Knowledge from Observation in LLM-enabled Knowledge-based Visual Question Answering
- FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
- Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
- Video-QA
- Cross-modal Causal Relation Alignment for Video Question Grounding
:star:code - BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
:house:project - EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
- Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
- Audio-Visual Question Answering
19.UAV/RS/Satellite Image
- ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object
:star:code - A General Adaptive Dual-level Weighting Mechanism for Remote Sensing Pansharpening
:star:code - HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery
:house:project - ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects
- XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
- Adaptive Rectangular Convolution for Remote Sensing Pansharpening
- AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation
- Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space
- RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability
- SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images
- SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling
- Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution
- Gaussian Splatting for Efficient Satellite Image Photogrammetry
- SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion
- Towards Satellite Image Road Graph Extraction: A Global-Scale Dataset and A Novel Method
- Satellite to GroundScape - Large-scale Consistent Ground View Generation from Satellite Views
- MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting
- Cross-Rejective Open-Set SAR Image Registration
- Change Detection
- Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective
- The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation
- Feature Spectrum Learning for Remote Sensing Change Detection
- Deep Change Monitoring: A Hyperbolic Representative Learning Framework and a Dataset for Long-term Fine-grained Tree Change Detection
- Object Detection
- UAV Tracking
- Anti-UAV
18.Person Re-id
- SapiensID: Foundation for Human Recognition
- AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification
:star:code - From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization
- Cheb-GR: Rethinking K-nearest Neighbor Search in Re-ranking for Person Re-identification
- SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks
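- Text-to-Image Re-identification
- Visible-Infrared Re-identification
- Cloth-Changing Re-identification
- Lifelong Re-identification
- Counting
- Gait Recognition
- Person Retrieval
- Person Search
- Pedestrian Attribute Recognition
- De-identification
- Crowd Behavior Generation
17.Human-Object Interactions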
- HORP: Human-Object Relation Priors Guided HOI Detection
- InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions
:star:code - REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning
:star:code - SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection
:star:code - ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation
:star:code - Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions
:star:code - An Image-like Diffusion Method for Human-Object Interaction Detection
- Guiding Human-Object Interactions with Rich Geometry and Relations
:star:code - HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation
:star:code - ParaHome: Parameterizing Everyday Home Activities Towards 3D Generative Modeling of Human-Object Interactions
- Locality-Aware Zero-Shot Human-Object Interaction Detection
- InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
- End-to-End HOI Reconstruction Transformer with Graph-based Encoding
- InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing
- Human-Scene Interaction
16.Human Motion Generation
- DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion
- HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
- Dynamic Motion Blending for Versatile Motion Editing
- Continuous Space-Time Video Resampling with Invertible Motion Steganography
- MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos
- MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities
- Spk2SRImgNet: Super-Resolve Dynamic Scene from Spike Stream via Motion Aligned Collaborative Filtering
- ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling
- HuMoCon: Concept Discovery for Human Motion Understanding
- StickMotion: Generating 3D Human Motions by Drawing a Stickman
- SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance
:star:code - POMP: Physics-consistent Motion Generative Model through Phase Manifolds
- ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model
- Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
- MODA: Motion-Drift Augmentation for Inertial Human Motion Analysis
- MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention
:house:project
:star:code - GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior
:star:code - MixerMDM: Learnable Composition of Human Motion Diffusion Models
:house:project - From Sparse Signal to Smooth Motion: Real-Time Motion Generation with Rolling Prediction Models
:star:code - FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance
- The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
- EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space
- Deterministic-to-Stochastic Diverse Latent Feature Mapping for Human Motion Synthesis
- Gaussian Splashing: Unified Particles for Versatile Motion Synthesis and Rendering
- Move-in-2D: 2D-Conditioned Human Motion Generation
- Human Motion Instruction Tuning
- TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation
- Text-Driven Motion Generation
- Human Motion Recovery
- Human Motion Prediction
- ALIEN: Implicit Neural Representations for Human Motion Prediction under Arbitrary Latency
- Vision-Guided Action: Enhancing 3D Human Motion Prediction with Gaze-informed Affordance in 3D Scenes
- SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction
- Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic
- Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction
- LAL: Enhancing 3D Human Motion Prediction with Latency-aware Auxiliary Learning
15.Action Detection
- Are Spatial-Temporal Graph Convolution Networks for Human Action Recognition Over-Parameterized?
- Heterogeneous Skeleton-Based Action Representation Learning
- H-MoRe: Learning Human-centric Motion Representation for Action Analysis
- VideoGEM: Training-free Action Grounding in Videos
- Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
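- Skeleton-Based Action Recognition
- Few-Shot Action Recognition
- Zero-Shot Action Recognition
- Action Counting
- Action Detection
- Temporal Action Detection
- Temporal Action Localization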
- Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer
- Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models
- Boosting Point-Supervised Temporal Action Localization through Integrating Query Reformation and Optimal Transport
- Action Anticipation
14.Human Pose Estimation
- Visual Persona: Foundation Model for Full-Body Human Customization
:star:code - TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting
:star:code - Disentangled Pose and Appearance Guidance for Multi-Pose Generation
- IDOL: Instant Photorealistic 3D Human Creation from a Single Image
- ChatHuman: Chatting about 3D Humans with Tools
- UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
- MotionMap: Representing Multimodality in Human Pose Forecasting
- Towards Human-Understandable Multi-Dimensional Concept Discovery
- Human Pose Estimation
- PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation
- DynPose: Largely Improving the Efficiency of Human Pose Estimation by a Simple Dynamic Framework
- ProbPose: A Probabilistic Approach to 2D Human Pose Estimation
- MVDoppler-Pose: Multi-Modal Multi-View mmWave Sensing for Long-Distance Self-Occluded Human Walking Pose Estimation
- 3DHPE
- Human Reconstruction
- PICO: Reconstructing 3D People In Contact with Objects
:house:project - DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image
:star:code - Reconstructing Humans with a Biomechanically Accurate Skeleton
:star:code - InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
:house:project - Link to the Past: Temporal Propagation for Fast 3D Human Reconstruction from Monocular Video
- PSHuman: Photorealistic Single-image 3D Human Reconstruction using Cross-Scale Multiview Diffusion and Explicit Remeshing
- MultiGO: Towards Multi-level Geometry Learning for Monocular 3D Textured Human Reconstruction
- Human Shape Reconstruction
- PICO: Reconstructing 3D People In Contact with Objects
- Gesture Synthesis
- HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation
:star:code - Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding
- SocialGesture: Delving into Multi-person Gesture Understanding
- Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis
- Hand Pose Estimation
- Motion Capture
- Motion Estimation
- Human Mesh Recovery
- Hand Motion Synthesis/Reconstruction
- Blurred LiDAR for Sharper 3D: Robust Handheld 3D Scanning with Diffuse LiDAR and RGB
- HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos
- Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera
- Pose-Guided Temporal Enhancement for Robust Low-Resolution Hand Reconstruction
- How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions
- HandOS: 3D Hand Reconstruction in One Stage
- Estimating Body and Hand Motion in an Ego-sensed World
- WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild
- Sign Language
13.Medical Image Processing
- MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
- OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection
- MIMO: A Medical Vision Language Model with Visual Referring Multimodal Input and Pixel Grounding Multimodal Output
- Multi-modal Medical Diagnosis via Large-small Model Collaboration
- VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge
- WISE: A Framework for Gigapixel Whole-Slide-Image Lossless Compression
- SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding
- Learning Heterogeneous Tissues with Mixture of Experts for Gigapixel Whole Slide Images
- CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology
- BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models
- CT Denoising
- Tumor Segmentation
- LesionLocator: Zero-Shot Universal Tumor Segmentation and Tracking in 3D Whole-Body Imaging
- Cross-Modal Interactive Perception Network with Mamba for Lung Tumor Segmentation in PET-CT Images
:star:code - KMD: Koopman Multi-modality Decomposition for Generalized Brain Tumor Segmentation under Incomplete Modalities
- Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models
- CSC-PA: Cross-image Semantic Correlation via Prototype Attentions for Single-network Semi-supervised Breast Tumor Segmentation
- Incomplete Multi-modal Brain Tumor Segmentation via Learnable Sorting State Space Model
- SuperLightNet: Lightweight Parameter Aggregation Network for Multimodal Brain Tumor Segmentation
- X-ray
- Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation
- CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset
- Dual-view X-ray Detection: Can AI Detect Prohibited Items from Dual-view X-ray Images like Humans?
- CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices
- FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models
- Whole Slide Image Classification
- MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification
- HistoFS: Non-IID Histopathologic Whole Slide Image Classification via Federated Style Transfer with RoI-Preserving
- FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification
- M3amba: Memory Mamba is All You Need for Whole Slide Image Classification
- 2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification
- Medical Image Registration
- Medical Image Segmentation
- Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation
- Show and Segment: Universal Medical Image Segmentation via In-Context Learning
- Steady Progress Beats Stagnation: Mutual Aid of Foundation and Conventional Models in Mixed Domain Semi-Supervised Medical Image Segmentation
:star:code - DyCON: Dynamic Uncertainty-aware Consistency and Contrastive Learning for Semi-supervised Medical Image Segmentation
- A Semantic Knowledge Complementarity based Decoupling Framework for Semi-supervised Class-imbalanced Medical Image Segmentation
- EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation
- beta-FFT: Nonlinear Interpolation and Differentiated Training Strategies for Semi-Supervised Medical Image Segmentation
- Revisiting MAE Pre-training for 3D Medical Image Segmentation
- nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark
- Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline
- Annotation Ambiguity Aware Semi-Supervised Medical Image Segmentation
- Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation
- Unified Medical Lesion Segmentation via Self-referring Indicator
- Boost the Inference with Co-training: A Depth-guided Mutual Learning Framework for Semi-supervised Medical Polyp Segmentation
- Minding Fuzzy Regions: A Data-driven Alternating Learning Paradigm for Stable Lesion Segmentation
- Medical Image Analysis
- Interactive Medical Image Analysis with Concept-based Similarity Reasoning
:star:code - Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis
- Multi-modal Vision Pre-training for Medical Image Analysis
- dFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data Analysis
- Noise-Consistent Siamese-Diffusion for Medical Image Synthesis and Segmentation
- Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis
- Medical Image Re-identification
- Medical VQA
- 3D Medical Imaging
- Gene Expression Prediction
- Vessel Segmentation
- Radiology Report Generation
- MR Reconstruction
12.Autonomous Driving
- CarPlanner: Consistent Auto-regressive Trajectory Planning for Large-scale Reinforcement Learning in Autonomous Driving
- Bridging Past and Future: End-to-End Autonomous Driving with Historical Prediction and Planning
- HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving
:star:code - Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception
- MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving
- SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
- SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving
- DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
- T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving
- SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving
- DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving
- Don't Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving
- VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving
- JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data
- SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
- GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving
- Distilling Multi-modal Large Language Models for Autonomous Driving
- Efficient Dynamic Scene Editing via 4D Gaussian-based Static-Dynamic Separation
- SceneCrafter: Controllable Multi-View Driving Scene Editing
- EvOcc: Accurate Semantic Occupancy for Automated Driving Using Evidence Theory
- Vehicle Re-Identification
- Lane Detection
- Trajectory Prediction
- Who Walks With You Matters: Perceiving Social Interactions with Groups for Pedestrian Trajectory Prediction
- MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation
:star:code - Multi-modal Knowledge Distillation-based Human Trajectory Forecasting
:star:code - Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment
:star:code
:star:code - Trajectory Mamba: Efficient Attention-Mamba Forecasting Model Based on Selective SSM
- TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception
- Leveraging SD Map to Augment HD Map-based Trajectory Prediction
- Certified Human Trajectory Prediction
- SocialMOIF: Multi-Order Intention Fusion for Pedestrian Trajectory Prediction
- Towards Generalizable Trajectory Prediction using Dual-Level Representation Learning and Adaptive Prompting
- Enduring, Efficient and Robust Trajectory Prediction Attack in Autonomous Driving via Optimization-Driven Multi-Frame Perturbation Framework
- Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
- Adapting to Observation Length of Trajectory Prediction via Contrastive Learning
- 3D Occupancy Prediction
- 3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation
- GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
- GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
- OccMamba: Semantic Occupancy Prediction with State Space Models
- Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction
- SDGOCC: Semantic and Depth-Guided Bird's-Eye View Transformation for 3D Multimodal Occupancy Prediction
11.Object Tracking
- SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking
:star:code - Exploring Historical Information for RGBE Visual Tracking with Mamba
- Autoregressive Sequential Pretraining for Visual Tracking
- ACAttack: Adaptive Cross Attacking RGB-T Tracker via Multi-Modal Response Decoupling
- PURA: Parameter Update-Recovery Test-Time Adaption for RGB-T Tracking
- Object Tracking
- MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking
:star:code - MITracker: Multi-View Integration for Visual Object Tracking
- DreamTrack: Dreaming the Future for Multimodal Visual Object Tracking
- A Distractor-Aware Memory for Visual Object Tracking with SAM2
- HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos
- 3D Object Tracking
- Multi-Object Tracking
- Point Tracking
10.Object Detection
- SparseAlign: A Fully Sparse Framework for Cooperative Object Detection
- MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism
- Test-Time Backdoor Detection for Object Detection Models
- BOOTPLACE: Bootstrapped Object Placement with Detection Transformers
:star:code
:star:code - Solving Instance Detection from an Open-World Perspective
- FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error
- Mr. DETR: Instructive Multi-Route Training for Detection Transformers
- Can't Slow Me Down: Learning Robust and Hardware-Adaptive Object Detectors against Latency Attacks for Edge Devices
- Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning
- COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts
- Forensic Self-Descriptions Are All You Need for Zero-Shot Detection, Open-Set Source Attribution, and Clustering of AI-generated Images
- Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels
- Visual Consensus Prompting for Co-Salient Object Detection
:star:code - OW-OVD: Unified Open World and Open Vocabulary Object Detection
- PointSR: Self-Regularized Point Supervision for Drone-View Object Detection
- Learning Endogenous Attention for Incremental Object Detection
- Percept, Memory, and Imagine: World Feature Simulating for Open-Domain Unknown Object Detection
- Open-World Objectness Modeling Unifies Novel Object Detection
- ReRAW: RGB-to-RAW Image Reconstruction via Stratified Sampling for Efficient Object Detection on the Edge
- Efficient Event-Based Object Detection: A Hybrid Neural Network with Spatial and Temporal Attention
- Towards RAW Object Detection in Diverse Conditions
- Object Detection using Event Camera: A MoE Heat Conduction based Detector and A New Benchmark Dataset
- Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection
- ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models
- Revisiting Generative Replay for Class Incremental Object Detection
- Brain-Inspired Spiking Neural Networks for Energy-Efficient Object Detection
- Believing is Seeing: Unobserved Object Detection using Generative Models
- 3D Object Detection
- Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras
:star:code - GBlobs: Explicit Local Structure via Gaussian Blobs for Improved Cross-Domain LiDAR-based 3D Object Detection
- UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection
- GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector
:star:code - Uncertainty Meets Diversity: A Comprehensive Active Learning Framework for Indoor 3D Object Detection
- Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection
:star:code - MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection
:star:code - RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
- Beyond Human Perception: Understanding Multi-Object World from Monocular View
- MAD: Memory-Augmented Detection of 3D Objects
- FASTer: Focal token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection
- Leveraging Temporal Cues for Semi-Supervised Multi-View 3D Object Detection
- SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts
- Cubify Anything: Scaling Indoor 3D Object Detection
- CorrBEV: Multi-View 3D Object Detection by Correlation Learning with Multi-modal Prototypes
- FSHNet: Fully Sparse Hybrid Network for 3D Object Detection
- V2X-R: Cooperative LiDAR-4D Radar Fusion with Denoising Diffusion for 3D Object Detection
- MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors
- RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection
- ViKIENet: Towards Efficient 3D Object Detection with Virtual Key Instance Enhanced Network
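- Small Object Detection
- Oriented Object Detection
- Long-Tailed Object Detection
- Camouflaged Object Detection
- Salient Object Detection
- Domain-Adaptive Object Detection
- Thermal Object Detection
- Open-Vocabulary Object Detection
- Object Discovery
- Attribute Recognition
- Object Keypoints
- X-ray Baggage Security Inspection
- Scene Change Detection
- Shadow Detection
9.Image/Video Retrieval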
- Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval
- COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Adaptation
- MaRI: Material Retrieval Integration across Domains
- Graph-Embedded Structure-Aware Perceptual Hashing for Neural Network Protection and Piracy Detection
- AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing
- Image Retrieval
- Cross-Modal Retrieval
- Video-Text Retrieval
- Text-Video Retrieval
- Composed Image Retrieval
- Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval
:star:code - CoLLM: A Large Language Model for Composed Image Retrieval
:star:code - Learning with Noisy Triplet Correspondence for Composed Image Retrieval
- Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval
- Generative Zero-Shot Composed Image Retrieval
- CCIN: Compositional Conflict Identification and Neutralization for Composed Image Retrieval
- Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy
- ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval
- Nearest Neighbor Search (NNS)
8.Image/Video Captions
- Semantic and Expressive Variations in Image Captions Across Languages
- Diffusion Bridge: Leveraging Diffusion Model to Reduce the Modality Gap Between Text and Vision for Zero-Shot Image Captioning
- BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
- Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
- BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs
- Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
- Variance-Based Membership Inference Attacks Against Large-Scale Image Captioning Models
- FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
- Video Captioning
7.Image/Video Compression
- Sampling Innovation-Based Adaptive Compressive Sensing
:star:code - HUNet: Homotopy Unfolding Network for Image Compressive Sensing
- PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
- Image Compression
- Learned Image Compression with Dictionary-based Entropy Model
- Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion
- Balanced Rate-Distortion Optimization in Learned Image Compression
- Taming Large Multimodal Agents for Ultra-low Bitrate Semantically Disentangled Image Compression
:star:code - MambaIC: State Space Models for High-Performance Learned Image Compression
:star:code - Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing
:star:code - PICD: Versatile Perceptual Image Compression with Diffusion Rendering
- Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression
- Bridging the Gap between Gaussian Diffusion Models and Universal Quantization for Image Compression
- Multirate Neural Image Compression with Adaptive Lattice Vector Quantization
- Linear Attention Modeling for Learned Image Compression
- Fitted Neural Lossless Image Compression
- Frequency-Biased Synergistic Design for Image Compression and Compensation
- Test-Time Fine-Tuning of Image Compression Models for Multi-Task Adaptability
- Video Compression
- Towards Practical Real-Time Neural Video Compression
:star:code - Neural Video Compression with Context Modulation
:star:code - High Dynamic Range Video Compression: A Large-Scale Benchmark Dataset and A Learned Bit-depth Scalable Compression Algorithm
- FLAVC: Learned Video Compression with Feature Level Attention
- RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression
- Perceptual Video Compression with Neural Wrapping
- ECVC: Exploiting Non-Local Correlations in Multiple Frames for Contextual Video Compression
6.Image Classification
- DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition
- Mamba-Adaptor: State Space Model Adaptor for Visual Recognition
- Lessons and Insights from a Unifying Study of Parameter-Efficient Fine-Tuning (PEFT) in Visual Recognition
- 5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks
- Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning
- No Pains, More Gains: Recycling Sub-Salient Patches for Efficient High-Resolution Image Recognition
- EZSR: Event-based Zero-Shot Recognition
- DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
:star:code - Learning from Neighbors: Category Extrapolation for Long-Tail Learning
- Text Augmented Correlation Transformer For Few-shot Classification & Segmentation
- Task-Specific Gradient Adaptation for Few-Shot One-Class Classification
- Image Classification
- End-to-End Implicit Neural Representations for Classification
:star:code - ProAPO: Progressively Automatic Prompt Optimization for Visual Classification
- Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning
- STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification
- Explaining Domain Shifts in Language: Concept erasing for Interpretable Image Classification
:star:code - Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers
- Interpretable Image Classification via Non-parametric Part Prototype Learning
- Beyond Image Classification: A Video Benchmark and Dual-Branch Hybrid Discrimination Framework for Compositional Zero-Shot Learning
- Multi-Label Recognition
- Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport
:star:code - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models
- Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification
:star:code
5.Image Super-Resolution
- DifIISR: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution
:star:code - CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution
- Volume Tells: Dual Cycle-Consistent Diffusion for 3D Fluorescence Microscopy De-noising and Super-Resolution
- The Power of Context: How Multimodality Improves Image Super-Resolution
:house:project - Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model
- Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models
:star:code - FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution
- Progressive Focused Transformer for Single Image Super-Resolution
- ADD: Attribution-Driven Data Augmentation Framework for Boosting Image Super-Resolution
- Adversarial Diffusion Compression for Real-World Image Super-Resolution
- HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution
- DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution
- TSP-Mamba: The Travelling Salesman Problem Meets Mamba for Image Super-resolution and Beyond
- Adaptive Dropout: Unleashing Dropout across Layers for Generalizable Image Super-Resolution
- AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning
- Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach
- Auto-Encoded Supervision for Perceptual Image Super-Resolution
- TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution
- Edge-SD-SR: Low Latency and Parameter Efficient On-device Super-Resolution with Stable Diffusion via Bidirectional Conditioning
- PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution
- Decoupling Fine Detail and Global Geometry for Compressed Depth Map Super-Resolution
- Arbitrary-steps Image Super-resolution via Diffusion Inversion
- Augmenting Perceptual Super-Resolution via Image Quality Predictors
- QMambaBSR: Burst Image Super-Resolution with Query State Space Model
- VSR
- BF-STVSR: B-Splines and Fourier---Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution
- Efficient Video Super-Resolution for Real-time Rendering with Decoupled G-buffer Guidance
- EvEnhancer: Empowering Effectiveness, Efficiency and Generalizability for Continuous Space-Time Video Super-Resolution with Events
- PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution
- Event-based Video Super-Resolution via State Space Models
- Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution
- Hazy Low-Quality Satellite Video Restoration Via Learning Optimal Joint Degradation Patterns and Continuous-Scale Super-Resolution Reconstruction
- VideoGigaGAN: Towards Detail-rich Video Super-Resolution
4.Image/Video Processing
- Segment Any-Quality Images with Generative Latent Space Enhancement
- AA: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Inpainting
- MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting
- 3D Inpainting
- Image Enhancement
- Image Inpainting
- Image Restoration
- From Zero to Detail: Deconstructing Ultra-High-Definition Image Restoration from Progressive Spectral Perspective
:star:code - Making Old Film Great Again: Degradation-aware State Space Model for Old Film Restoration
- Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways
- Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration
- MaIR: A Locality- and Continuity-Preserving Mamba for Image Restoration
- Navigating Image Restoration with VAR's Distribution Alignment Prior
- UHD-processer: Unified UHD Image Restoration with Progressive Frequency Learning and Degradation-aware Prompts
- A Universal Scale-Adaptive Deformable Transformer for Image Restoration across Diverse Artifacts
- Complexity Experts are Task-Discriminative Learners for Any Image Restoration
- A Regularization-Guided Equivariant Approach for Image Restoration
- Adapting Text-to-Image Generation with Feature Difference Instruction for Generic Image Restoration
- ACL: Activating Capability of Linear Attention for Image Restoration
- JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
- Reversing Flow for Image Restoration
- Dual Prompting Image Restoration with Diffusion Transformers
- UniRestore: Unified Perceptual and Task-Oriented Image Restoration Model Using Diffusion Prior
- VolFormer: Explore More Comprehensive Cube Interaction for Hyperspectral Image Restoration and Beyond
- Low-Light Image Restoration
- All-in-One Image Restoration
- Zero-Shot Image Restoration
- Watermark Removal
- Dehazing
- Iterative Predictor-Critic Code Decoding for Real-World Image Dehazing
- Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing
:star:code - Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images
- CoA: Towards Real Image Dehazing via Compression-and-Adaptation
- Denoising
- BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance
- Denoising Functional Maps: Diffusion Models for Shape Correspondence
:star:code - RaSS: Improving Denoising Diffusion Samplers with Reinforced Active Sampling Scheduler
- Optimizing for the Shortest Path in Denoising Diffusion Model
:star:code - DnLUT: Ultra-Efficient Color Image Denoising via Channel-Aware Lookup Tables
:star:code - Rotation-Equivariant Self-Supervised Method in Image Denoising
- Zero-Shot Blind-spot Image Denoising via Implicit Neural Sampling
- Positive2Negative: Breaking the Information-Lossy Barrier in Self-Supervised Single Image Denoising
- Rethinking Reconstruction and Denoising in the Dark: New Perspective, General Architecture and Beyond
- All-Optical Nonlinear Diffractive Deep Network for Ultrafast Image Denoising
- Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models
- Complementary Advantages: Exploiting Cross-Field Frequency Correlation for NIR-Assisted Image Denoising
- Noise Modeling in One Hour: Minimizing Preparation Efforts for Self-supervised Low-Light RAW Image Denoising
- Deraining
- Desnowing
- Demosaicing
- Deblurring
- DiET-GS: Diffusion Prior and Event Stream-Assisted Motion Deblurring 3D Gaussian Splatting
:star:code - Parameterized Blur Kernel Prior Learning for Local Motion Deblurring
- DynaMoDe-NeRF: Motion-aware Deblurring Neural Radiance Field for Dynamic Scenes
- Exploiting Deblurring Networks for Radiance Fields
- A Polarization-Aided Transformer for Image Deblurring via Motion Vector Decomposition
- Diffusion-based Event Generation for High-Quality Image Deblurring
- Efficient Visual State Space Model for Image Deblurring
- Gyro-based Neural Single Image Deblurring
- Image Quality
- Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption
:star:code - Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment
- Image Quality Assessment: Investigating Causal Perceptual Effects with Abductive Counterfactual Inference
- Distilling Spatially-Heterogeneous Distortion Perception for Blind Image Quality Assessment
- Image Quality Assessment: From Human to Machine Preference
- Teaching Large Language Models to Regress Accurate Image Quality Scores Using Score Distribution
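- Video Enhancement
- Video Deraining
- Video Denoising
- Video Inpainting
- Video Quality Assessment
- Reflection Removal
- Shadow Removal
- Highlight Removal
- Adverse Weather Removal
- Video Outpainting
3.Image Segmentation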
- Your ViT is Secretly an Image Segmentation Model
:house:project - CoMBO: Conflict Mitigation via Branched Optimization for Class Incremental Segmentation
:star:code - DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning
:star:code - FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation
- Scaling up Image Segmentation across Data and Tasks
- Prototype-Based Image Prompting for Weakly Supervised Histopathological Image Segmentation
- The Impact Label Noise and Choice of Threshold has on Cross-Entropy and Soft-Dice in Image Segmentation
- Rethinking Query-based Transformer for Continual Image Segmentation
- EntityErasure: Erasing Entity Cleanly via Amodal Entity Segmentation and Completion
- UNICL-SAM: Uncertainty-Driven In-Context Segmentation with Part Prototype Discovery
- Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather
- Repurposing Stable Diffusion Attention for Training-Free Unsupervised Interactive Segmentation
- NTClick: Achieving Precise Interactive Segmentation With Noise-tolerant Clicks
- Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation
- Soft Self-labeling and Potts Relaxations for Weakly-supervised Segmentation
- Boosting the Dual-Stream Architecture in Ultra-High Resolution Segmentation with Resolution-Biased Uncertainty Estimation
- Towards Continual Universal Segmentation
- 3D Segmentation
- OnlineAnySeg: Online Zero-Shot 3D Segmentation by Visual Foundation Model Guided 2D Mask Merging
- Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation
- D^3CTTA: Domain-Dependent Decorrelation for Continual Test-Time Adaption of 3D LiDAR Segmentation
- 3D Dental Model Segmentation with Geometrical Boundary Preserving
- 3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation
- Referring Image Segmentation
- Few-Shot Segmentation
- Semantic Co-Segmentation
- Semantic Segmentation
- A Dataset for Semantic Segmentation in the Presence of Unknowns
- DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation
:star:code - No Thing, Nothing: Highlighting Safety-Critical Classes for Robust LiDAR Semantic Segmentation in Adverse Weather
- Generative Map Priors for Collaborative BEV Semantic Segmentation
- Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation
- SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation
:star:code - Golden Cudgel Network for Real-Time Semantic Segmentation
- Beyond Background Shift: Rethinking Instance Replay in Continual Semantic Segmentation
- MaSS13K: A Matting-level Semantic Segmentation Benchmark
- Convex Combination Star Shape Prior for Data-driven Image Semantic Segmentation
- SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation
- SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes
- Open-Vocabulary Semantic Segmentation
- DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation
- Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation
:star:code
:house:project - LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation
- Exploring Simple Open-Vocabulary Semantic Segmentation
- Dual Semantic Guidance for Open Vocabulary Semantic Segmentation
- Understanding Fine-tuning CLIP for Open-vocabulary Semantic Segmentation in Hyperbolic Space
- Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation
- Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation
- Weakly Supervised Semantic Segmentation
- Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation
- Multi-Label Prototype Visual Spatial Search for Weakly Supervised Semantic Segmentation
- Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation
- Weakly Supervised Semantic Segmentation via Progressive Confidence Region Expansion
- POT: Prototypical Optimal Transport for Weakly Supervised Semantic Segmentation
- FFR: Frequency Feature Rectification for Weakly Supervised Semantic Segmentation
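- Semi-Supervised Semantic Segmentation
- Domain-Adaptive Semantic Segmentation
- Domain-Generalized Semantic Segmentation
- 3D Semantic Segmentation
- Panoptic Segmentation
- Instance Segmentation
- Scene Segmentation
- Crack Segmentation
- Action Segmentation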
- VIS
- VSS
- VOS
- Matting
- Part Segmentation
- Video Segmentation
- GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
- SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost
- SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation
- Decoupled Motion Expression Video Segmentation
2.Face
- Zero-Shot Head Swapping in Real-World Scenarios
- S^3-Face: SSS-Compliant Facial Reflectance Estimation via Diffusion Priors
- Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks
- FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy
:star:code - FFaceNeRF: Few-shot Face Editing in Neural Radiance Fields
:star:code - Enhancing Facial Privacy Protection via Weakening Diffusion Purification
- Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model
- AVF-MAE++: Scaling Affective Video Facial Masked Autoencoders via Efficient Audio-Visual Self-Supervised Learning
- From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
:house:project - Learning Person-Specific Animatable Face Models from In-the-Wild Images via a Shared Base Model
- ControlFace: Harnessing Facial Parametric Control for Face Rigging
- PersonaHOI: Effortlessly Improving Face Personalization in Human-Object Interaction Generation
- FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning
- GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model
- SFDM: Robust Decomposition of Geometry and Reflectance for Realistic Face Rendering from Sparse-view Images
- Face Restoration
- Face Recognition
- UMFN: Unified Multi-Domain Face Normalization for Joint Cross-domain Prototype Learning and Heterogeneous Face Recognition
- CryptoFace: End-to-End Encrypted Face Recognition
- Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content
- Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
- Improving the Transferability of Adversarial Attacks on Face Recognition with Diverse Parameters Augmentation
- GIF: Generative Inspiration for Face Recognition at Scale
- ProjAttacker: A Configurable Physical Adversarial Attack for Face Recognition via Projector
- Data Synthesis with Diverse Styles for Face Recognition via 3DMM-Guided Diffusion
- Facial Expression Recognition
- Face Anti-Spoofing
- Fake Face Recognition/Detection
- Facial Landmarks
- Talking Head
- InsTaG: Learning Personalized 3D Talking Head from Few-Second Video
:star:code - Monocular and Generalizable Gaussian Talking Head Animation
- IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation
:house:project - Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling
- IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos
- Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
- DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations
- Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation
- EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion
- Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation
- LLM-driven Multimodal and Multi-Identity Listening Head Generation
- INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
- Emotion Recognition
- CocoER: Aligning Multi-Level Feature by Competition and Coordination for Emotion Recognition
- Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition
- Seek Common Ground While Reserving Differences: Semi-Supervised Image-Text Sentiment Recognition
- Uncertain Multimodal Intention and Emotion Understanding in the Wild
- Facial Motion Generation
- Micro-Expression Recognition
- Portrait
- Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset
- HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion
- SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces
- HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis
- Coherent 3D Portrait Video Reconstruction via Triplane Fusion
- DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis
1.Others
- Hyperbolic Category Discovery
- PRaDA: Projective Radial Distortion Averaging
- One-Step Event-Driven High-Speed Autofocus
- Color Alignment in Diffusion
- LatexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending
- Effortless Active Labeling for Long-Term Test-Time Adaptation
:star:code - EventFly: Event Camera Perception from Ground to the Sky
:star:code - PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model
:house:project - Attention IoU: Examining Biases in CelebA using Attention Maps
:star:code - Resilient Sensor Fusion under Adverse Sensor Failures via Multi-Modal Expert Fusion
- Interpretable Generative Models through Post-hoc Concept Bottlenecks
:star:code - Stop Walking in Circles! Bailing Out Early in Projected Gradient Descent
- Color Conditional Generation with Sliced Wasserstein Guidance
:star:code - Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing
:star:code - Exploring Contextual Attribute Density in Referring Expression Counting
- Scale Efficient Training for Large Datasets
:star:code - Learning from Streaming Video with Orthogonal Gradients
- Mesh Mamba: A Unified State Space Model for Saliency Prediction in Non-Textured and Textured Meshes
- SCAP: Transductive Test-Time Adaptation via Supportive Clique-based Attribute Prompting
:star:code - RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance
:star:code - Learning Extremely High Density Crowds as Active Matters
- Transformers without Normalization
:star:code - UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
- MetricGrids: Arbitrary Nonlinear Approximation with Elementary Metric Grids based Implicit Neural Representation
:star:code - Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
:star:code - Data-free Universal Adversarial Perturbation with Pseudo-semantic Prior
- Entropy Bootstrapping for Weakly Supervised Nuclei Detection
- EgoLife: Towards Egocentric Life Assistant
:star:code
:star:code - Q-PART: Quasi-Periodic Adaptive Regression with Test-time Training for Pediatric Left Ventricular Ejection Fraction Regression
- STAA-SNN: Spatial-Temporal Attention Aggregator for Spiking Neural Networks
- Voxel-Aggregated Feature Synthesis: Efficient Dense Mapping for Simulated 3D Reasoning
- LoyalDiffusion: A Diffusion Model Guarding Against Data Replication
- Do computer vision foundation models learn the low-level characteristics of the human visual system?
- VDial: Unification of Video and Visual Dialog via Multimodal Experts
- ArcPro: Architectural Programs for Structured 3D Abstraction of Sparse Points
:house:project - h-Edit: Effective and Flexible Diffusion-Based Editing via Doob's h-Transform
:star:code - Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
:star:code - PIDLoc: Cross-View Pose Optimization Network Inspired by PID Controllers
- OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
:star:code - ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning
:star:code - CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
:star:code - Finding Local Diffusion Schrödinger Bridge using Kolmogorov-Arnold Network
:star:code - Rethinking Epistemic and Aleatoric Uncertainty for Active Open-Set Annotation: An Energy-Based Approach
:star:code - Knowledge Bridger: Towards Training-free Missing Multi-modality Completion
- STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
- Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text
- GenVDM: Generating Vector Displacement Maps From a Single Image
- WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
:star:code - SpiritSight Agent: Advanced GUI Agent with One Look
:house:project - DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
:star:code - Do ImageNet-trained models learn shortcuts? The impact of frequency shortcuts on generalization
- CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization
:star:code - Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers
:star:code - AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data
:star:code
:star:code - Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces
- DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction
:star:code - Reasoning in visual navigation of end-to-end trained agents: a dynamical systems approach
- ObjectMover: Generative Object Movement with Video Prior
:star:code - RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories
- Robust Multimodal Survival Prediction with the Latent Differentiation Conditional Variational AutoEncoder
- DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction
- Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness
- Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?
:star:code - DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers
:house:project - RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images
:star:code - DropGaussian: Structural Regularization for Sparse-view Gaussian Splatting
:star:code - Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes
:star:code - CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification
- VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness
:star:code - BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers
- MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
- Embedding Shift Dissection on CLIP: Effects of Augmentations on VLM's Representation Learning
- Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
- Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising
:star:code - A New Statistical Model of Star Speckles for Learning to Detect and Characterize Exoplanets in Direct Imaging Observations
- UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
:star:code - SURGEON: Memory-Adaptive Fully Test-Time Adaptation via Dynamic Activation Sparsity
- Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability
:star:code - RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation
- Navigation World Models
- CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
- Scene Map-based Prompt Tuning for Navigation Instruction Generation
- Faster Parameter-Efficient Tuning with Token Redundancy Reduction
- SyncSDE: A Probabilistic Framework for Diffusion Synchronization
- Test-Time Visual In-Context Tuning
:star:code - LOCORE: Image Re-ranking with Long-Context Sequence Modeling
:star:code - CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
:house:project - BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology
:star:code - Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments
- Locally Orderless Images for Optimization in Differentiable Rendering
:star:code - ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning
- MVSAnywhere: Zero-Shot Multi-View Stereo
- LSNet: See Large, Focus Small
:star:code - Enhancing Creative Generation on Stable Diffusion-based Models
- Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks
- COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
- MultiMorph: On-demand Atlas Construction
- POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation
:house:project - CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images
- Two is Better than One: Efficient Ensemble Defense for Robust and Compact Models
- DefMamba: Deformable Visual State Space Model
- Few-shot Personalized Scanpath Prediction
:star:code - Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking
:star:code - Seurat: From Moving Points to Depth
:star:code - CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning
:star:code - Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
:star:code - GeoMM: On Geodesic Perspective for Multi-modal Learning
- Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
- DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos
:star:code - UniPhy: Learning a Unified Constitutive Model for Inverse Physics Simulation
- Sample- and Parameter-Efficient Auto-Regressive Image Models
- UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion
- Doppelgangers and Adversarial Vulnerability
- Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation
- OpenSDI: Spotting Diffusion-Generated Images in the Open World
- Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
- GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation
- LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant
- IncEventGS: Pose-Free Gaussian Splatting from a Single Event Camera
- Improving Gaussian Splatting with Localized Points Management
- Hardware-Rasterized Ray-Based Gaussian Splatting
- FlexDrive: Toward Trajectory Flexibility in Driving Scene Gaussian Splatting Reconstruction and Rendering
- Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh
- PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting
- 3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
- VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction
- IRGS: Inter-Reflective Gaussian Splatting with 2D Gaussian Ray Tracing
- CoCoGaussian: Leveraging Circle of Confusion for Gaussian Splatting from Defocused Images
- Feat2GS: Probing Visual Foundation Models with Gaussian Splatting
- DeSplat: Decomposed Gaussian Splatting for Distractor-Free Rendering
- RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting
- USP-Gaussian: Unifying Spike-based Image Reconstruction, Pose Correction and Gaussian Splatting
- DepthSplat: Connecting Gaussian Splatting and Depth
- Splatter-360: Generalizable 360 Gaussian Splatting for Wide-baseline Panoramic Images
- Is this Generated Person Existed in Real-world? Fine-grained Detecting and Calibrating Abnormal Human-body
- R2C: Mapping Room to Chessboard to Unlock LLM As Low-Level Action Planner
- TAET: Two-Stage Adversarial Equalization Training on Long-Tailed Distributions
- From Head to Tail: Efficient Black-box Model Inversion Attack via Long-tailed Learning
- Generalized Zero-Shot Classification via Semantics-Free Inter-Class Feature Generation
- CLOC: Contrastive Learning for Ordinal Classification with Multi-Margin N-pair Loss
- Enhancing Testing-Time Robustness for Trusted Multi-View Classification in the Wild
- Efficient ANN-Guided Distillation: Aligning Rate-based Features of Spiking Neural Networks through Hybrid Block-wise Replacement
- VISTREAM: Improving Computation Efficiency of Visual Streaming Perception via Law-of-Charge-Conservation Inspired Spiking Neural Network
- IDEA-Bench: How Far are Generative Models from Professional Designing?
- PhD: A ChatGPT-Prompted Visual Hallucination Evaluation Dataset
- Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM
- Inference-Scale Complexity in ANN-SNN Conversion for High-Performance and Low-Power Applications
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
- MagicArticulate: Make Your 3D Models Articulation-Ready
- Gain from Neighbors: Boosting Model Robustness in the Wild via Adversarial Perturbations Toward Neighboring Classes
- Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation
- ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
- ReCap: Better Gaussian Relighting with Cross-Environment Captures
- MambaOut: Do We Really Need Mamba for Vision?
- Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment
- AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
- Language-Guided Image Tokenization for Generation
- ProReflow: Progressive Reflow with Decomposed Velocity
- D^3-Human: Dynamic Disentangled Digital Human from Monocular Video
- BADGR: Bundle Adjustment Diffusion Conditioned by Gradients for Wide-Baseline Floor Plan Reconstruction
- TANGO: Training-free Embodied AI Agents for Open-world Tasks
- Nested Diffusion Models Using Hierarchical Latent Priors
- A Theory of Learning Unified Model via Knowledge Integration from Label Space Varying Domains
- Spiking Transformer with Spatial-Temporal Attention
- SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
- Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention
- MaDCoW: Marginal Distortion Correction for Wide-Angle Photography with Arbitrary Objects
- SynTab-LLaVA: Enhancing Multimodal Table Understanding with Decoupled Synthesis
- Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing
- Improving Accuracy and Calibration via Differentiated Deep Mutual Learning
- Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors
- Event Ellipsometer: Event-based Mueller-Matrix Video Imaging
- Hiding Images in Diffusion Models by Editing Learned Score Functions
- Tightening Robustness Verification of MaxPool-based Neural Networks via Minimizing the Over-Approximation Zone
- PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?
- Improve Representation for Imbalanced Regression through Geometric Constraints
- Latent Space Imaging
- D2SP: Dynamic Dual-Stage Purification Framework for Dual Noise Mitigation in Vision-based Affective Recognition
- LaVin-DiT: Large Vision Diffusion Transformer
- DiffFNO: Diffusion Fourier Neural Operator
- Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization
- DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery
- LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping
- ShowMak3r: Compositional TV Show Reconstruction
- A Unified, Resilient, and Explainable Adversarial Patch Detector
- Disentangling Safe and Unsafe Image Corruptions via Anisotropy and Locality
- SphereUFormer: A U-Shaped Transformer for Spherical 360 Perception
- StyleMaster: Stylize Your Video with Artistic Generation and Translation
- Unsupervised Continual Domain Shift Learning with Multi-Prototype Modeling
- Open-Canopy: Towards Very High Resolution Forest Monitoring
- Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
- Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation
- DiTASK: Multi-Task Fine-Tuning with Diffeomorphic Transformations
- Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing
- VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification
- Instruction-based Image Manipulation by Watching How Things Move
- Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling
- SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
- Self-Supervised Learning for Color Spike Camera Reconstruction
- [From Elements to Design: A Layered Approach for Automatic Graphic Design Composition](https://openaccess.thecvf.com/content/CVPR2025/html/Lin_From_Ele