ICCV-2025-Papers
November 24, 2025
Conference dates: October 19-23, 2025
Conference website: https://iccv.thecvf.com/
For 2025 survey papers, click here ↘️2025-CV-Surveys
Click here for the 2025 papers by category
↘️WACV-2025-Papers ↘️CVPR-2025-Papers ↘️ICCV-2025-Papers
Click here for the 2024 papers by category
↘️WACV-2024-Papers ↘️CVPR-2024-Papers ↘️ECCV-2024-Papers
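Click here for the 2023 papers by category
Click here for the 2022 papers by category
Click here for the 2021 papers by category
Click here for the 2020 papers by category
All papers have been categorized
🏆Best Paper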
- Generating Physically Stable and Buildable Brick Structures from Text
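:house:project :house:project - ICCV 2025 Best Paper announced! Carnegie Mellon University proposes BrickGPT: generating physical brick structures from text while ensuring they stay stable when built!
Contents
54.Computational Imaging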
- IM360 Large-scale Indoor Mapping with 360 Cameras
- Multispectral Demosaicing via Dual Cameras
- Processing and acquisition traces in visual encoders What does CLIP know about your camera
:star:code - Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras
- Estimating 2D Camera Motion with Hybrid Motion Basis
:star:code
:star:code - Image as an IMU Estimating Camera Motion from a Single Motion-Blurred Image
- AlignDiff Learning Physically-Grounded Camera Alignment via Diffusion
- TrajectoryCrafter Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
- Super Resolved Imaging with Adaptive Optics
:house:project - HccePose(BF) Predicting Front Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation
- RePoseD Efficient Relative Pose Estimation With Known Depth Information
:star:code - Scaling 3D Compositional Models for Robust Classification and Pose Estimation
- DRaM-LHM A Quaternion Framework for Iterative Camera Pose Estimation
- Epipolar Consistent Attention Aggregation Network for Unsupervised Light Field Disparity Estimation
- TESPEC Temporally-Enhanced Self-Supervised Pretraining for Event Cameras
:house:project - Simultaneous Motion And Noise Estimation with Event Cameras
:star:code :house:project - EventUPS Uncalibrated Photometric Stereo Using an Event Camera
- GenDoP Auto-regressive Camera Trajectory Generation as a Director of Photography
:house:project - Inverse Image-Based Rendering for Light Field Generation from Single Images
- Princeton365 A Diverse Dataset with Accurate Camera Pose
- CF3 Compact and Fast 3D Feature Fields
- CCMNet Leveraging Calibrated Color Correction Matrices for Cross-Camera Color Constancy
53.Dense Prediction
- Frequency-Dynamic Attention Modulation for Dense Prediction
:star:code - FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment
:star:code - ATAS Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction
- Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction
:star:code - Enhancing Mamba Decoder with Bidirectional Interaction in Multi-Task Dense Prediction
52.Gaze
- Multi-view Gaze Target Estimation
:house:project - Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
:star:code
:star:code
- Visual attention prediction
- Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths
- What we need is explicit controllability Training 3D gaze estimator using only facial images
:star:code
51.Visual Relationship Detection (VRD)
50.Protecting copyright
- TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity
- Your Text Encoder Can Be An Object-Level Watermarking Controller
- SpecGuard Spectral Projection-based Advanced Invisible Watermarking
:star:code - Learning Robust Image Watermarking with Lossless Cover Recovery
:star:code - SynTag Enhancing the Geometric Robustness of Inversion-based Generative Image Watermarking
- PlugMark A Plug-in Zero-Watermarking Framework for Diffusion Models
- ROAR Reducing Inversion Error in Generative Image Watermarking
- SEAL Semantic Aware Image Watermarking
- Semantic Watermarking Reinvented Enhancing Robustness and Generation Quality with Fourier Integrity
:star:code - Invisible Watermarks Visible Gains Steering Machine Unlearning with Bi-Level Watermarking Design
- TrustMark Robust Watermarking and Watermark Removal for Arbitrary Resolution Images
- Attention to Neural Plagiarism Diffusion Models Can Plagiarize Your Copyrighted Images
:star:code - From Imitation to Innovation: The Emergence of AI's Unique Artistic Styles and the Challenge of Copyright Protection
49.biometric recognition
- DisenQ: Disentangling Q-Former for Activity-Biometrics
- A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition
- Fingerprint
48.Industrial Anomaly Detection
- RareCLIP Rarity-aware Online Zero-shot Industrial Anomaly Detection
:star:code - ReMP-AD Retrieval-enhanced Multi-modal Prompt Fusion for Few-Shot Industrial Visual Anomaly Detection
:star:code - G2SF Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection
:star:code - Anomaly Detection of Integrated Circuits Package Substrates Using the Large Vision Model SAIC Dataset Construction Methodology and Application
:star:code - SeaS Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning
:star:code - Kaputt A Large-Scale Dataset for Visual Defect Detection
- Training-Free Industrial Defect Generation with Diffusion Models
- DADet Safeguarding Image Conditional Diffusion Models against Adversarial and Backdoor Attacks via Diffusion Anomaly Detection
- Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation
47.Animation
- LayerAnimate: Layer-level Control for Animation
- Occlusion-robust Stylization for Drawing-based 3D Animation
- Multi-Object Sketch Animation by Scene Decomposition and Motion Planning
- Animate Anyone 2 High-Fidelity Character Image Animation with Environment Affordance
- LongAnimation Long Animation Generation with Dynamic Global-Local Memory
- V2M4 4D Mesh Animation Reconstruction from a Single Monocular Video
- OmniHuman-1 Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
- Multi-identity Human Image Animation with Structural Video Diffusion
:star:code - Perception-as-Control Fine-grained Controllable Image Animation with 3D-aware Motion Representation
- DreamActor-M1 Holistic Expressive and Robust Human Image Animation with Hybrid Guidance
- Ponimator Unfolding Interactive Pose for Versatile Human-human Interaction Animation
:house:project
46.Sound
- Music Grounding by Short Video
- VGGSounder Audio-Visual Evaluations for Foundation Models
- AV-Flow Transforming Text to Audio-Visual Human-like Interactions
- MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing
:star:code - What's Making That Sound Right Now? Video-centric Audio-Visual Localization
:star:code - Implicit Counterfactual Learning for Audio-Visual Segmentation
- Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
:house:project - Zero-AVSR Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
- Not Only Vision Evolve Visual Speech Recognition via Peripheral Information
- CogCM Cognition-Inspired Contextual Modeling for Audio-Visual Speech Enhancement
- How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation
- TAViS Text-bridged Audio-Visual Segmentation with Foundation Models
- AV-Link Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
- AURELIA Test-time Reasoning Distillation in Audio-Visual LLMs
- p-AVAS Can Physics-Integrated Audio-Visual Modeling Boost Neural Acoustic Synthesis
- TARO Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis
- VAFlow Video-to-Audio Generation with Cross-Modality Flow Matching
- Shot-by-Shot Film-Grammar-Aware Training-Free Audio Description Generation
- AVTrustBench Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
- Synthetic speech detection
45.Dataset
- Context-Aware Academic Emotion Dataset and Benchmark
:star:code - ROADWork A Dataset and Benchmark for Learning to Recognize Observe Analyze and Drive Through Work Zones
- 4D-Bench Benchmarking Multi-modal Large Language Models for 4D Object Understanding
- Bias in Gender Bias Benchmarks How Spurious Features Distort Evaluation
- Benchmark
- IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark
- Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding
:star:code - One Object Multiple Lies A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models
- Beyond the Destination A Novel Benchmark for Exploration-Aware Embodied Question Answering
- JailbreakDiffBench A Comprehensive Benchmark for Jailbreaking Diffusion Models
- MMReason An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
- GRAB A Challenging GRaph Analysis Benchmark for Large Multimodal Models
- INS-MMBench A Comprehensive Benchmark for Evaluating LVLMs Performance in Insurance
:star:code - MIEB Massive Image Embedding Benchmark
:star:code - LVBench An Extreme Long Video Understanding Benchmark
- ProJudge A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
- From Abyssal Darkness to Blinding Glare A Benchmark on Extreme Exposure Correction in Real World
:star:code - Beyond Walking A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
- MultiVerse A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
- Extrapolated Urban View Synthesis Benchmark
- WorldScore A Unified Evaluation Benchmark for World Generation
:house:project - ICE-Bench A Unified and Comprehensive Benchmark for Image Creating and Editing
- MVGBench a Comprehensive Benchmark for Multi-view Generation Models
- Dataset
- Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning
- ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users
:star:code
:house:project - DiffTell A High-Quality Dataset for Describing Image Manipulation Changes
- CT-ScanGaze: A Dataset and Baselines for 3D Volumetric Scanpath Modeling
- Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions
:star:code
:star:code - HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis
:house:project - Dataset Ownership Verification for Pre-trained Masked Models
:star:code - Asynchronous Event Error-Minimizing Noise for Safeguarding Event Dataset
:star:code - BlueNeg A 35mm Negative Film Dataset for Restoring Channel-Heterogeneous Deterioration
- CMB-ML A Cosmic Microwave Background Dataset for the Oldest Possible Computer Vision Task
:star:code - UAVScenes A Multi-Modal Dataset for UAVs
:star:code - UDC-VIT A Real-World Video Dataset for Under-Display Cameras
- Towards Comprehensive Lecture Slides Understanding Large-scale Dataset and Effective Method
- R-LiViT A LiDAR-Visual-Thermal Dataset Enabling Vulnerable Road User Focused Roadside Perception
- MEH A Multi-Style Dataset and Toolkit for Advancing Egyptian Hieroglyph Recognition
- 3DRealCar An In-the-wild RGB-D Car Dataset with 360-degree Views
- PBFG A New Physically-Based Dataset and Removal of Lens Flares and Glares
- Feature Coding in the Era of Large Models Dataset Test Conditions and Benchmark
:star:code - Modeling Saliency Dataset Bias
- TrackVerse A Large-Scale Object-Centric Video Dataset for Image-Level Representation Learning
- OpenSubstance A High-quality Measured Dataset of Multi-View and -Lighting Images and Shapes
:house:project - MMAT-1M A Large Reasoning Dataset for Multimodal Agent Tuning
:star:code - ImageGem In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
- LANGTRAJ Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation
:house:project - LightCity An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions
- CULTURE3D A Large-Scale and Diverse Dataset of Cultural Landmarks and Terrains for Gaussian-Based Scene Rendering
- A Real-world Display Inverse Rendering Dataset
:house:project
- Dataset distillation
- CaO: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation
:star:code - Dataset Distillation via Vision-Language Category Prototype
:star:code - Dataset Distillation as Data Compression: A Rate-Utility Perspective
- Heavy Labels Out Dataset Distillation with Label Space Lightening
:star:code - Dataset Distillation via the Wasserstein Metric
:star:code :house:project - Diversity-Enhanced Distribution Alignment for Dataset Distillation
- Improving Noise Efficiency in Privacy-preserving Dataset Distillation
44.Neural Radiance Fields
- UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields
:house:project - LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling
:star:code - DiSCO-3D: Discovering and segmenting Sub-Concepts from Open-vocabulary queries in NeRF
- A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields
- NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement
:star:code - MuGS Multi-Baseline Generalizable Gaussian Splatting Reconstruction
:star:code - UniVerse Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction
- Rendering
- BokehDiff: Neural Lens Blur with One-Step Diffusion
- OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering
:house:project - ReCamMaster Camera-Controlled Generative Rendering from A Single Video
- Leveraging 2D Priors and SDF Guidance for Urban Scene Rendering
- Bokehlicious Photorealistic Bokeh Rendering with Controllable Apertures
- UNIS A Unified Framework for Achieving Unbiased Neural Implicit Surfaces in Volume Rendering
- Stochastic Gradient Estimation for Higher-Order Differentiable Rendering
- Learning Null Geodesics for Gravitational Lensing Rendering in General Relativity
- FonTS Text Rendering With Typography and Style Controls
- Differentiable Room Acoustic Rendering with Multi-View Vision Priors
- Inverse rendering
- Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues
- Ouroboros Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
- Neural Inverse Rendering for High-Accuracy 3D Measurement of Moving Objects with Fewer Phase-Shifting Patterns
- InvRGB+L: Inverse Rendering of Complex Scenes with Unified Color and LiDAR Reflectance Modeling
- DNF-Intrinsic Deterministic Noise-Free Diffusion for Indoor Inverse Rendering
:star:code
- NVS
- FVGen Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation
- E-NeMF Event-based Neural Motion Field for Novel Space-time View Synthesis of Dynamic Scenes
- Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis
:house:project - RayZer A Self-supervised Large View Synthesis Model
- BillBoard Splatting (BBSplat) Learnable Textured Primitives for Novel View Synthesis
- WAVE Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image
- UniGS Modeling Unitary 3D Gaussians for Novel View Synthesis from Sparse-view Images
:star:code - Scaling Transformer-Based Novel View Synthesis with Models Token Disentanglement and Synthetic Data
- SEHDR Single-Exposure HDR Novel View Synthesis via 3D Gaussian Bracketing
- RayGaussX Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis
43.Vision Language
- Improving Large Vision and Language Models by Learning from a Panel of Peers
- DASH Detection and Assessment of Systematic Hallucinations of VLMs
- Vision-Language Models Can't See the Obvious
- Web Artifact Attacks Disrupt Vision Language Models
:star:code - ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models
:star:code
:star:code - VLM4D Towards Spatiotemporal Awareness in Vision Language Models
- WalkVLM Aid Visually Impaired People Walking by Vision Language Model
- ViLU: Learning Vision-Language Uncertainties for Failure Prediction
:star:code - PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection
:star:code - One Last Attention for Your Vision-Language Model
:star:code - Hierarchical Cross-modal Prompt Learning for Vision-Language Models
:star:code - METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
:star:code - ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking
:star:code - AgroBench: Vision-Language Model Benchmark in Agriculture
:star:code - MM-IFEngine Towards Multimodal Instruction Following
- Robustifying Zero-Shot Vision Language Models by Subspaces Alignment
- FDPT Federated Discrete Prompt Tuning for Black-Box Visual-Language Models
- Griffon v2 Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
:star:code - CLIP-GS Unifying Vision-Language Representation with 3D Gaussian Splatting
- Growing a Twig to Accelerate Large Vision-Language Models
- Test-Time Retrieval-Augmented Adaptation for Vision-Language Models
:star:code - Understanding Museum Exhibits using Vision-Language Reasoning
- One Perturbation is Enough On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
- When Lighting Deceives Exposing Vision-Language Models Illumination Vulnerability Through Illumination Transformation Attack
- Target Bias Is All You Need Zero-Shot Debiasing of Vision-Language Models with Bias Corpus
- TAB Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
- Feather the Throttle Revisiting Visual Token Pruning for Vision-Language Model Acceleration
- Derm1M A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology
:star:code - ReCoT Reflective Self-Correction Training for Mitigating Confirmation Bias in Large Vision-Language Models
- AutoOcc Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting
- D-Attn Decomposed Attention for Large Vision-and-Language Model
:star:code - Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate
:star:code - Fuzzy Contrastive Decoding to Alleviate Object Hallucination in Large Vision-Language Models
- IDEATOR Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
- 25 Years in Class A Multimodal Textbook for Vision-Language Pretraining
- Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features
- FedMVP Federated Multimodal Visual Prompt Tuning for Vision-Language Models
:star:code - Physics Context Builders A Modular Framework for Physical Reasoning in Vision-Language Models
- VLRMBench A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
:star:code - ZipVL Accelerating Vision-Language Models through Dynamic Token Sparsity
- Skip-Vision Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
- SAUCE Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders
- The Inter-Intra Modal Measure A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models
:star:code - MaTVLM Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
:star:code - Safeguarding Vision-Language Models Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
:star:code - Dynamic Multimodal Prototype Learning in Vision-Language Models
- GEOBench-VLM Benchmarking Vision-Language Models for Geospatial Tasks
- Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models
- V2PE Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
- DexVLG Dexterous Vision-Language-Grasp Model at Scale
- Vision-Language Neural Graph Featurization for Extracting Retinal Lesions
- MotionCtrl A Real-time Controllable Vision-Language-Motion Model
- Breaking the Encoder Barrier for Seamless Video-Language Understanding
- OphCLIP Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
- How Can Objects Help Video-Language Understanding
:star:code - Factorized Learning for Temporally Grounded Video-Language Models
:star:code - Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models
- AdvDreamer Unveils Are Vision-Language Models Truly Ready for Real-World 3D Variations
- HQ-CLIP Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models
- Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
- The Scalability of Simplicity Empirical Analysis of Vision-Language Learning with a Single Transformer
:star:code - EVEv2 Improved Baselines for Encoder-Free Vision-Language Models
:star:code - TruthPrInt Mitigating Large Vision-Language Models Object Hallucination Via Latent Truthful-Guided Pre-Intervention
- Structured Policy Optimization Enhance Large Vision-Language Model via Self-referenced Dialogue
- Causality-guided Prompt Learning for Vision-language Models via Visual Granulation
:star:code - CalliReader Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model
- Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma
- Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images
- Uncertainty-Driven Expert Control Enhancing the Reliability of Medical Vision-Language Models
- Dynamic Multi-Layer Null Space Projection for Vision-Language Continual Learning
- Learning Beyond Still Frames Scaling Vision-Language Models with Video
- GLEAM Enhanced Transferable Adversarial Attacks for Vision-Language Pre-training Models via Global-Local Transformations
:star:code - INTER Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
:star:code - SmolDocling An ultra-compact vision-language model for end-to-end multi-modal document conversion
- VLN
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities
:star:code - monoVLN Bridging the Observation Gap between Monocular and Panoramic Vision and Language Navigation
- NavQ Learning a Q-Model for Foresighted Vision-and-Language Navigation
- COSMO Combination of Selective Memorization for Low-cost Vision-and-Language Navigation
:star:code - NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
:star:code - 3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation
- LLM
- LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching
- Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching
- Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
:house:project - Why LVLMs Are More Prone to Hallucinations in Longer Responses The Role of Context
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces
:star:code - Advancing Visual Large Language Model for Multi-granular Versatile Perception
:star:code - DisTime Distribution-based Time Representation for Video Large Language Models
:star:code - Aligning Effective Tokens with Video Anomaly in Large Language Models
- MeshLLM Empowering Large Language Models to Progressively Understand and Generate 3D Mesh
- FOLDER Accelerating Multi-Modal Large Language Models with Enhanced Performance
:star:code - B-VLLM A Vision Large Language Model with Balanced Spatio-Temporal Tokens
- Robin3D Improving 3D Large Language Model via Robust Instruction Tuning
- GenieBlue Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
- CATP-LLM Empowering Large Language Models for Cost-Aware Tool Planning
:star:code - Multimodal LLM Guided Exploration and Active Mapping using Fisher Information
- Multimodal Large Language Model-Guided ISP Hyperparameter Optimization with Dynamic Preference Learning
- Aligning Vision to Language Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning
:star:code
- MLLM
- Token Activation Map to Visually Explain Multimodal LLMs
:star:code - DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs
:star:code - UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
:star:code - Kestrel 3D Multimodal LLM for Part-Aware Grounded Description
- Are They the Same Exploring Visual Correspondence Shortcomings of Multimodal LLMs
- Analyzing Finetuning Representation Shift for Multimodal LLMs Steering
- Visual Chronicles Using Multimodal LLMs to Analyze Massive Collections of Images
- Controlling Multimodal LLMs via Reward-guided Decoding
- TWIST SCOUT Grounding Multimodal LLM-Experts by Forget-Free Tuning
- FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging
- Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation
- BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models
- Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning
:star:code - Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
- CompCap Improving Multimodal Large Language Models with Composite Captions
- AVAM a Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering
- How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning Placing Them in An Extensible Escape Game
- LLaVA-KD A Framework of Distilling Multimodal Large Language Models
- LIRA Reasoning Reconstruction via Multimodal Large Language Models
:star:code - MissRAG Addressing the Missing Modality Challenge in Multimodal Large Language Models
:star:code - Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models
:star:code - Benchmarking Multimodal Large Language Models Against Image Corruptions
- SHIFT Smoothing Hallucinations by Information Flow Tuning for Multimodal Large Language Models
- Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
- VisNumBench Evaluating Number Sense of Multimodal Large Language Models
- ShortV Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
:star:code - Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
:star:code - Learning to Inference Adaptively for Multimodal Large Language Models
- FALCON Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
- R1-VL Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
- Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles
- Boosting MLLM Reasoning with Text-Debiased Hint-GRPO
:star:code - Information Density Principle for MLLM Benchmarks
- Auto-Controlled Image Perception in MLLMs via Visual Perception Tokens
- VSP Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs
- MM-Spatial Exploring 3D Spatial Understanding in Multimodal LLMs
- Spatial Preference Rewarding for MLLMs Spatial Understanding
:star:code - SparseMM Head Sparsity Emerges from Visual Concept Responses in MLLMs
:star:code - OrderChain Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM
:house:project - STI-Bench Are MLLMs Ready for Precise Spatial-Temporal World Understanding
- ChartPoint Guiding MLLMs with Grounding Reflection for Chart Reasoning
- Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning
:star:code - p-MoD Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
- LLaVA-SP Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
:star:code - Enhancing Numerical Prediction of MLLMs with Soft Labeling
- Creation-MMBench Assessing Context-Aware Creative Intelligence in MLLMs
- Visual Grounding
- PropVG End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
- Move to Understand a 3D Scene Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
- MC-Bench A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
:house:project - AerialVG A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
:star:code - NAVER A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
:star:code - VGMamba Attribute-to-Location Clue Reasoning for Quantity-Agnostic 3D Visual Grounding
- Region-aware Anchoring Mechanism for Efficient Referring Visual Grounding
- REC
42.Vision Transformer
- Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features
:star:code - Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy
- EA-ViT: Efficient Adaptation for Elastic Vision Transformer
:star:code - MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective
- OminiControl Minimal and Universal Control for Diffusion Transformer
- Pinco Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting
- SAFER Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers
- OmniCache A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
- Sparse Fine-Tuning of Transformers for Generative Tasks
- MaTe Images Are All You Need for Material Transfer via Diffusion Transformer
- Hybrid Layout Control for Diffusion Transformer Fewer Annotations Superior Aesthetics
- UniCombine Unified Multi-Conditional Combination with Diffusion Transformer
- EasyControl Adding Efficient and Flexible Control for Diffusion Transformer
- Accelerating Diffusion Transformer via Gradient-Optimized Cache
:star:code - LeGrad An Explainability Method for Vision Transformers via Feature Formation Sensitivity
- An Efficient Hybrid Vision Transformer for TinyML Applications
:star:code - MixA A Mixed Attention approach with Stable Lightweight Linear Attention to enhance Efficiency of Vision Transformers at the Edge
41.Neural Architecture Search
- Neural Architecture Search Driven by Locally Guided Diffusion for Personalized Federated Learning
- Loss Functions for Predictor-based Neural Architecture Search
- TRNAS A Training-Free Robust Neural Architecture Search
40.Deep learning
- Capsule network
- RNN
39.Machine learning
- Machine unlearning
- MUNBa Machine Unlearning via Nash Bargaining
- Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels
- Learning to Unlearn while Retaining Combating Gradient Conflicts in Machine Unlearning
- Reminiscence Attack on Residuals Exploiting Approximate Machine Unlearning for Privacy
- Active learning
- Contrastive learning
- Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision
- Selective Contrastive Learning for Weakly Supervised Affordance Grounding
- Fix-CLIP Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
:star:code - Robust Dataset Condensation using Supervised Contrastive Learning
:star:code - Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning
:star:code - Backdooring Self-Supervised Contrastive Learning by Noisy Alignment
:star:code - Salvaging the Overlooked Leveraging Class-Aware Contrastive Learning for Multi-Class Anomaly Detection
- AMD Adaptive Momentum and Decoupled Contrastive Learning Framework for Robust Long-Tail Trajectory Prediction
- Reinforcement learning
- RL-Selector: Reinforcement Learning-Guided Data Selection via Redundancy Assessment
- RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction
:star:code - DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
:star:code - DeepMesh Auto-Regressive Artist-mesh Creation with Reinforcement Learning
- ULTHO Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning
- Disentangled World Models Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
- One Encoder to Rule them All Representation Learning for Model-free Visual Reinforcement Learning using Fourier Neural Operators
- Diffusion Guided Adaptive Augmentation for Generalization in Visual Reinforcement Learning
- GenFlowRL Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
:house:project :house:project
- Continual learning
- CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization
:star:code - PROL: Rehearsal Free Continual Learning in Streaming Data via Prompt Online Learning
:star:code - Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning
:star:code - RainbowPrompt: Diversity-Enhanced Prompt-Evolving for Continual Learning
- Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models
- Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning
:star:code - Any-SSR How Recursive Least Squares Works in Continual Learning of Large Language Model
:star:code - Joint Diffusion Models in Continual Learning
- PLAN Proactive Low-Rank Allocation for Continual Learning
- CODE-CL Conceptor-Based Gradient Projection for Deep Continual Learning
- FedAGC Federated Continual Learning with Asymmetric Gradient Correction
- Adversarial learning
- TITAN Query-Token based Domain Adaptive Adversarial Learning
:star:code - ZIUM: Zero-Shot Intent-Aware Adversarial Attack on Unlearned Models
- Pretend Benign A Stealthy Adversarial Attack by Exploiting Vulnerabilities in Cooperative Perception
- KOEnsAttack Towards Efficient Data-Free Black-Box Adversarial Attacks via Knowledge-Orthogonalized Substitute Ensembles
- SMP-Attack Boosting the Transferability of Feature Importance-based Adversarial Attack with Semantics-aware Multi-granularity Patchout
:star:code - DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion
:star:code - Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights
:star:code - Towards a 3D Transfer-based Black-box Attack via Critical Feature Guidance
:star:code - Boosting Adversarial Transferability via Residual Perturbation Attack
:star:code - Confound from All Sides Distill with Resilience Multi-Objective Adversarial Paths to Zero-Shot Robustness
- Adversarial Training for Probabilistic Robustness
- Mitigating Catastrophic Overfitting in Fast Adversarial Training via Label Information Elimination
:star:code - Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment
:star:code - Adversarial Exploitation of Data Diversity Improves Visual Localization
- FedPall Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift
- Adversarial Robust Memory-Based Continual Learner
- ViT-EnsembleAttack Augmenting Ensemble Models for Stronger Adversarial Transferability in Vision Transformers
:star:code - CIARD Cyclic Iterative Adversarial Robustness Distillation
:star:code - Failure Cases Are Better Learned But Boundary Says Sorry Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training
:star:code - Backdoor Mitigation by Distance-Driven Detoxification
- Mind the Cost of Scaffold Benign Clients May Even Become Accomplices of Backdoor Attack
- Prototype Guided Backdoor Defense via Activation Space Manipulation
- Leveraging Spatial Invariance to Boost Adversarial Transferability
:star:code - SPD Shallow Backdoor Protecting Deep Backdoor Against Backdoor Detection
:star:code - Backdoor Defense via Enhanced Splitting and Trap Isolation
- Backdoor Attacks on Neural Networks via One-Bit Flip
- Seal Your Backdoor with Variational Defense
- Enhancing Adversarial Transferability by Balancing Exploration and Exploitation with Gradient-Guided Sampling
:star:code - Enhancing Transferability of Targeted Adversarial Examples via Inverse Target Gradient Competition and Spatial Distance Stretching
- Boosting Adversarial Transferability via Negative Hessian Trace Regularization
- Unified Adversarial Augmentation for Improving Palmprint Recognition
- DIA The Adversarial Exposure of Deterministic Inversion in Diffusion Models
- Generative Adversarial Diffusion
- ODDR Outlier Detection Dimension Reduction Based Defense Against Adversarial Patches
- Scaling and Taming Adversarial Training with Synthetic Data
- Multimodal learning
- GD: Boosting Multimodal Learning with Gradient-Guided Distillation
:star:code - Improving Multimodal Learning via Imbalanced Learning
:star:code - SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality
- Unbiased Missing-modality Multimodal Learning
:house:project - Boosting Multimodal Learning via Disentangled Gradient Learning
:star:code - OpenVision A Fully-Open Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
- Multi-task learning
- Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning
- Beyond Losses Reweighting Empowering Multi-Task Learning via the Generalization Perspective
- Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning
:star:code - TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction
- ModalTune Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology
- Active Membership Inference Test (aMINT) Enhancing Model Auditability with Multi-Task Learning
:star:code
- Class-incremental learning
- Revisiting Pool-based Prompt Learning for Few-shot Class-incremental Learning
:star:code - Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning
:star:code - Achieving More with Less Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning
- Lark Low-Rank Updates After Knowledge Localization for Few-shot Class-Incremental Learning
- A Tiny Change A Giant Leap Long-Tailed Class-Incremental Learning via Geometric Prototype Alignment
:star:code - Task-Aware Prompt Gradient Projection for Parameter-Efficient Tuning Federated Class-Incremental Learning
- External Knowledge Injection for CLIP-Based Class-Incremental Learning
:star:code - ESSENTIAL Episodic and Semantic Memory Integration for Video Class-Incremental Learning
- Flexi-FSCIL Adaptive Knowledge Retention for Breaking the Stability-Plasticity Dilemma in Few-Shot Class-Incremental Learning
- Seeing 3D Through 2D Lenses 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification
- Feature Decomposition-Recomposition in Large Vision-Language Model for Few-Shot Class-Incremental Learning
- Incremental learning
- Federated learning
- Federated Representation Angle Learning
- Client2Vec Improving Federated Learning by Distribution Shifts Aware Client Indexing
:star:code - Geminio Language-Guided Gradient Inversion Attacks in Federated Learning
- Sibai A Few-Shot Meta-Classifier for Poisoning Detection in Federated Learning
- You Are Your Own Best Teacher Achieving Centralized-level Performance in Federated Learning under Heterogeneous and Long-tailed Data
- Personalized Federated Learning under Local Supervision
:star:code - FedWSQ Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization
- FedXDS Leveraging Model Attribution Methods to counteract Data Heterogeneity in Federated Learning
:star:code - FLSeg Enhancing Privacy and Robustness in Federated Learning under Heterogeneous Data via Model Segmentation
- Find a Scapegoat Poisoning Membership Inference Attack and Defense to Federated Learning
- Forgetting Through Transforming Enabling Federated Unlearning via Class-Aware Representation Transformation
:star:code - Latte Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning
:star:code - Federated unlearning
- Meta-learning
- Out-of-Distribution Detection
- Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention
- NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection
:star:code - FEVER-OOD Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection
- Beyond Pixel Uncertainty Bounding the OoD Objects in Road Scenes
:star:code - ODP-Bench Benchmarking Out-of-Distribution Performance Prediction
- A Unified Interpretation of Training-Time Out-of-Distribution Detection
- Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection
:star:code - Activation Subspaces for Out-of-Distribution Detection
- Diagnosing Pretrained Models for Out-of-distribution Detection
- Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection
- DisCoPatch Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection
- Secure On-Device Video OOD Detection Without Backpropagation
:star:code - FA Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection
:star:code - Adaptive Prompt Learning via Gaussian Outlier Synthesis for Out-of-distribution Detection
- Auxiliary Prompt Tuning of Vision-Language Models for Few-Shot Out-of-Distribution Detection
- Anomaly detection
- Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts
:house:project - DecAD Decoupling Anomalies in Latent Space for Multi-Class Unsupervised Anomaly Detection
- Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning
- Wave-MambaAD Wavelet-driven State Space Model for Multi-class Unsupervised Anomaly Detection
- Debiasing Trace Guidance Top-down Trace Distillation and Bottom-up Velocity Alignment for Unsupervised Anomaly Detection
- MultiADS Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning
- Triad Empowering LMM-based Anomaly Detection with Expert-guided Region-of-Interest Tokenizer and Manufacturing Process
- SALAD -- Semantics-Aware Logical Anomaly Detection
:star:code - SiM3D Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
- Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection
- Representation learning
- Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM) A Task-Adaptive Representation Learning Framework
- LayerLock Non-collapsing Representation Learning with Progressive Freezing
- CARL Causality-guided Architecture Representation Learning for an Interpretable Performance Predictor
- Pretrained Reversible Generation as Unsupervised Visual Representation Learning
:house:project - Region-based Cluster Discrimination for Visual Representation Learning
:star:code - Gradient Extrapolation for Debiased Representation Learning
:house:project - Scaling Language-Free Visual Representation Learning
:star:code - Q-Norm Robust Representation Learning via Quality-Adaptive Normalization
:star:code - Scaling Omni-modal Pretraining with Multimodal Context Advancing Universal Representation Learning Across Modalities
- Prompt learning
38.Few/Zero-Shot Learning/DG/Adaptation
- Zero-shot
- Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model
:star:code - OBSER: Object-Based Sub-Environment Recognition for Zero-Shot Environmental Inference
- Language-Driven Multi-Label Zero-Shot Learning with Semantic Granularity
- A Conditional Probability Framework for Compositional Zero-shot Learning
- SVIP Semantically Contextualized Visual Patches for Zero-Shot Learning
:star:code - Learning Visual Proxy for Compositional Zero-Shot Learning
- Verbalized Representation Learning for Interpretable Few-Shot Generalization
- Hierarchical Variational Test-Time Prompt Generation for Zero-Shot Generalization
- Few-shot
- AD
- DG
- Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations
- Boosting Domain Generalized and Adaptive Detection with Diffusion Models Fitness Generalization and Transferability
- Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponent-Guided Optimization
- AdaDCP Learning an Adapter with Discrete Cosine Prior for Clear-to-Adverse Domain Generalization
- ConstStyle Robust Domain Generalization with Unified Style Transformation
- Split-and-Combine Enhancing Style Augmentation for Single Domain Generalization
- Federated Domain Generalization with Domain-specific Soft Prompts Generation
- What's in a Latent Leveraging Diffusion Latent Space for Domain Generalization
:house:project - Customizing Domain Adapters for Domain Generalization
- Unsupervised
- Self-supervised
- Prototype-based Contrastive Learning with Stage-wise Progressive Augmentation for Self-Supervised Fine-Grained Learning
:star:code - MoSiC Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning
:star:code - Self-supervised Learning of Hybrid Part-aware 3D Representations of 2D Gaussians and Superquadrics
- Adversarial Robustness of Discriminative Self-Supervised Learning in Vision
- Weakly supervised
- Semi-supervised
- Semi-ViM Bidirectional State Space Model for Mitigating Label Imbalance in Semi-Supervised Learning
- CaliMatch Adaptive Calibration for Improving Safe Semi-supervised Learning
- Learnable Logit Adjustment for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch
- SemiVisBooster Boosting Semi-Supervised Learning for Fine-Grained Classification through Pseudo-Label Semantic Guidance
:star:code - Semi-supervised Deep Transfer for Regression without Domain Alignment
- Semi-supervised Concept Bottleneck Models
37.Model Compression/Knowledge Distillation/Pruning
- Representation Shift: Unifying Token Compression with FlashAttention
:star:code - Dynamic-VLM Simple Dynamic Visual Token Compression for VideoLLM
- AirCache Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
- DiTFastAttnV2 Head-wise Attention Compression for Multi-Modality Diffusion Transformers
- Pruning
- Partial Forward Blocking: A Novel Data Pruning Paradigm for Lossless Training Acceleration
- Variance-Based Pruning for Accelerating and Compressing Trained Networks
- VFlowOpt A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
- WINS Winograd Structured Pruning for Fast Winograd Convolution
- Keyframe-oriented Vision Token Pruning Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
- Pruning All-Rounder Rethinking and Improving Inference Efficiency for Large Vision Language Models
:star:code - AIM Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
:star:code - FastVAR Linear Visual Autoregressive Modeling via Cached Token Pruning
- Beyond Text-Visual Attention Exploiting Visual Cues for Effective Token Pruning in VLMs
:star:code - MosaicDiff Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics
- Quantization
- AHCPTQ Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model
- Moment Quantization for Video Temporal Grounding
:star:code - MSQ Memory-Efficient Bit Sparsification Quantization
- Task Vector Quantization for Memory-Efficient Model Merging
:house:project - OuroMamba A Data-Free Quantization Framework for Vision Mamba
:star:code - Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
- HUST High-Fidelity Unbiased Skin Tone Estimation via Texture Quantization
- D3QE Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection
:star:code - Scheduling Weight Transitions for Quantization-Aware Training
- SSVQ Unleashing the Potential of Vector Quantization with Sign-Splitting
:star:code - ViM-VQ Efficient Post-Training Vector Quantization for Visual Mamba
- Scalable Image Tokenization with Index Backpropagation Quantization
- Allowing Oscillation Quantization Overcoming Solution Space Limitation in Low Bit-Width Quantization
:star:code - QuEST Low-bit Diffusion Model Quantization via Efficient Selective Finetuning
- Memory-Efficient Generative Models via Product Quantization
- DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization
:star:code
- KD
- Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting
:star:code - Local Dense Logit Relations for Enhanced Knowledge Distillation
- Inference-Time Diffusion Model Distillation
- Cross-Architecture Distillation Made Simple with Redundancy Suppression
- Evidential Knowledge Distillation
:star:code - EA-KD Entropy-based Adaptive Knowledge Distillation
:star:code - CleanPose Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation
:star:code - Knowledge Distillation with Refined Logits
:star:code - What to Distill Fast Knowledge Distillation with Adaptive Sampling
- VRM Knowledge Distillation via Virtual Relation Matching
- Photolithography Overlay Map Generation with Implicit Knowledge Distillation Diffusion Transformer
- Coupling the Generator with Teacher for Effective Data-Free Knowledge Distillation
- ACAM-KD Adaptive and Cooperative Attention Masking for Knowledge Distillation
- Enhanced Event-based Dense Stereo via Cross-Sensor Knowledge Distillation
- Perspective-Aware Teaching Adapting Knowledge for Heterogeneous Distillation
:star:code - Fuse Before Transfer Knowledge Fusion for Heterogeneous Distillation
:star:code - A Good Teacher Adapts Their Knowledge for Distillation
36.Scene Graph Generation
- SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning
- Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images
:star:code
:star:code - FROSS Faster-Than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images
:star:code - TRKT Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring
:star:code - End-to-End Entity-Predicate Association Reasoning for Dynamic Scene Graph Generation
:star:code - Vision-Language Interactive Relation Mining for Open-Vocabulary Scene Graph Generation
35.Style Transfer
- Domain Generalizable Portrait Style Transfer
:star:code - Tune-Your-Style Intensity-tunable 3D Style Transfer with Gaussian Splatting
:house:project
34.Object Pose Estimation
- Deterministic Object Pose Confidence Region Estimation
- BoxDreamer Dreaming Box Corners for Generalizable Object Pose Estimation
- MixRI Mixing Features of Reference Images for Novel Object Pose Estimation
- Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures
:star:code - SDFit 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image
:house:project - Counting
- Re-identification
- 6DoF
- GraspCoT Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
:star:code - Environment-Agnostic Pose Generating Environment-independent Object Representations for 6D Pose Estimation
:star:code :house:project - Ultra-Precision 6DoF Pose Estimation Using 2-D Interpolated Discrete Fourier Transform
- Joint Learning of Pose Regression and Denoising Diffusion with Score Scaling Sampling for Category-level 6D Pose Estimation
- RayPose Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation
- 6DOPE-GS Online 6D Object Pose Estimation using Gaussian Splatting
33.Keypoint Detection
- Keypoint detection
32.Image Registration
31.Image Matching
- HOMO-Feature Cross-Arbitrary-Modal Image Matching with Homomorphism of Organized Major Orientation
:star:code - Towards Open-World Generation of Stereo Images and Unsupervised Matching
:house:project - ArgMatch Adaptive Refinement Gathering for Efficient Dense Matching
:star:code - CoMatch Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching
:star:code - Balanced Image Stylization with Style Matching Score
- Feature Matching
- Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching
- CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance
:star:code - DATA: Domain-And-Time Alignment for High-Quality Feature Fusion in Collaborative Perception
:star:code - Focal Plane Visual Feature Generation and Matching on a Pixel Processor Array
- Towards Efficient General Feature Prediction in Masked Skeleton Modeling
- Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space
- SGAD Semantic and Geometric-aware Descriptor for Local Feature Matching
- EDM Efficient Deep Feature Matching
:star:code
30.Image Fusion
- DreamFuse Adaptive Image Fusion with Diffusion Transformer
- MMAIF Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance
- Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion
- Revisiting Image Fusion for Multi-Illuminant White-Balance Correction
- Retinex-MEF Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image Fusion
:star:code - Highlight What You Want Weakly-Supervised Instance-Level Controllable Infrared-Visible Image Fusion
:star:code - AMDANet Attention-Driven Multi-Perspective Discrepancy Alignment for RGB-Infrared Image Fusion and Segmentation
- LUT-Fuse Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables
:star:code - The Source Image is the Best Attention for Infrared and Visible Image Fusion
29.Deepfake Detection/AI-Generated Image Detection
- Seeing Through Deepfakes A Human-Inspired Framework for Multi-Face Detection
- Generalization-Preserved Learning Closing the Backdoor to Catastrophic Forgetting in Continual Deepfake Detection
- Open-Unfairness Adversarial Mitigation for Generalized Deepfake Detection
:star:code - Image forgery localization/detection
- M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization
- Spatial-Temporal Forgery Trace based Forgery Image Identification
- ForgeLens Data-Efficient Forgery Focus for Generalizable Forgery Image Detection
- Semantic Discrepancy-aware Detector for Image Forgery Identification
:star:code - FakeRadar Probing Forgery Outliers to Detect Unknown Deepfake Videos
- ADCD-Net Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement
:star:code
- AI-generated image detection
- AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models
- Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions
- D3 Training-Free AI-Generated Video Detection Using Second-Order Features
:star:code - Bridging the Gap Between Ideal and Real-world Evaluation Benchmarking AI-Generated Image Detection in Challenging Scenarios
- LOTA Bit-Planes Guided AI-Generated Image Detection
:star:code
- Video forgery detection
- HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly
:star:code - Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection
:star:code - Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection
:star:code - DeepShield Fortifying Deepfake Video Detection with Local and Global Forgery Analysis
- Synthetic image detection
- LEGION Learning to Ground and Explain for Synthetic Image Detection
:house:project - Forensic-MoE Exploring Comprehensive Synthetic Image Detection Traces with Mixture of Experts
:star:code - MCID Multi-aspect Copyright Infringement Detection for Generated Images
- Diffusion Epistemic Uncertainty with Asymmetric Learning for Diffusion-Generated Image Detection
- Image copy detection
28.Optical Flow Estimation
- MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation
:star:code - FlowSeek Optical Flow Made Easier with Depth Foundation Models and Motion Bases
- PriOr-Flow Enhancing Primitive Panoramic Optical Flow with Orthogonal View
- Flow4Agent Long-form Video Understanding via Motion Prior from Optical Flow
- EMatch A Unified Framework for Event-based Optical Flow and Stereo Matching
- Removing Cost Volumes from Optical Flow Estimators
- Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
:star:code :house:project
27.Visual Question Answering
- SplatTalk 3D VQA with Gaussian Splatting
- ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
- ToolVQA A Dataset for Multi-step Reasoning VQA with External Tools
:star:code - SimpleVQA Multimodal Factuality Evaluation for Multimodal Large Language Models
- Overcoming Dual Drift for Continual Long-Tailed Visual Question Answering
- Ask and Remember A Questions-Only Replay Strategy for Continual Visual Question Answering
- Video-QA
- Math problem solving
26.Robot
- GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization
:star:code - AR-1-to-3 Single Image to Consistent 3D Object via Next-View Prediction
- RoboFactory Exploring Embodied Agent Collaboration with Compositional Constraints
- Embodied Representation Alignment with Mirror Neurons
- UnrealZoo Enriching Photo-realistic Virtual Worlds for Embodied AI
- NormalLoc Visual Localization on Textureless 3D Models using Surface Normals
- Semantic-guided Camera Ray Regression for Visual Localization
- Virtual try-on
- OmniVTON: Training-Free Universal Virtual Try-On
:star:code - PromptDresser Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask
- Learning Implicit Features with Flow-Infused Transformations for Realistic Virtual Try-On
- All Parts Matter A Unified Mask-Free Virtual Try-On Framework
- TryOn-Refiner Conditional Rectified-flow-based TryOn Refiner for More Accurate Detail Reconstruction
- Robotics
- RoboPearls: Editable Video Simulation for Robot Manipulation
- AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning
:star:code - Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding
- Recognizing Actions from Robotic View for Natural Human-Robot Interaction
:star:code - Moto Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
- IRASim A Fine-Grained World Model for Robot Manipulation
:house:project - On-Device Diffusion Transformer Policy for Efficient Robot Manipulation
- RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
:star:code - PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation
- Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
:house:project - iManip Skill-Incremental Learning for Robotic Manipulation
- RoBridge A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation
- EC-Flow Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow
:house:project - A0 An Affordance-Aware Hierarchical Model for General Robotic Manipulation
- Rethinking Bimanual Robotic Manipulation Learning with Decoupled Interaction Framework
- GWM Towards Scalable Gaussian World Models for Robotic Manipulation
- FedVLA Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
- Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control
- Learning 4D Embodied World Models
- VLABench A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
- RobAVA A Large-scale Dataset and Baseline Towards Video based Robotic Arm Action Understanding
- RoboAnnotatorX A Comprehensive and Universal Annotation Framework for Accurate Understanding of Long-horizon Robot Demonstration
- 4D Visual Pre-training for Robot Learning
:house:project - G-DexGrasp Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation
- DexH2R A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover
- OVA-Fields Weakly Supervised Open-Vocabulary Affordance Fields for Robot Operational Part Detection
- AnyBimanual Transferring Unimanual Policy for General Bimanual Manipulation
- DyWA Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
- SD2Actor Continuous State Decomposition via Diffusion Embeddings for Robotic Manipulation
- Diffusion-Based Imaginative Coordination for Bimanual Manipulation
:star:code - RoboTron-Mani All-in-One Multimodal Large Model for Robotic Manipulation
- Object Discovery
- SLAM
- Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps
:star:code - DyGS-SLAM Real-Time Accurate Localization and Gaussian Reconstruction for Dynamic Scenes
- 4D Gaussian Splatting SLAM
- ToF-Splatting Dense SLAM using Sparse Time-of-Flight Depth and Multi-Frame Integration
- Benchmarking Egocentric Visual-Inertial SLAM at City Scale
- SuperEvent Cross-Modal Learning of Event-based Keypoint Detection for SLAM
:house:project - SEGS-SLAM Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding
:house:project - Underwater Visual SLAM with Depth Uncertainty and Medium Modeling
- Navigation
- IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation
:star:code
:star:code - Embodied Navigation with Auxiliary Task of Action Description Prediction
- EmbodiedSplat Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device
:house:project - GUIOdyssey A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
- Learning on the Go A Meta-learning Object Navigation Model
- RoboTron-Nav A Unified Framework for Embodied Navigation Integrating Perception Planning and Prediction
- MoMa-Kitchen A 100K Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation
- Active Perception Meets Rule-Guided RL A Two-Phase Approach for Precise Object Navigation in Complex Environments
:star:code - SAME Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
- Collaborative Instance Object Navigation Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues
- CogNav Cognitive Process Modeling for Object Goal Navigation with LLMs
- DialNav Multi-turn Dialog Navigation with a Remote Guide
:house:project - LookOut Real-World Humanoid Egocentric Navigation
- CityNav A Large-Scale Dataset for Real-World Aerial Navigation
- Function-centric Bayesian Network for Zero-Shot Object Goal Navigation
- Visual Place Recognition
25.Human-Object Interaction Detection
- Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss
:star:code - Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
- Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers
:star:code - HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation
:star:code - Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting
- SyncDiff Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis
- PrimHOI Compositional Human-Object Interaction via Reusable Primitives
- No More Sibling Rivalry Debiasing Human-Object Interaction Detection
- Human-Object Interaction from Human-Level Instructions
- Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration
:star:code - Visual Relation Diffusion for Human-Object Interaction Detection
- HUMOTO A 4D Dataset of Mocap Human Object Interactions
:house:project :house:project - ScoreHOI Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion
- Hand-Object Interaction
- Human-Scene Interaction
24.Autonomous Driving
- CoDa-4DGS Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving
- Epona: Autoregressive Diffusion World Model for Autonomous Driving
:star:code
:star:code - AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving
- World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model
:star:code - 3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation
:star:code - Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge
- GS-Occ3D: Scaling Vision-only Occupancy Reconstruction for Autonomous Driving with Gaussian Splatting
:star:code - From Gaze to Movement Predicting Visual Attention for Autonomous Driving Human-Machine Interaction based on Programmatic Imitation Learning
- MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model
:star:code - VLR-Driver Large Vision-Language-Reasoning Models for Embodied Autonomous Driving
- RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case
:star:code
:star:code - OD-RASE Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving
- UniMLVG Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
- MagicDrive-V2 High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
:house:project :house:project - UniOcc A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
:house:project - ConsistentCity Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis
- U-ViLAR Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration
- Towards Visual Localization Interoperability Cross-Feature for Collaborative Visual Localization and Mapping
- SynAD Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration
- Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
:star:code - ORION A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
- Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving
- DriveX Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving
- CoLMDriver LLM-based Negotiation Benefits Cooperative Autonomous Driving
:star:code - AdaDrive Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving
:star:code - HiP-AD Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder
- TAD-E2E A Large-scale End-to-end Autonomous Driving Dataset
- DistillDrive End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model
:star:code - CARIM Caption-Based Autonomous Driving Scene Retrieval via Inclusive Text Matching
- Passing the Driving Knowledge Test
- DriveArena A Closed-loop Generative Simulation Platform for Autonomous Driving
- Hints of Prompt Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
- Are VLMs Ready for Autonomous Driving An Empirical Study from the Reliability Data and Metric Perspectives
- VLDrive Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
:star:code - RoboTron-Drive All-in-One Large Multimodal Model for Autonomous Driving
- ReAL-AD Towards Human-Like Reasoning in End-to-End Autonomous Driving
- V2XPnP Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction
- ETA Efficiency through Thinking Ahead A Dual Approach to Self-Driving with Large Models
- DrivingGPT Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
- V2XScenes A Multiple Challenging Traffic Conditions Dataset for Large-Range Vehicle-Infrastructure Collaborative Perception
- Hydra-NeXt Robust Closed-Loop Driving with Open-Loop Training
- Driving View Synthesis on Free-form Trajectories with Generative Prior
- DiST-4D Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation
:house:project - 3D Occupancy
- Occupancy Learning with Spatiotemporal Memory
:star:code - Semantic Causality-Aware Vision-Based 3D Occupancy Prediction
- GaussRender Learning 3D Occupancy with Gaussian Rendering
- SA-Occ Satellite-Assisted 3D Occupancy Prediction in Real World
- EmbodiedOcc Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
:star:code - AGO Adaptive Grounding for Open World 3D Occupancy Prediction
:star:code - GaussianFlowOcc Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
- GaussianOcc Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
- Trajectory Prediction
- Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics
- Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model
- End-to-End Driving with Online Trajectory Evaluation via BEV World Model
:star:code - TOTP Transferable Online Pedestrian Trajectory Prediction with Temporal-Adaptive Mamba Latent Diffusion
- Resonance Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations
- NATRA Noise-Agnostic Framework for Trajectory Prediction with Noisy Observations
- DONUT A Decoder-Only Model for Trajectory Prediction
- VLA
- CoA-VLA Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance
- VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
:star:code - CombatVLA An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
:house:project - Dita Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
- Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics
- Towards Long-Horizon Vision-Language-Action System Reasoning Acting and Memory
- Occupancy Prediction
- Language Driven Occupancy Prediction
- MergeOcc Bridge the Domain Gap between Different LiDARs for Robust Occupancy Prediction
- MCOP Multi-UAV Collaborative Occupancy Prediction
- RIOcc Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction
- ALOcc Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions
- Re-Identification
- Lane Detection
- Vehicle Surveillance
23.Point Cloud
- HVPUNet Hybrid-Voxel Point-cloud Upsampling Network
- GAP: Gaussianize Any Point Clouds with Text Guidance
:star:code
:star:code - StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
- PointGAC: Geometric-Aware Codebook for Masked Point Cloud Modeling
:star:code - Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning
:star:code - LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression
:star:code - Blended Point Cloud Diffusion for Localized Text-guided Shape Editing
:star:code - UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis
:star:code - Efficient Spiking Point Mamba for Point Cloud Analysis
:star:code - Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints
- Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs
:star:code
:star:code - UST-SSM Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling
:star:code - Omni-scene Perception-oriented Point Cloud Geometry Enhancement for Coordinate Quantization
- A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds
- GenFlow3D Generative Scene Flow Estimation and Prediction on Point Cloud Sequences
:star:code - Feature Extraction and Representation of Pre-training Point Cloud Based on Diffusion Models
- Leaps and Bounds An Improved Point Cloud Winding Number Formulation for Fast Normal Estimation and Surface Reconstruction
- Serialization based Point Cloud Oversegmentation
:star:code - Towards More Diverse and Challenging Pre-training for Point Cloud Learning Self-Supervised Cross Reconstruction with Decoupled Views
:star:code - Liberated-GS 3D Gaussian Splatting Independent from SfM Point Clouds
- CAD-Recode Reverse Engineering CAD Code from Point Clouds
- Constraint-Aware Feature Learning for Parametric Point Cloud
- Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance
- DiffPCI Large Motion Point Cloud frame Interpolation with Diffusion Model
- Mixed Signals A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration
- DAP-MAE Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning
:star:code - Interpretable point cloud classification using multiple instance learning
- CounterPC Counterfactual Feature Realignment for Unsupervised Domain Adaptation on Point Clouds
- Partially Matching Submap Helps Uncertainty Modeling and Propagation for Text to Point Cloud Localization
:star:code - DiffRefine Diffusion-based Proposal Specific Point Cloud Densification for Cross-Domain Object Detection
- Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner
- RARE Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning
:star:code - 3D Point Cloud
- Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation
- FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction
- ForestFormer3D A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds
:house:project - Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds
:star:code - GroundFlow A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
- Tree Skeletonization from 3D Point Clouds by Denoising Diffusion
- TrackAny3D Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking
- Point Cloud Registration
- TurboReg: TurboClique for Robust and Efficient Point Cloud Registration
:star:code - DiffI2P: Differentiable Image-to-Point Cloud Registration with Diffusion Prior
- Unsupervised RGB-D Point Cloud Registration for Scenes with Low Overlap and Photometric Inconsistency
- BUFFER-X Towards Zero-Shot Point Cloud Registration in Diverse Scenes
:star:code
- Point Cloud Segmentation
- Point Cloud Completion
- Point Cloud Classification
- Point Cloud Denoising
22.3D
- VertexRegen: Mesh Generation with Continuous Level of Detail
:star:code - GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields
:star:code - ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
:star:code - Top2Pano: Learning to Generate Indoor Panoramas from Top-Down View
:star:code - PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction
:star:code - Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion
:star:code
:star:code - OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment
:star:code - LightSwitch: Multi-view Relighting with Material-guided Diffusion
:star:code
:star:code - How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach
- Diorama Unleashing Zero-shot Single-view 3D Indoor Scene Modeling
- Articulate3D Holistic Understanding of 3D Scenes as Universal Scene Description
- HouseCrafter Lifting Floorplans to 3D Scenes with 2D Diffusion Models
- Uncertainty-Aware Diffusion-Guided Refinement of 3D Scenes
- Learning 3D Scene Analogies with Neural Contextual Scene Maps
:house:project - SuperDec 3D Scene Decomposition with Superquadrics Primitives
- SAS Segment Any 3D Scene with Integrated 2D Priors
- Bolt3D Generating 3D Scenes in Seconds
- GLEAM Learning Generalizable Exploration Policy for Active Mapping in Complex 3D Indoor Scene
- Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene
- Can3Tok Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians
- Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction
:house:project - Generative Gaussian Splatting Generating 3D Scenes with Video Diffusion Priors
- MaGS Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian Splatting
- GS-Occ3D Scaling Vision-only Occupancy Reconstruction with Gaussian Splatting
:house:project - PhysSplat Efficient Physics Simulation for 3D Scenes via MLLM-Guided Gaussian Splatting
- CCL-LGS Contrastive Codebook Learning for 3D Language Gaussian Splatting
:house:project - InsideOut Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation
- Surface Reconstruction
- SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies
:star:code - RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians
:star:code - PolGS Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction
- Quadratic Gaussian Splatting High Quality Surface Reconstruction with Second-order Geometric Primitives
- Drawing Developmental Trajectory from Cortical Surface Reconstruction
- SurfaceSplat Connecting Surface Reconstruction and Gaussian Splatting
:star:code - QuickSplat Fast 3D Surface Reconstruction via Learned Gaussian Initialization
- GSRecon Efficient Generalizable Gaussian Splatting for Surface Reconstruction from Sparse Views
:star:code - GCRayDiffusion Pose-Free Surface Reconstruction via Geometric Consistent Ray Diffusion
- MGSR 2D3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions
:star:code
- 3D Reconstruction
- Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction
:star:code - InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes
:star:code - Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints
:star:code - MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction
- LONG3R: Long Sequence Streaming 3D Reconstruction
:star:code - H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction
:star:code - Ross3D Reconstructive Visual Instruction Tuning with 3D-Awareness
- FlowR Flowing from Sparse to Dense 3D Reconstructions
- Dream-to-Recon Monocular 3D Reconstruction with Diffusion-Depth Distillation from Single Images
- POMATO Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction
:star:code - Amodal3R Amodal 3D Reconstruction from Occluded 2D Images
- DeGauss Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction
- Hi-Gaussian Hierarchical Gaussians under Normalized Spherical Projection for Single-View 3D Reconstruction
- Explaining Human Preferences via Metrics for Structured 3D Reconstruction
- Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration
- Dynamic Point Maps A Versatile Representation for Dynamic 3D Reconstruction
- RadarSplat Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes
- HAMSt3R Human-Aware Multi-view Stereo 3D Reconstruction
- ArchiSet Benchmarking Editable and Consistent Single-View 3D Reconstruction of Buildings with Specific Window-to-Wall Ratios
- FreeSplatter Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
- SketchSplat 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting
- Inverse 3D Microscopy Rendering for Cell Shape Inference with Active Mesh
- Real3D Towards Scaling Large Reconstruction Models with Real Images
- TimeFormer Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction
- Mamba-3VL Taming State Space Model for 3D Vision Language Learning
- AAA-Gaussians Anti-Aliased and Artifact-Free 3D Gaussian Rendering
- Scene Reconstruction
- BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting
:star:code - ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting
:star:code - DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion
- Momentum-GS Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
- Splat-based 3D Scene Reconstruction with Extreme Motion-blur
- Puzzle Similarity A Perceptually-guided Cross-Reference Metric for Artifact Detection in 3D Scene Reconstructions
:house:project - ExploreGS: Explorable 3D Scene Reconstruction with Virtual Camera Samplings and Diffusion Priors
:star:code
:star:code - Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos
:house:project
- S3R-GS Streamlining the Pipeline for Large-Scale Street Scene Reconstruction
:star:code - RGE-GS Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors
- SpatialCrafter Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations
- ClaraVid A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling
:house:project - Diffusion-Based Extreme High-speed Scenes Reconstruction with the Complementary Vision Sensor
:star:code - CityGS-X A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction
- Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction
- VistaDream Sampling multiview consistent images for single-view scene reconstruction
- Hierarchy UGP Hierarchy Unified Gaussian Primitive for Large-Scale Dynamic Scene Reconstruction
- Back on Track Bundle Adjustment for Dynamic Scene Reconstruction
- Geo4D Leveraging Video Generators for Geometric 4D Scene Reconstruction
- Humans as a Calibration Pattern Dynamic 3D Scene Reconstruction from Unsynchronized and Uncalibrated Videos
- Proactive Scene Decomposition and Reconstruction
- Scene Coordinate Reconstruction Priors
- 3D Scene Understanding
- VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding
:star:code - Open-Vocabulary Octree-Graph for 3D Scene Understanding
:star:code - HERMES A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
:star:code - 3DGraphLLM Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
:star:code - ExCap3D Expressive 3D Scene Understanding via Object Captioning with Varying Detail
- NuPlanQA A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
:star:code - Hierarchical 3D Scene Graphs Construction Outdoors
- AG2aussian Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing
- Embodied VideoAgent Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
- OURO A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding
:star:code - SceneSplat Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
- Depth Estimation
- DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation
- Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
- One Look is Enough Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation on High-Resolution Images
- FiffDepth Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation
- GVDepth Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion
:house:project :house:project - Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens
:star:code - Depth Any Event Stream Enhancing Event-based Monocular Depth Estimation via Dense-to-Sparse Distillation
- Depth AnyEvent A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
- StableDepth Scene-Consistent and Scale-Invariant Monocular Depth
- Spherical Epipolar Rectification for Deep Two-View Absolute Depth Estimation
- Seeing and Seeing Through the Glass Real and Synthetic Data for Multi-Layer Depth Estimation
- S2M2 Scalable Stereo Matching Model for Reliable Depth Estimation
- Hyper-Depth Hypergraph-based Multi-Scale Representation Fusion for Monocular Depth Estimation
- FlashDepth Real-time Streaming Video Depth Estimation at 2K Resolution
:star:code - Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation
- Amodal Depth Anything Amodal Depth Estimation in the Wild
- Depth Completion
- PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency
:star:code - Test-Time Prompt Tuning for Zero-Shot Depth Completion
- ETA Energy-based Test-time Adaptation for Depth Completion
- OMNI-DC Highly Robust Depth Completion with Multiresolution Depth Integration
:star:code - HFD-Teacher High-Frequency Depth Distillation from Depth Foundation Models for Enhanced Depth Completion
- Marigold-DC Zero-Shot Monocular Depth Completion with Guided Diffusion
:house:project
- Stereo Matching (SM)
- RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather
- BANet Bilateral Aggregation Network for Mobile Stereo Matching
- ZeroStereo Zero-shot Stereo Matching from Single Images
:star:code - Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
:star:code - Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
- Global Regulation and Excitation via Attention Tuning for Stereo Matching
:star:code - MDP-Omni Parameter-free Multimodal Depth Prior-based Sampling for Omnidirectional Stereo Matching
- MVS
- 3DGS
- RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration
:star:code - PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations
- GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments
- ResGS Residual Densification of 3D Gaussian for Efficient Detail Recovery
- RobustSplat Decoupling Densification and Dynamics for Transient-Free 3DGS
:house:project - Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling
:house:project - OCSplats Observation Completeness Quantification and Label Noise Separation in 3DGS
- GSV3D Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation
:star:code - GauUpdate New Object Insertion in 3D Gaussian Fields with Consistent Global Illumination
- A Lesson in Splats Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision
- A3GS Arbitrary Artistic Style into Arbitrary 3D Gaussian Splatting
- StochasticSplats Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting
- InterGSEdit Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior
- GaRe Relightable 3D Gaussian Splatting for Outdoor Scenes from Unconstrained Photo Collections
- NeRF Is a Valuable Assistant for 3D Gaussian Splatting
- Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction
:house:project - GazeGaussian High-Fidelity Gaze Redirection with 3D Gaussian Splatting
:house:project - LongSplat Robust Unposed 3D Gaussian Splatting for Casual Long Videos
:house:project :house:project - Lightweight Gradient-Aware Upscaling of 3D Gaussian Splatting Images
- StealthAttack Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
- SU-RGS Relightable 3D Gaussian Splatting from Sparse Views under Unconstrained Illuminations
- MEGA Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes
:star:code - SplArt Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting
:star:code - CATSplat Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
- AccidentalGS 3D Gaussian Splatting from Accidental Camera Motion
- No Pose at All Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
:house:project :house:project
- Semantic Scene Completion
- Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion
:star:code
:star:code - Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
:star:code - Monocular Semantic Scene Completion via Masked Recurrent Networks
:star:code - SDFormer Vision-based 3D Semantic Scene Completion via SAM-assisted Dual-channel Voxel Transformer
- Global-Aware Monocular Semantic Scene Completion with State Space Models
- VisHall3D Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions
- Scene Completion
- 4D Reconstruction
- Scene Generation
- ScenePainter Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment
- Free4D Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
- Controllable 3D Outdoor Scene Generation via Scene Graphs
:star:code - InfiniCube Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
- MiDSummer Multi-Guidance Diffusion for Controllable Zero-Shot Immersive Gaussian Splatting Scene Generation
- WonderPlay Dynamic 3D Scene Generation from a Single Image and Actions
- VMem Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
- AutoScape Geometry-Consistent Long-Horizon Scene Generation
- PersonaCraft Personalized and Controllable Full-Body Multi-Human Scene Generation Using Occlusion-Aware 3D-Conditioned Diffusion
- Decoupled Diffusion Sparks Adaptive Scene Generation
- Large Scene Generation with Cube-Absorb Discrete Diffusion
- Scene Flow Estimation
21.UAV/RS/Satellite Image
- HUG Hierarchical Urban Gaussian Splatting with Block-Based Reconstruction for Large-Scale Aerial Scenes
- UAVScenes: A Multi-Modal Dataset for UAVs
:star:code - MMGeo Multimodal Compositional Geo-Localization for UAVs
:star:code - Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method
:star:code - LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment
- SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing
- Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
:star:code - OpenRSD Towards Open-prompts for Object Detection in Remote Sensing Images
- Dual Domain Control via Active Learning for Remote Sensing Domain Incremental Object Detection
- Active Learning Meets Foundation Models Fast Remote Sensing Data Annotation for Object Detection
- LLM-Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection
- Fusion Meets Diverse Conditions A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues
- RS-vHeat Heat Conduction Guided Efficient Remote Sensing Foundation Model
- SMARTIES Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images
:house:project - HoliTracer Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery
:star:code - Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning
- When Large Vision-Language Model Meets Large Remote Sensing Imagery Coarse-to-Fine Text-Guided Token Pruning
- Satellite
- Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling
- WildSAT Learning Satellite Image Representations from Wildlife Observations
- Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion
- MagicCity Geometry-Aware 3D City Generation from Satellite Imagery with Multi-View Consistency
- Change Detection
- Object Detection
- Segmentation
- UAV
20.OCR
- Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation
:star:code - CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality
:star:code - OCR Hinders RAG Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
:star:code - CC-OCR A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
- MSA2 Multi-task Framework with Structure-aware and Style-adaptive Character Representation for Open-set Chinese Text Recognition
- A Token-level Text Image Foundation Model for Document Understanding
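:star:code - Text Generation
- Oracle Bone Script Decipherment
- Document Rectification
- Table Understanding
- Scene Text Retrieval
- Scene Text Recognition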
19.Video
- SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications
:star:code - Fine-grained Spatiotemporal Grounding on Egocentric Videos
:star:code - LV-MAE Learning Long Video Representations through Masked-Embedding Autoencoders
- Learning Streaming Video Representation via Multitask Training
- Tree-NeRV Efficient Non-Uniform Sampling for Neural Video Representation via Tree-Structured Feature Grids
:star:code - What Changed and What Could Have Changed State-Change Counterfactuals for Procedure-Aware Video Representation Learning
:star:code - One Trajectory One Token Grounded Video Tokenization via Panoptic Sub-object Trajectory
- FrameFusion Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models
:star:code - Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces
- StreamMind Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
- Video Understanding
- Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
- MDP3 A Training-free Approach for List-wise Frame Selection in Video-LLMs
- ARGUS Hallucination and Omission Evaluation in Video-LLMs
- Flash-VStream: Efficient Real-Time Understanding for Long Video Streams
:star:code - AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
- MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding
:star:code - DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding
- VideoLLaMB Long Streaming Video Understanding with Recurrent Memory Bridges
- Beyond Training Dynamic Token Merging for Zero-Shot Video Understanding
:star:code - LVAgent Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
:star:code - Streaming VideoLLMs for Real-Time Procedural Video Understanding
- From Trial to Triumph Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
- AdsQA Towards Advertisement Video Understanding
- Principles of Visual Tokens for Efficient Video Understanding
:star:code - Open-ended Hierarchical Streaming Video Understanding with Vision Language Models
- VideoAds for Fast-Paced Video Understanding
- VCA Video Curious Agent for Long Video Understanding
- Video Summarization
- Video Temporal Grounding
- Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
:star:code - KDA Knowledge Diffusion Alignment with Enhanced Context for Video Temporal Grounding
- Sparse-Dense Side-Tuner for efficient Video Temporal Grounding
:star:code - TimeExpert An Expert-Guided Video LLM for Video Temporal Grounding
- OVG-HQ Online Video Grounding with Hybrid-modal Queries
- Vid-Group Temporal Video Grounding Pretraining from Unlabeled Videos in the Wild
:star:code - Enrich and Detect Video Temporal Grounding with Multimodal LLMs
- VTimeCoT Thinking by Drawing for Video Temporal Grounding and Reasoning
- Video Anomaly Detection
- Video Moment Retrieval
- Video Frame Interpolation
- Video Prediction
- Video Customization
- DreamRelation Relation-Centric Video Customization
:house:project - MagicID Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
- PersonalVideo High ID-Fidelity Video Customization without Dynamic and Semantic Degradation
- DualReal Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization
:house:project
18.Person Re-Identification
- Multi-modal Multi-platform Person Re-Identification Benchmark and Method
- VIPerson Flexibly Generating Virtual Identity for Person Re-Identification
:house:project - One-Shot Knowledge Transfer for Scalable Person Re-Identification
- OpenAnimals Revisiting Person Re-Identification for Animals Towards Better Generalization
- Bridging the Sky and Ground Towards View-Invariant Feature Learning for Aerial-Ground Person Re-Identification
- Prompt-driven Transferable Adversarial Attack on Person Re-Identification with Attribute-aware Textual Inversion
- Cross-Category Subjectivity Generalization for Style-Adaptive Sketch Re-ID
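- Video-based Re-Identification
- Cloth-Changing Re-Identification
- Lifelong Re-Identification
- Visible-Infrared
- Behavior Understanding
- Person Retrieval
- Gait Recognition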
17.Action Recognition
- ProbRes Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition
- DeSPITE Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
:star:code - Punching Bag vs. Punching Person: Motion Transferability in Videos
:star:code - Learning to Generalize without Bias for Open-Vocabulary Action Recognition
- SAMPLE Semantic Alignment through Temporal-Adaptive Multimodal Prompt Learning for Event-Based Open-Vocabulary Action Recognition
:star:code - Exploiting Frequency Dynamics for Enhanced Multimodal Event-based Action Recognition
- Less Static More Private Towards Transferable Privacy-Preserving Action Recognition by Generative Decoupled Learning
- Dynamic Group Detection using VLM-augmented Temporal Groupness Graph
:star:code - Motion Prediction
- Efficient Multi-Person Motion Prediction by Lightweight Spatial and Temporal Interactions
:star:code - PriorMotion Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors
- Gaussian-based World Model Gaussian Priors for Voxel-Based Occupancy Prediction and Future Motion Prediction
- Proxy-Bridged Game Transformer for Interactive Extreme Motion Prediction
:star:code
- Few-Shot Action Recognition
- Action Segmentation
- Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation
:star:code - Multi-Modal Few-Shot Temporal Action Segmentation
:star:code - DuoCLR Dual-Surrogate Contrastive Learning for Skeleton-based Human Action Segmentation
- Joint Self-Supervised Video Alignment and Action Segmentation
- CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation
- Action Detection
- Skeleton-based Action Recognition
- Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition
- Bridging the Skeleton-Text Modality Gap Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
- Hierarchical-aware Orthogonal Disentanglement Framework for Fine-grained Skeleton-based Action Recognition
- Bridging Class Imbalance and Partial Labeling via Spectral-Balanced Energy Propagation for Skeleton-based Action Recognition
- Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections
:star:code
- Action Anticipation
16.Human Motion
- Cycle Consistency as Reward Learning Image-Text Alignment without Human Preferences
:house:project - IMoRe Implicit Program-Guided Reasoning for Human Motion QA
:star:code - Future-Aware Interaction Network For Motion Forecasting
- Decouple and Track Benchmarking and Improving Video Diffusion Transformers For Motion Transfer
- Privacy-centric Deep Motion Retargeting for Anonymization of Skeleton-Based Motion Visualization
- MagShield Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances
:star:code - GENMO A GENeralist Model for Human MOtion
- KinMo Kinematic-aware Human Motion Understanding and Generation
:house:project - Continuous-Time Human Motion Field from Event Cameras
- Probabilistic Inertial Poser (ProbIP) Uncertainty-aware Human Motion Modeling from Sparse Inertial Sensors
- MVTrajecter Multi-View Pedestrian Tracking with Trajectory Motion Cost and Trajectory Appearance Cost
- StyleMotif Multi-Modal Motion Stylization using Style-Content Cross Fusion
:house:project - HumanSAM Classifying Human-centric Forgery Videos in Human Spatial Appearance and Motion Anomaly
- Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing
- Punching Bag vs Punching Person Motion Transferability in Videos
:star:code - Human Motion Segmentation
- Motion Generation
- PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups
:star:code - RapVerse Coherent Vocals and Whole-Body Motion Generation from Text
- Go to Zero Towards Zero-shot Motion Generation with Million-scale Data
- PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks
- GenM3 Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
- SMGDiff Soccer Motion Generation using Diffusion Probabilistic Models
- FineMotion A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing
- VMBench A Benchmark for Perception-Aligned Video Motion Generation
:star:code - Towards Immersive Human-X Interaction A Real-Time Framework for Physically Plausible Motion Synthesis
- I2VControl Disentangled and Unified Video Motion Synthesis Control
:house:project - MoMaps Semantics-Aware Scene Motion Generation with Motion Maps
- Morph A Motion-free Physics Optimization Framework for Human Motion Generation
- MaskControl Spatio-Temporal Control for Masked Motion Synthesis
:house:project - InterSyn Interleaved Learning for Dynamic Motion Synthesis in the Wild
- Motion Synthesis with Sparse and Flexible Keyjoint Control
- MotionStreamer Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
:house:project - SemTalk Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis
- InfiniDreamer Arbitrarily Long Human Motion Generation via Segment Score Distillation
- Text-to-Any-Skeleton Motion Generation Without Retargeting
- MotionLab Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
:house:project - Motion-2-to-3 Leveraging 2D Motion Data for 3D Motion Generations
- DIMO Diverse 3D Motion Generation for Arbitrary Objects
- You Think You ACT The New Task of Arbitrary Text to Motion Generation
:star:code - Dual Reciprocal Learning of Language-based Human Motion Understanding and Generation
- Motion Reconstruction
- Motion Estimation
- Learning Large Motion Estimation from Intermediate Representations with a High-Resolution Optical Flow Dataset Featuring Long-Range Dynamic Motion
:star:code - EMoTive Event-guided Trajectory Modeling for 3D Motion Estimation
- MBTI Masked Blending Transformers with Implicit Positional Encoding for Frame-rate Agnostic Motion Estimation
- EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba
- Dance Generation
- DanceEditor Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions
:house:project - FreeDance Towards Harmonic Free-Number Group Dance Generation via a Unified Framework
:star:code - MDD A Dataset for Text-and-Music Conditioned Duet Dance Generation
- Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling
:house:project
- Motion Editing
15.Pose
- Detection Pose Estimation and Segmentation for Multiple Bodies Closing the Virtuous Circle
- HIS-GPT Towards 3D Human-In-Scene Multimodal Understanding
- AdaHuman Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion
- TriDi Trilateral Diffusion of 3D Humans Objects and Interactions
- PHD Personalized 3D Human Body Fitting with Point Diffusion
- SIGMAN Scaling 3D Human Gaussian Generation with Millions of Assets
- Group Inertial Poser Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging
- Human Mesh Recovery
- Human Reconstruction
- Gesture Synthesis
- SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning
:star:code - Democratizing High-Fidelity Co-Speech Gesture Video Generation
:house:project - GestureLSM Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
- Understanding Co-speech Gestures in-the-wild
- GestureHYDRA Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation
- HPE
- LDPose Towards Inclusive Human Pose Estimation for Limb-Deficient Individuals in the Wild
- From Sharp to Blur Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras
:star:code - High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation
- Generative Modeling of Shape-Dependent Self-Contact Human Poses
- 3D HPE
- DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior
:star:code - A Structure-aware and Motion-adaptive Framework for 3D Human Pose Estimation with Mamba
- Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation
- VOccl3D A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions
- PoseAnchor Robust Root Position Estimation for 3D Human Pose Estimation
- PersPose 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation
:star:code
- Human Pose Generation
- Hand Pose
- Sign Language Generation
- Signs as Tokens A Retrieval-Enhanced Multilingual Sign Language Generator
- GReg Geometry-Aware Region Refinement for Sign Language Video Generation
- Cross-View Isolated Sign Language Recognition via View Synthesis and Feature Disentanglement
- Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
:star:code
- Keypoint Detection
14.Object Tracking
- Efficient Track Anything
:star:code - UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions
:star:code - Is Tracking really more challenging in First Person Egocentric Vision?
- What You Have is What You Track: Adaptive and Robust Multimodal Tracking
:star:code - Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking
- General Compression Framework for Efficient Transformer Object Tracking
:star:code - COVTrack Continuous Open-Vocabulary Tracking via Adaptive Multi-Cue Fusion
- Attention to Trajectory Trajectory-Aware Open-Vocabulary Tracking
:star:code - BlinkTrack Feature Tracking over 80 FPS via Events and Images
:star:code - egoPPG Heart Rate Estimation from Eye-Tracking Cameras in Egocentric Systems to Benefit Downstream Vision Tasks
- M2EIT Multi-Domain Mixture of Experts for Robust Neural Inertial Tracking
- CAT A Unified Click-and-Track Framework for Realistic Tracking
:star:code - SMSTracker Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking
:star:code - How To Make Your Cell Tracker Say I dunno
- Tracking Tiny Drones against Clutter Large-Scale Infrared Benchmark with Motion-Centric Adaptive Algorithm
:star:code - Multi-Object Tracking
- Point Tracking
- TAPNext Tracking Any Point (TAP) as Next Token Prediction
- AllTracker Efficient Dense Point Tracking at High Resolution
- CoTracker3 Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
- SpatialTrackerV2: 3D Point Tracking Made Easy
:house:project
:star:code - ReTracker Exploring Image Matching for Robust Online Any Point Tracking
- Event-aided Dense and Continuous Point Tracking Everywhere and Anytime
- Online Dense Point Tracking with Streaming Memory
:house:project - Multi-View 3D Point Tracking
:house:project - SpatialTrackerV2 Advancing 3D Point Tracking with Explicit Camera Motion
- MATE Motion-Augmented Temporal Consistency for Event-based Point Tracking
- 3D Tracking
- Video Object Tracking
13.Object Detection
- DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic
- Visual Modality Prompt for Adapting Vision-Language Object Detectors
:star:code - Gradient Decomposition and Alignment for Incremental Object Detection
- Visual Textualization for Image Prompted Object Detection
:star:code - Task-Specific Zero-shot Quantization-Aware Training for Object Detection
:star:code - PBCAT: Patch-based composite adversarial training against physically realizable attacks on object detection
- UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement
:star:code - SFUOD: Source-Free Unknown Object Detection
- LMM-Det: Make Large Multimodal Models Excel in Object Detection
:star:code - Adversarial Attention Perturbations for Large Object Detection Transformers
:star:code - ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting
- Gradient-Reweighted Adversarial Camouflage for Physical Object Detection Evasion
- Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes
- Cycle-Consistent Learning for Joint Layout-to-Image Generation and Object Detection
- Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning
:star:code - Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability
:star:code - Diffusion-based Source-biased Model for Single Domain Generalized Object Detection
- VISO Accelerating In-orbit Object Detection with Language-Guided Mask Learning and Sparse Inference
- Beyond RGB Adaptive Parallel Processing for RAW Object Detection
- Dark-ISP Enhancing RAW Image Processing for Low-Light Object Detection
- DoppDrive Doppler-Driven Temporal Aggregation for Improved Radar Object Detection
:house:project - Continual Adaptation Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios
- From Objects to Events Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning
- Online Generic Event Boundary Detection
- Revisiting Adversarial Patch Defenses on Object Detectors Unified Evaluation Large-Scale Dataset and New Insights
:star:code - Small Object Detection
- Open-Set Object Detection
- 3D Object Detection
- Detect Anything 3D in the Wild
- OcRFDet: Object-Centric Radiance Fields for Multi-View 3D Object Detection in Autonomous Driving
:star:code - MambaFusion: Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection
:star:code - Perspective-Invariant 3D Object Detection
:star:code - Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction
:star:code - Motal Unsupervised 3D Object Detection by Modality and Task-specific Knowledge Transfer
- MemDistill Distilling LiDAR Knowledge into Memory for Camera-Only 3D Object Detection
- Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning
:star:code - GeoFormer Geometry Point Encoder for 3D Object Detection with Graph-based Transformer
- Adaptive Dual Uncertainty Optimization Boosting Monocular 3D Object Detection under Test-Time Shifts
:star:code - EVT Efficient View Transformation for Multi-Modal 3D Object Detection
- CVFusion Cross-View Fusion of 4D Radar and Camera for 3D Object Detection
:star:code - OV-SCAN Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection
- Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Object Detection
:star:code - FreqPDE Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers
- Harnessing Uncertainty-aware Bounding Boxes for Unsupervised 3D Object Detection
- OpenM3D Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
- RCTDistill Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion
- Towards Accurate and Efficient 3D Object Detection for Autonomous Driving A Mixture of Experts Computing System on Edge
- MonoSOWA Scalable Monocular 3D Object Detector Without Human Annotations
- Doppler-Aware LiDAR-RADAR Fusion for Weather-Robust 3D Detection
- CHARM3R Towards Unseen Camera Height Robust Monocular 3D Detector
- Camouflaged Object Detection
- Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes
:star:code - ESCNet Edge-Semantic Collaborative Network for Camouflaged Object Detection
- Enhancing Prompt Generation with Adaptive Refinement for Camouflaged Object Detection
- Beyond Single Images Retrieval Self-Augmented Unsupervised Camouflaged Object Detection
:star:code - Improving SAM for Camouflaged Object Detection via Dual Stream Adapters
- Scoring Remember and Reference Catching Camouflaged Objects in Videos
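- Semi-Supervised Object Detection
- Few-Shot Object Detection
- Domain-Adaptive Object Detection
- Open-Vocabulary Object Detection
- Visible-Infrared Object Detection
- Infrared Small Target Detection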
- From Easy to Hard Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
:star:code - Text-IRSTD Leveraging Semantic Text to Promote Infrared Small Target Detection in Complex Scenes
:star:code - DISTA-Net Dynamic Closely-Spaced Infrared Small Target Unmixing
:star:code
12.Avatar
- TeRA Rethinking Text-guided Realistic 3D Avatar Generation
- HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars
:star:code - MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction
:star:code - PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image
:star:code - Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars
- Im2Haircut Single-view Strand-based Hair Reconstruction for Human Avatars
- HADES Human Avatar with Dynamic Explicit Hair Strands
- GaussianSpeech Audio-Driven Personalized 3D Gaussian Avatars
- GUAVA Generalizable Upper Body 3D Gaussian Avatar
- Disentangled Clothed Avatar Generation with Layered Representation
:house:project - GAS Generative Avatar Synthesis from a Single Image
- Head Avatar
- Capturing head avatar with hand contacts from a monocular video
- OneGT One-Shot Geometry-Texture Neural Rendering for Head Avatars
- Avat3r Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars
- StrandHead Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors
:house:project - Fine-Grained 3D Gaussian Head Avatars Modeling from Static Captures via Joint Reconstruction and Registration
- GeoAvatar: Adaptive Geometrical Gaussian Splatting for 3D Head Avatar
:star:code - Identity Preserving 3D Head Stylization with Multiview Score Distillation
11.Face
- IDFace: Face Template Protection for Efficient and Secure Identification
- FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models
- F-Bench Rethinking Human Preference Evaluation Metrics for Benchmarking Face Generation Customization and Restoration
- DH-FaceVid-1K A Large-Scale High-Quality Dataset for Face Video Generation
:house:project - DynamicID Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability
- Monocular Facial Appearance Capture in the Wild
- FPEM Face Prior Enhanced Facial Attractiveness Prediction for Live Videos with Face Retouching
:star:code - FaceXFormer A Unified Transformer for Facial Analysis
- FaceShield Defending Facial Image against Deepfake Threats
- TimeBooth Disentangled Facial Invariant Representation for Diverse and Personalized Face Aging
:star:code - InteractAvatar Modeling Hand-Face Interaction in Photorealistic Avatars with Deformable Gaussians
- MR-FIQA Face Image Quality Assessment with Multi-Reference Representations from Synthetic Data Generation
- Face Retouching with Diffusion Data Generation and Spectral Restorement
- FaceLift Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads
- Face Recognition
- LVFace Progressive Cluster Optimization for Large Vision Models in Face Recognition
:star:code - Bi-Level Optimization for Self-Supervised AI-Generated Face Detection
- Stylized-Face A Million-level Stylized Face Dataset for Face Recognition
- VIGFace Virtual Identity Generation for Privacy-Free Face Recognition Dataset
- Face Restoration
- DynFaceRestore: Balancing Fidelity and Quality in Diffusion-Guided Blind Face Restoration with Dynamic Blur-Level Mapping and Guidance
- MoFRR Mixture of Diffusion Models for Face Retouching Restoration
- Unlocking the Potential of Diffusion Priors in Blind Face Restoration
- Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration
:star:code
- Facial Expression Recognition
- Multimodal Prompt Alignment for Facial Expression Recognition
- AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation
:star:code - SynFER Towards Boosting Facial Expression Recognition with Synthetic Data
- SEREP Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting
- ContextFace Generating Facial Expressions from Emotional Contexts
- Talking Head
- GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation
:star:code - DGTalker Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads
- FLOAT Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
- Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads
:star:code - ARIG: Autoregressive Interactive Head Generation for Real-time Conversations
:star:code - FixTalk Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases
- Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
- Face Swapping
- CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation
:house:project - Controllable and Expressive One-Shot Video Head Swapping
- NullSwap Proactive Identity Cloaking Against Deepfake Face Swapping
- DynamicFace High-Quality and Consistent Face Swapping for Image and Video using Composable 3D Facial Priors
:house:project
- Liveness Detection
- 3D Facial Animation
- Micro-Expression Recognition
- Facial Landmark Detection
- Head Reconstruction
10.Medical Image Processing
- Medical World Model
- MedSegFactory Text-Guided Generation of Medical Image-Mask Pairs
- Towards a Universal 3D Medical Multi-modality Generalization via Learning Personalized Invariant Representation
- Is Visual in-Context Learning for Compositional Medical Tasks within Reach?
- FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging
- Integrating Biological Knowledge for Robust Microscopy Image Profiling on De Novo Cell Lines
- Tiling artifacts and trade-offs of feature normalization in the segmentation of large biological images
:star:code - ProbMED A Probabilistic Framework for Medical Multimodal Binding
- Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation
- M-Net: MRI Brain Tumor Sequential Segmentation Network via Mesh-Cast
- Beyond Brain Decoding Visual-Semantic Reconstructions to Mental Creation Extension Based on fMRI
- MRGen Segmentation Data Engine For Underrepresented MRI Modalities
:house:project - Learn2Synth Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation
:star:code - MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy
:star:code - COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets
- CoStoDet-DDPM Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition
- Optimal Transport for Brain-Image Alignment Unveiling Redundancy and Synergy in Neural Information Processing
:star:code - SAMora Enhancing SAM through Hierarchical Self-Supervised Pre-Training for Medical Images
- CoSMIC Continual Self-supervised Learning for Multi-Domain Medical Imaging via Conditional Mutual Information Maximization
- Test-time Adaptation for Foundation Medical Segmentation Model Without Parametric Updates
- AcZeroTS Active Learning for Zero-shot Tissue Segmentation in Pathology Images
- TokenUnify Scaling Up Autoregressive Pretraining for Neuron Segmentation
:star:code - Keep Your Friends Close and Your Enemies Farther Distance-aware Voxel-wise Contrastive Learning for Semi-supervised Multi-organ Segmentation
- Breaking Grid Constraints Dynamic Graph Reconstruction Network for Multi-organ Segmentation
:star:code - Seeing the Trees for the Forest Rethinking Weakly-Supervised Medical Visual Grounding
- GEMeX A Large-Scale Groundable and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis
- Debiased Curriculum Adaptation for Safe Transfer Learning in Chest X-ray Classification
- Scaling Tumor Segmentation Best Lessons from Real and Synthetic Data
- PathFinder A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology
- GECKO Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
:star:code - Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training
- TPG-INR Target Prior-Guided Implicit 3D CT Reconstruction for Enhanced Sparse-view Imaging
- Medical Image Segmentation
- Adaptive Learning of High-Value Regions for Semi-Supervised Medical Image Segmentation
:star:code - Teaching AI the Anatomy Behind the Scan Addressing Anatomical Flaws in Medical Image Segmentation with Learnable Prior
- UKBOB One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation
- MaskSAM Auto-prompt SAM with Mask Classification for Volumetric Medical Image Segmentation
:star:code - Progressive Test Time Energy Adaptation for Medical Image Segmentation
:house:project - Toward Fair and Accurate Cross-Domain Medical Image Segmentation A VLM-Driven Active Domain Adaptation Paradigm
:star:code - SPA Efficient User-Preference Alignment against Uncertainty in Medical Image Segmentation
:star:code - Similarity Memory Prior is All You Need for Medical Image Segmentation
:star:code
- Medical Image Fusion
- Report Generation
- Whole Slide Image Analysis
- Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis
- Cracking Instance Jigsaw Puzzles: An Alternative to Multiple Instance Learning for Whole Slide Image Analysis
:star:code - Bridging Local Inductive Bias and Long-Range Dependencies with Pixel-Mamba for End-to-end Whole Slide Image Analysis
- WSI-LLaVA A Multimodal Large Language Model for Whole Slide Image
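- Whole Slide Image Segmentation
- 3D Medical Imaging
- Polyp Segmentation
- Medical Image Privacy Protection
- Cell Segmentation
- Keypoint Detection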
9.Image/Video Compression
- Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal
:star:code - Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs
- Image Compression
- Learned Image Compression with Hierarchical Progressive Context Modeling
:star:code - Cassic Towards Content-Adaptive State-Space Models for Learned Image Compression
- An Information-Theoretic Regularizer for Lossy Neural Image Compression
- Cross-Granularity Online Optimization with Masked Compensated Information for Learned Image Compression
:house:project - Knowledge Distillation for Learned Image Compression
- DLF Extreme Image Compression with Dual-generative Latent Fusion
:house:project - StableCodec Taming One-Step Diffusion for Extreme Image Compression
- VC
- Video Coding
- HyTIP Hybrid Temporal Information Propagation for Masked Conditional Residual Video Coding
:star:code - MH-LVC Multi-Hypothesis Temporal Prediction for Learned Conditional Residual Video Coding
- ResidualViT for Efficient Temporally Dense Video Encoding
- EEGMirror Leveraging EEG Data in the Wild via Montage-Agnostic Self-Supervision for EEG to Video Decoding
:star:code
- Video Autoencoding
8.Image/Video Retrieval
- Adversarial Reconstruction Feedback for Robust Fine-grained Generalization
- MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
:star:code - Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval
:star:code - Taming the Untamed Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown
- Image Retrieval
- Video Retrieval
- HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning
:star:code - Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning
:star:code - Beyond Simple Edits Composed Video Retrieval with Dense Modifications
- Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval
- Text-Video Retrieval
- Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization
- Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval
:star:code - Hybrid-Tower Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval
- Composed Image Retrieval
- Zero-Shot Composed Image Retrieval via Dual-Stream Instruction-Aware Distillation
- Hierarchy-Aware Pseudo Word Learning with Text Adaptation for Zero-Shot Composed Image Retrieval
- An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval
:star:code - Multi-Schema Proximity Network for Composed Image Retrieval
- CoTMR Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval
- MA-CIR A Multimodal Arithmetic Benchmark for Composed Image Retrieval
:star:code
7.Image Classification
- Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery
- Competitive Distillation: A Simple Learning Strategy for Improving Visual Classification
- CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts
:star:code - I Am Big, You Are Little; I Am Right, You Are Wrong
- Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy
- Synergistic Prompting for Robust Visual Recognition with Missing Modalities
- MolParser End-to-end Visual Recognition of Molecule Structures in the Wild
- Think Twice Test-Time Reasoning for Robust CLIP Zero-Shot Classification
- Supervised Exploratory Learning for Long-Tailed Visual Recognition
- Hierarchical Divide-and-Conquer Grouping for Classification Adaptation of Pre-Trained Models
- Long-Tailed Classification with Multi-Granularity Semantics
- On Large Multimodal Models as Open-World Image Classifiers
- NAPPure Adversarial Purification for Robust Image Classification under Non-Additive Perturbations
- MPBR Multimodal Progressive Bidirectional Reasoning for Open-Set Fine-Grained Recognition
- Fine-Grained Classification
- Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model
:star:code - LLM-assisted Entropy-based Adaptive Distillation for Unsupervised Fine-grained Visual Representation Learning
- Learning Separable Fine-Grained Representation via Dendrogram Construction from Coarse Labels for Fine-grained Visual Recognition
:star:code
- Image Classification
- Looking in the Mirror A Faithful Counterfactual Explanation Method for Interpreting Deep Image Classification Models
- SIC Similarity-Based Interpretable Image Classification with Neural Networks
- Learning Interpretable Queries for Explainable Image Classification with Information Pursuit
- Category-Specific Selective Feature Enhancement for Long-Tailed Multi-Label Image Classification
- MambaML Exploring State Space Models for Multi-Label Image Classification
- Generalized Category Discovery
6.Image Segmentation
- SAM4D: Segment Anything in Camera and LiDAR Streams
:star:code - Flow Stochastic Segmentation Networks
:star:code - Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation
- SAM Encoder Breach by Adversarial Simplicial Complex Triggers Downstream Model Failures
- Correspondence as Video Test-Time Adaption on SAM2 for Reference Segmentation in the Wild
:star:code - ProSAM Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts
- RA-BUSSeg Relation-aware Semi-supervised Breast Ultrasound Image Segmentation via Adjacent Propagation and Cross-layer Alignment
:star:code - DictAS A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup
- Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation
- Intermediate Connectors and Geometric Priors for Language-Guided Affordance Segmentation on Unseen Object Categories
- Unified Open-World Segmentation with Multi-Modal Prompts
- LawDIS Language-Window-based Controllable Dichotomous Image Segmentation
:star:code - HiMTok Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
- InstructSeg Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
- ViLLa Video Reasoning Segmentation with Large Language Model
- LIRA Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
:star:code - Text-guided Visual Prompt DINO for Generic Segmentation
:star:code - HyPiDecoder Hybrid Pixel Decoder for Efficient Segmentation and Detection
:star:code - Refer to Any Segmentation Mask Group With Vision-Language Prompts
:house:project - Adapt Foundational Segmentation Models with Heterogeneous Searching Space
- SegAnyPET Universal Promptable Segmentation from Positron Emission Tomography Images
- Multi-scenario Overlapping Text Segmentation with Depth Awareness
- Trace3D Consistent Segmentation Lifting via Gaussian Instance Tracing
- TopoTTA Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation
- FE-CLIP Frequency Enhanced CLIP Model for Zero-Shot Anomaly Detection and Segmentation
- ReferEverything Towards Segmenting Everything We Can Speak of in Videos
- Part Segmentation
- Scene Segmentation
- Object Segmentation
- Seeing the Unseen A Semantic Alignment and Context-Aware Prompt Framework for Open-Vocabulary Camouflaged Object Segmentation
- Controllable-LPMoE Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts
- Breaking Rectangular Shackles Cross-View Object Segmentation for Fine-Grained Object Geo-Localization
:house:project - Temporal Overlapping Prediction A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation
:star:code
- Matting
- Few-Shot Segmentation
- Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation
:star:code - DeFSS Image-to-Mask Denoising Learning for Few-shot Segmentation
- Adapting In-Domain Few-Shot Segmentation to New Domains without Source Domain Retraining
:star:code - Object-level Correlation for Few-Shot Segmentation
- Open-Vocabulary Segmentation
- ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation
:star:code - Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
- Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
- Talking to DINO Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
- Harnessing Vision Foundation Models for High-Performance Training-Free Open Vocabulary Segmentation
:star:code
- Instance Segmentation
- Details Matter for Indoor Open-vocabulary 3D Instance Segmentation
- OV3D-CG Open-vocabulary 3D Instance Segmentation with Contextual Guidance
:house:project - MOBIUS Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning
- WeaveSeg Iterative Contrast-weaving and Spectral Feature-refining for Nuclei Instance Segmentation
- CutS3D Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation
- S4M Boosting Semi-Supervised Instance Segmentation with SAM
- Zero-Shot Instance Segmentation
- Panoptic Segmentation
- Semantic Segmentation
- Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation
- Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment
:star:code
- Identity-aware Language Gaussian Splatting for Open-vocabulary 3D Semantic Segmentation
- Incremental Few-Shot Semantic Segmentation via Multi-Level Switchable Visual Prompts
:star:code - Reducing Unimodal Bias in Multi-Modal Semantic Segmentation with Multi-Scale Functional Entropy Regularization
- UniDxMD Towards Unified Representation for Cross-Modal Unsupervised Domain Adaptation in 3D Semantic Segmentation
- Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation
- Auto-Vocabulary Semantic Segmentation
:star:code - Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation
:star:code - Unsupervised Histopathological Image Semantic Segmentation with Overlapping Patches Consistency Constraint
- Pseudo-SD Pseudo Controlled Stable Diffusion for Semi-Supervised and Cross-Domain Semantic Segmentation
:star:code - Learning Yourself Class-Incremental Semantic Segmentation with Language-Inspired Bootstrapped Disentanglement
- Exploring Weather-aware Aggregation and Adaptation for Semantic Segmentation under Adverse Conditions
- OmniSAM Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation
- CoralSRT Revisiting Coral Reef Semantic Segmentation by Feature Rectification via Self-supervised Guidance
- Communication-Efficient Multi-Vehicle Collaborative Semantic Segmentation via Sparse 3D Gaussian Sharing
- Semi-Supervised Semantic Segmentation
- ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction
- When Confidence Fails Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation
:star:code - Two Losses, One Goal: Balancing Conflict Gradients for Semi-supervised Semantic Segmentation
- Weakly Supervised Semantic Segmentation
- Bias-Resilient Weakly Supervised Semantic Segmentation Using Normalizing Flows
:star:code - Know Your Attention Maps Class-specific Token Masking for Weakly Supervised Semantic Segmentation
- Few-Shot Semantic Segmentation
- Open-Vocabulary Semantic Segmentation
- Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation
- Training-Free Class Purification for Open-Vocabulary Semantic Segmentation
- Images as Noisy Labels Unleashing the Potential of the Diffusion Model for Open-Vocabulary Semantic Segmentation
- FLOSS Free Lunch in Open-vocabulary Semantic Segmentation
:star:code - DIH-CLIP Unleashing the Diversity of Multi-Head Self-Attention for Training-Free Open-Vocabulary Semantic Segmentation
- CLIPer Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
:star:code - CorrCLIP Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
:star:code - CLIP-Adapted Region-to-Text Learning for Generative Open-Vocabulary Semantic Segmentation
- Referring Image Segmentation
- Video Segmentation
- Interactive Segmentation
- Towards Fine-grained Interactive Segmentation in Images and Videos
- DC-TTA Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation
:star:code - Inter2Former Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation
- Easy3D A Simple Yet Effective Method for 3D Interactive Segmentation
- MultiverSeg Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance
- VIS
- Latest Object Memory Management for Temporally Consistent Video Instance Segmentation
:star:code
- Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation
:star:code - CAVIS Context-Aware Video Instance Segmentation
- Temporal-aware Query Routing for Real-time Video Instance Segmentation
- Sliced Wasserstein Bridge for Open-Vocabulary Video Instance Segmentation
- VOS
- MOVE: Motion-Guided Few-Shot Video Object Segmentation
:house:project - Structure Matters Revisiting Boundary Refinement in Video Object Segmentation
- EVOLVE Event-Guided Deformable Feature Transfer and Dual-Memory Refinement for Low-Light Video Object Segmentation
- ReferDINO Referring Video Object Segmentation with Visual Grounding Foundations
- MPG-SAM 2 Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
:star:code
- VSS
- GRES
5.Image Generation
- Rethinking Layered Graphic Design Generation with a Top-Down Approach
- QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation
- Text Embedding Knows How to Quantize Text-Guided Diffusion Models
- Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models
:star:code - Stable Diffusion Models are Secretly Good at Visual In-Context Learning
- FlexGen Flexible Multi-View Generation from Text and Image Inputs
- Model Reveals What to Cache Profiling-Based Feature Reuse for Video Diffusion Models
- DAViD Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models
- From Prompt to Progression Taming Video Diffusion Models for Seamless Attribute Transition
- Dynamic Typography Bringing Text to Life via Video Diffusion Prior
- Mobile Video Diffusion
:house:project - Latent-Reframe Enabling Camera Control for Video Diffusion Models without Training
- LangScene-X Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
- NormalCrafter Learning Temporally Consistent Normals from Video Diffusion Priors
:house:project - Prompt-A-Video Prompt Your Video Diffusion Model via Preference-Aligned LLM
- DimensionX Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion
- CameraCtrl II Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
- Beyond Next-Token Next-X Prediction for Autoregressive Visual Generation
:house:project - SpectralAR Spectral Autoregressive Visual Generation
:house:project :house:project - Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
:house:project - Randomized Autoregressive Visual Generation
:star:code - PUMA Empowering Unified MLLM with Multi-granular Visual Generation
- Neighboring Autoregressive Modeling for Efficient Visual Generation
- RealGeneral Unifying Visual Generation via Temporal In-Context Learning with Video Models
:star:code :house:project :house:project - 3D Mesh Editing using Masked LRMs
- PixTalk Controlling Photorealistic Image Processing and Editing with Language
- TextMaster A Unified Framework for Realistic Text Editing via Glyph-Style Dual-Control
- FlowEdit Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
- Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization
- ObjectMate A Recurrence Prior for Object Insertion and Subject-Driven Generation
- WikiAutoGen Towards Multi-Modal Wikipedia-Style Article Generation
- CMT A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
- MaterialMVP Illumination-Invariant Material Generation via Multi-view PBR Diffusion
- Unleashing Vecset Diffusion Model for Fast Shape Generation
:star:code - LaneDiffusion Improving Centerline Graph Learning via Prior Injected BEV Feature Generation
- ScanEdit Hierarchically-Guided Functional 3D Scan Editing
- NeuralSVG An Implicit Representation for Text-to-Vector Generation
- Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation
- Diffusion Models
- Penalizing Boundary Activation for Object Completeness in Diffusion Models
- Golden Noise for Diffusion Models A Learning Framework
- Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models
- DiffDoctor Diagnosing Image Diffusion Models Before Treating
- LoRAverse A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models
- Revelio Interpreting and leveraging semantic information in diffusion models
:star:code - Latent Diffusion Models with Masked AutoEncoders
- Timestep-Aware Diffusion Model for Extreme Image Rescaling
:star:code - Bootstrap3D Improving Multi-view Diffusion Model with Synthetic Data
- DiffSim Taming Diffusion Models for Evaluating Visual Similarity
- Layout Generation
- Image Synthesis
- Preserve Anything: Controllable Image Synthesis with Object Preservation
- PathDiff Histopathology Image Synthesis with Unpaired Text and Mask Conditions
:star:code - Rethinking Discrete Tokens Treating Them as Conditions for Continuous Autoregressive Image Synthesis
:house:project - AIComposer: Any Style and Content Image Composition via Feature Integration
:star:code - Toward Better Out-painting Improving the Image Composition with Initialization Policy Model
- ViCTr Vital Consistency Transfer for Pathology Aware Image Synthesis
- InfGen A Resolution-Agnostic Paradigm for Scalable Image Synthesis
- Frequency-Aware Autoregressive Modeling for Efficient High-Resolution Image Synthesis
- AM-Adapter Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild
- Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis
:house:project - PolarAnything Diffusion-based Polarimetric Image Synthesis
- Image Generation
- DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer
- Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling
:star:code - HPSv3: Towards Wide-Spectrum Human Preference Score
- Anti-Tamper Protection for Unauthorized Individual Image Generation
:star:code - Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
:star:code - Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment
:star:code - LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
:star:code
:star:code - Trade-offs in Image Generation: How Do Different Dimensions Interact?
:star:code - Grouped Speculative Decoding for Autoregressive Image Generation
:star:code - LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering
:star:code - HypDAE Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation
- USP Unified Self-Supervised Pretraining for Image Generation and Understanding
- Holistic Tokenizer for Autoregressive Image Generation
:star:code - HDR Image Generation via Gain Map Decomposed Diffusion
- EmotiCrafter Text-to-Emotional-Image Generation based on Valence-Arousal Model
- MV-Adapter Multi-View Consistent Image Generation Made Easy
- CAP Evaluation of Persuasive and Creative Image Generation
- VisualCloze A Universal Image Generation Framework via Visual In-Context Learning
:house:project - LiT Delving into a Simple Linear Diffusion Transformer for Image Generation
- UniVG A Generalist Diffusion Model for Unified Image Generation and Editing
- CompSlider Compositional Slider for Disentangled Multiple-Attribute Image Generation
- DC-ControlNet Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models
- Wasserstein Style Distribution Analysis and Transform for Stylized Image Generation
- LLM Thought Divergence and Convergence for Dialogue-Based Image Generation Control
- Unleashing High-Quality Image Generation in Diffusion Sampling Using Second-Order Levenberg-Marquardt-Langevin
- The Silent Assistant NoiseQuery as Implicit Guidance for Goal-Driven Image Generation
- FICGen Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation
- PlanGen Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
- GigaTok Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
- Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation
- GeoDiffusion A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation
- LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
- IntrinsicControlNet Cross-distribution Image Generation with Real and Unreal
- Dual-Process Image Generation
- LMM4LMM Benchmarking and Evaluating Large-multimodal Image Generation with LMMs
:star:code
- Text-to-Image
- Rethink Sparse Signals for Pose-guided Text-to-image Generation
:star:code - CharaConsist: Fine-Grained Consistent Character Generation
:star:code
:star:code - Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
:star:code - TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance
:star:code - T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
:star:code - YOLO-Count: Differentiable Object Counting for Text-to-Image Generation
- Steering Guidance for Personalized Text-to-Image Diffusion Models
- PLA: Prompt Learning Attack against Text-to-Image Generative Models
- ROVI A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation
:star:code - FedDifRC Unlocking the Potential of Text-to-Image Diffusion Models in Heterogeneous Federated Learning
- Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models
- Automated Red Teaming for Text-to-Image Models through Feedback-Guided Prompt Iteration with Vision-Language Models
:star:code - Scene Graph Guided Generation Enable Accurate Relations Generation in Text-to-Image Models via Textural Rectification
- ImageGen-CoT Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
:house:project - Holistic Unlearning Benchmark A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning
- Efficient Input-level Backdoor Defense on Text-to-Image Synthesis via Neuron Activation Variation
- Leveraging Panoptic Scene Graph for Evaluating Fine-Grained Text-to-Image Generation
- Decoding Correlation-Induced Misalignment in the Stable Diffusion Workflow for Text-to-Image Generation
- Dense2MoE Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation
- Fair Generation without Unfair Distortions Debiasing Text-to-Image Generation with Entanglement-Free Attention
- Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model
- IFAdapter Instance Feature Control for Grounded Text-to-Image Generation
- AutoPrompt Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
- AlignGuard Scalable Safety Alignment for Text-to-Image Generation
- UniversalBooth Model-Agnostic Personalized Text-to-Image Generation
- TF-TI2I Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning In Text-to-Image Models
- LBM Latent Bridge Matching for Fast Image-to-Image Translation
- Scalable Ranked Preference Optimization for Text-to-Image Generation
- RAGD Regional-Aware Diffusion Model for Text-to-Image Generation
- TRCE Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
- Region-Level Data Attribution for Text-to-Image Generative Models
- FairGen Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions
- DIMCIM A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models
- From Reflection to Perfection Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
:house:project - CoMPaSS Enhancing Spatial Understanding in Text-to-Image Diffusion Models
- SuMa A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models
- VSC Visual Search Compositional Text-to-Image Diffusion Model
- Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
- Reflect-DiT Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
- Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts
- TCFG Truncated Classifier-Free Guidance for Efficient and Scalable Text-to-Image Acceleration
- Discovering Divergent Representations between Text-to-Image Models
- CuRe Cultural Gaps in the Long Tail of Text-to-Image Systems
:house:project - Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models
- Generating Multi-Image Synthetic Data for Text-to-Image Customization
- Parametric Shadow Control for Portrait Generation in Text-to-Image Diffusion Models
- Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets
- DreamRenderer Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
:house:project :house:project - Scalable Dual Fingerprinting for Hierarchical Attribution of Text-to-Image Models
- Who Controls the Authorization? Invertible Networks for Copyright Protection in Text-to-Image Synthesis
- Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion
:star:code
- Text-to-Video
- TITAN-Guide: Taming Inference-Time AligNment for Guided Text-to-Video Diffusion Models
:star:code - BadVideo Stealthy Backdoor Attack against Text-to-Video Generation
:house:project - VPO Aligning Text-to-Video Generation Models with Prompt Optimization
:star:code - MotionShot Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
:house:project - Free2Guide Training-Free Text-to-Video Alignment using Image LVLM
:house:project - ETVA Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
- EfficientMT Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models
:star:code
- Image-to-Video
- RealCam-I2V Real-World Image-to-Video Generation with Interactive Complex Camera Control
- TIP-I2V A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
:house:project - GeoMan Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion
- I2V3D Controllable Image-to-video Generation with 3D Guidance
- Versatile Transition Generation with Image-to-Video Diffusion
- Video Synthesis
- Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis
- Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis
- Turbo2K Towards Ultra-Efficient and High-Quality 2K Video Synthesis
- Synthetic Video Enhances Physical Fidelity in Video Synthesis
- VACE All-in-One Video Creation and Editing
- V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models
:star:code
:star:code - Precise Action-to-Video Generation Through Visual Action Prompts
- AnimateAnyMesh A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
- MagicMotion Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
- DOLLAR Few-Step Video Generation via Distillation and Latent Reward Optimization
- DLFR-Gen Diffusion-based Video Generation with Dynamic Latent Frame Rate
- MagicMirror ID-Preserved Video Generation in Video Diffusion Transformers
- Authentic 4D Driving Simulation with a Video Generation Model
- NoiseController Towards Consistent Multi-view Video Generation via Noise Decomposition and Collaboration
- DiTaiListener Controllable High Fidelity Listener Video Generation with Diffusion
:house:project - VLIPP Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior
:house:project :house:project - DropletVideo A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
:house:project - A Unified Framework for Industrial Cel-Animation Colorization with Temporal-Structural Awareness
- X-Dancer Expressive Music to Human Dance Video Generation
- The Best of Both Worlds Integrating Language Models and Diffusion Models for Video Generation
:house:project - Long Context Tuning for Video Generation
- Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
- Reangle-A-Video 4D Video Generation as Video-to-Video Translation
:house:project - InstaDrive Instance-Aware Driving World Models for Realistic and Consistent Video Generation
- Importance-Based Token Merging for Efficient Image and Video Generation
- MotionAgent Fine-grained Controllable Video Generation via Motion Field Agent
- Video-T1 Test-time Scaling for Video Generation
- Unified Video Generation via Next-Set Prediction in Continuous Domain
- Phantom Subject-Consistent Video Generation via Cross-Modal Alignment
- VideoAuteur Towards Long Narrative Video Generation
- STIV Scalable Text and Image Conditioned Video Generation
- Generating Fast and Slow Scalable Parallel Video Generation with Video Interface Networks
- Puppet-Master Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics
- Adaptive Caching for Faster Video Generation with Diffusion Transformers
- T2Bs Text-to-Character Blendshapes via Video Generation
- QuantCache Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation
- Free-Form Motion Control Controlling the 6D Poses of Camera and Objects in Video Generation
- FullDiT Video Generative Foundation Models with Multimodal Control via Full Attention
- Long Video Synthesis
- Video Editing
- DIVE Taming DINO for Subject-Driven Video Editing
- AnyPortal Zero-Shot Consistent Video Background Replacement
- QK-Edit Revisiting Attention-based Injection in MM-DiT for Image and Video Editing
- FiVE-Bench A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models
:house:project - InsViE-1M Effective Instruction-based Video Editing with Elaborate Dataset Construction
- Image Editing
- ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation
:star:code - ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation
:star:code - InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow
- Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing
:house:project - ArtEditor Learning Customized Instructional Image Editor from Few-Shot Examples
- Zero-Shot Depth Aware Image Editing with Diffusion Models
- Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing
- LUSD Localized Update Score Distillation for Text-Guided Image Editing
- Training-free Geometric Image Editing on Diffusion Models
:star:code - KV-Edit Training-Free Image Editing for Precise Background Preservation
- Streamlining Image Editing with Layered Diffusion Brushes
- EditCLIP Representation Learning for Image Editing
- Training-Free Text-Guided Image Editing with Visual Autoregressive Model
- UIP2P Unsupervised Instruction-based Image Editing via Edit Reversibility Constraint
- FramePainter Endowing Interactive Image Editing with Video Diffusion Priors
:star:code - Edicho Consistent Image Editing in the Wild
- Instruction-based Image Editing with Planning, Reasoning, and Generation
- LOCATEdit Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing
:star:code - Multi-turn Consistent Image Editing
:house:project - DCT-Shield A Robust Frequency Domain Defense against Malicious Image Editing
- RefEdit A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions
- Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent
- FreeFlux Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing
- Anchor Token Matching Implicit Structure Locking for Training-free AR Image Editing
- Addressing Text Embedding Leakage in Diffusion-based Image Editing
- EEdit Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing
:star:code - SuperEdit Rectifying and Facilitating Supervision for Instruction-Based Image Editing
:star:code
- Image morphing
- Video generation
- 3D layout
- Text-to-3D
- SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation
:star:code - Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation
- LEGO-Maker A Semantic-Driven Algorithm for Text-to-3D Generation
- Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation
- VideoRFSplat Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling
:house:project :house:project - Generating Physically Stable and Buildable Brick Structures from Text
:house:project :house:project
- Video-to-4D
- Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis
:star:code
:star:code - Not All Frame Features Are Equal Video-to-4D Generation via Decoupling Dynamic-Static Features
:star:code :house:project - SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation
- Story generation
- Image stitching
- 3D generation
- MS3D High-Quality 3D Generation via Multi-Scale Representation Modeling
- DSO Aligning 3D Generators with Simulation Feedback for Physical Soundness
- From One to More Contextual Part Latents for 3D Generation
:house:project :house:project - Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation
- SViM3D Stable Video Material Diffusion for Single Image 3D Generation
- Layout-to-image generation
- CreatiLayout Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation
- MUSE Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion
:star:code - SEGA A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior
:house:project - Lay-Your-Scene Natural Scene Layout Generation with Diffusion Transformers
- REPARO Compositional 3D Assets Generation with Differentiable 3D Layout Alignment
- Mesh generation
- Sketch generation
- Image translation
4.Image Captioning
- CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning
:star:code - SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning
- OmniDiff A Comprehensive Benchmark for Fine-grained Image Difference Captioning
- Embodied Image Captioning Self-supervised Learning Agents for Spatially Coherent Image Descriptions
:house:project - Engage for All Making Ordinary Image Descriptions Appealing Again
- Chart captioning
- Video captioning
- Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning
:star:code - Describe Anything Detailed Localized Image and Video Captioning
- SweetTok Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization
- Large-scale Pre-training for Grounded Video Caption Generation
:house:project
3.Super-Resolution
- Image super-resolution
- LightBSR: Towards Lightweight Blind Super-Resolution via Discriminative Implicit Degradation Representation Learning
:star:code - Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework
- IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution
- Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training
:star:code - ZFusion Efficient Deep Compositional Zero-shot Learning for Blind Image Super-Resolution with Generative Diffusion Prior
- StyleSRN Scene Text Image Super-Resolution with Text Style Embedding
- Fast Image Super-Resolution via Consistency Rectified Flow
- Outlier-Aware Post-Training Quantization for Image Super-Resolution
- Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis
:star:code - Hipandas Hyperspectral Image Joint Denoising and Super-Resolution by Image Fusion with the Panchromatic Image
:star:code - Not All Degradations Are Equal A Targeted Feature Denoising Framework for Generalizable Image Super-Resolution
- Emulating Self-attention with Convolution for Efficient Image Super-Resolution
- Rethinking the Upsampling Process in Light Field Super-Resolution with Spatial-Epipolar Implicit Image Function
:star:code - DuCos Duality Constrained Depth Super-Resolution via Foundation Model
:star:code - Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution
:star:code - Reference-based Super-Resolution via Image-based Retrieval-Augmented Generation Diffusion
:star:code - Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution
- PatchScaler An Efficient Patch-Independent Diffusion Model for Image Super-Resolution
:star:code :house:project - DiT4SR Taming Diffusion Transformer for Real-World Image Super-Resolution
- NeurOp-Diff Continuous Remote Sensing Image Super-Resolution via Neural Operator Diffusion
:star:code - Consistency Trajectory Matching for One-Step Generative Super-Resolution
- Adversarial Purification via Super-Resolution and Diffusion
- Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models
:star:code
- Video super-resolution
- VSRM: A Robust Mamba-Based Framework for Video Super-Resolution
- TurboVSR: Fantastic Video Upscalers and Where to Find Them
- MedVSR Medical Video Super-Resolution with Cross State-Space Propagation
:star:code - Blind Video Super-Resolution based on Implicit Kernels
:star:code - DiffVSR Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations
- LDIP Long Distance Information Propagation for Video Super-Resolution
- STAR Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
2.Image/Video Processing
- LightsOut Diffusion-based Outpainting for Enhanced Lens Flare Removal
- Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal
:star:code - Deraining
- Denoising
- Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention
:star:code - Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment
:star:code - Robust Test-Time Adaptation for Single Image Denoising Using Deep Gaussian Prior
:star:code - Autoregressive Denoising Score Matching is a Good Video Anomaly Detector
- Blind2Sound Self-Supervised Image Denoising without Residual Noise
- Self-Calibrated Variance-Stabilizing Transformations for Real-World Image Denoising
- Denoising Token Prediction in Masked Autoregressive Models
- IDF Iterative Dynamic Filtering Networks for Generalizable Image Denoising
- Generic Event Boundary Detection via Denoising Diffusion
- Deblurring
- Image dehazing
- Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing
- When Schrödinger Bridge Meets Real-World Image Dehazing with Unpaired Training
:star:code - PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing
- GenHaze Pioneering Controllable One-Step Realistic Haze Generation for Real-World Dehazing
- HazeFlow Revisit Haze Physical Model as ODE and Non-Homogeneous Haze Generation for Real-World Dehazing
- Inpainting
- DiGA3D: Coarse-to-Fine Diffusional Propagation of Geometry and Appearance for Versatile 3D Inpainting
:star:code - RI3D Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors
- SAGI Semantically Aligned and Uncertainty Guided AI Image Inpainting
:house:project - OmniPaint Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting
- Inpaint4Drag Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping
:house:project - Trans-Adapter A Plug-and-Play Framework for Transparent Image Inpainting
- Perspective-aware 3D Gaussian Inpainting with Multi-view Consistency
:house:project - Ultra High-Resolution Image Inpainting with Patch-Based Content Consistency Adapter
:star:code
- Image restoration
- EAMamba: Efficient All-Around Vision State Space Model for Image Restoration
- Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints
:star:code - Robust Adverse Weather Removal via Spectral-based Spatial Grouping
- Exploiting Diffusion Prior for Task-driven Image Restoration
- Reverse Convolution and Its Applications to Image Restoration
:star:code - Enhancing Image Restoration Transformer via Adaptive Translation Equivariance
- MP-HSIR A Multi-Prompt Framework for Universal Hyperspectral Image Restoration
:star:code - FoundIR Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration
- MOERL When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration
- Conditional Visual Autoregressive Modeling for Pathological Image Restoration
- Dual-level Prototype Learning for Composite Degraded Image Restoration
- UniRes Universal Image Restoration for Complex Degradations
- Devil is in the Uniformity Exploring Diverse Learners within Transformer for Image Restoration
- A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions
:house:project - MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration
- Decouple to Reconstruct: High Quality UHD Restoration via Active Feature Disentanglement and Reversible Fusion
- Robust Low-light Scene Restoration via Illumination Transition
:house:project - Noise-Modeled Diffusion Models for Low-Light Spike Image Restoration
:star:code - LD-RPS Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling
:star:code - Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration
- Image/video enhancement
- MobileIE: An Extremely Lightweight and Effective ConvNet for Real-Time Image Enhancement on Mobile Devices
:star:code - CWNet: Causal Wavelet Network for Low-Light Image Enhancement
:star:code - Learning Pixel-adaptive Multi-layer Perceptrons for Real-time Image Enhancement
- GT-Mean Loss: A Simple Yet Effective Solution for Brightness Mismatch in Low-Light Image Enhancement
:star:code - Learnable Feature Patches and Vectors for Boosting Low-light Image Enhancement without External Knowledge
- Uncover Treasures in DCT Advancing JPEG Quality Enhancement by Exploiting Latent Correlations
- Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables
:star:code - GM-MoE Low-Light Enhancement with Gated-Mechanism Mixture-of-Experts
:star:code - Exploring View Consistency for Scene-Adaptive Low-Light Light Field Image Enhancement
- Low-Light Image Enhancement Using Event-Based Illumination Estimation
:star:code - Task-Decoupled Bezier Surface Constraint for Uneven Low-Light Image Enhancement
- PASD A Pixel-Adaptive Swarm Dynamics Approach for Unsupervised Low-Light Image Enhancement
- From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning
- RetinexMCNet A Memory Controller Dominated Network for Low-Light Video Enhancement Based on Retinex
- Aligning Global Semantics and Local Textures in Generative Video Enhancement
- Video color grading
- Video deblurring
- Event-guided Unified Framework for Low-light Video Enhancement, Frame Interpolation, and Deblurring
- Separation for Better Integration Disentangling Edge and Motion in Event-based Deblurring
- EVDM Event-based Real-world Video Deblurring with Mamba
- ClearSight Human Vision-Inspired Solutions for Event-Based Motion Deblurring
- Quality assessment
- Video inpainting
- Colorization
- Outpainting
- Image rectification
1.Other
- MMOne: Representing Multiple Modalities in One Scene
:star:code - PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling
- EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
- Learning to See in the Extremely Dark
:star:code - Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation
:star:code - CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection
- Global and Local Entailment Learning for Natural World Imagery
:star:code - Attention to Burstiness: Low-Rank Bilinear Prompt Tuning
- Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding
:star:code - CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models
- Learning Counterfactually Decoupled Attention for Open-World Model Attribution
:star:code - Where, What, Why: Towards Explainable Driver Attention Prediction
- VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions
:star:code - AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm
:star:code - HiNeuS High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity
:house:project - Rectifying Magnitude Neglect in Linear Attention
:star:code - Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes
:star:code - Zero-shot Inexact CAD Model Alignment from a Single Image
:star:code - MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion
:star:code - Learning Normals of Noisy Points by Local Gradient-Aware Surface Filtering
:star:code - Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images
- Unlearning the Noisy Correspondence Makes CLIP More Robust
:star:code - Less is More: Empowering GUI Agent with Context-Aware Simplification
- Voyaging into Unbounded Dynamic Scenes from a Single View
:star:code - TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation
:star:code - Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations
:star:code - IAP: Invisible Adversarial Patch Attack through Perceptibility-Aware Localization and Perturbation Optimization
- Synchronizing Task Behavior: Aligning Multiple Tasks during Test-Time Training
- From Enhancement to Understanding: Build a Generalized Bridge for Low-light Vision via Semantically Consistent Unsupervised Fine-tuning
- ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints
- DAA*: Deep Angular A Star for Image-based Path Planning
- Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves
- GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space
- Supercharging Floorplan Localization with Semantic Rays
- Imbalance in Balance: Online Concept Balancing in Generation Models
- MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP
:star:code - DAViD: Data-efficient and Accurate Vision Models from Synthetic Data
:house:project - DCHM: Depth-Consistent Human Modeling for Multiview Detection
:star:code - Open-set Cross Modal Generalization via Multimodal Unified Representation
:star:code - FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers
:star:code - M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision
:star:code - Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility
- Joint Asymmetric Loss for Learning with Noisy Labels
:star:code - AnimalClue: Recognizing Animals by their Traces
:star:code - Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting
- Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation
- Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry
- Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting
- CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
:star:code - TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning
:star:code - ShortFT: Diffusion Model Alignment via Shortcut-based Fine-Tuning
- DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching
:star:code - SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions
:star:code
:house:project - Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri-Rao Product
- DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space
:star:code - CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective
- GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration
:star:code
:star:code - Where am I? Cross-View Geo-localization with Natural Language Descriptions
:star:code - SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models
:star:code
:star:code - How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes
:star:code
:star:code - Symmetry Understanding of 3D Shapes via Chirality Disentanglement
:star:code
:house:project :house:project - WIR3D Visually-Informed and Geometry-Aware 3D Shape Abstraction
- Learning an Implicit Physics Model for Image-based Fluid Simulation
:star:code - The Escalator Problem: Identifying Implicit Motion Blindness in AI for Accessibility
- Deep Space Weather Model: Long-Range Solar Flare Prediction from Multi-Wavelength Images
:star:code
:star:code - Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)
:star:code - CObL: Toward Zero-Shot Ordinal Layering without User Prompting
:house:project - UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale
:star:code - TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos
:star:code - Combinative Matching for Geometric Shape Assembly
:star:code - Harnessing Input-Adaptive Inference for Efficient VLN
:star:code - Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
:star:code - Towards a Unified Copernicus Foundation Model for Earth Vision
:star:code - UnZipLoRA Separating Content and Style from a Single Image
- FlowDPS Flow-Driven Posterior Sampling for Inverse Problems
- Closed-Loop Transfer for Weakly-supervised Affordance Grounding
- ReconDreamer Harmonizing Generative and Reconstructive Models for Driving Scene Representation
- Generative Zoo
- PAN-Crafter Learning Modality-Consistent Alignment for PAN-Sharpening
- SANA-Sprint One-Step Diffusion with Continuous-Time Consistency Distillation
- Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts
- Demeter A Parametric Model of Crop Plant Morphology from the Real World
- S3E Self-Supervised State Estimation for Radar-Inertial System
- PROGRESSOR A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement
- SHeaP Self-Supervised Head Geometry Predictor Learned via 2D Gaussians
- Cooperative Pseudo Labeling for Unsupervised Federated Classification
:star:code - SignRep Enhancing Self-Supervised Sign Representations
- Self-Supervised Sparse Sensor Fusion for Long Range Perception
- AIM Amending Inherent Interpretability via Self-Supervised Masking
- Unsupervised Identification of Protein Compositions and Conformations via Implicit Content-Transformation Disentanglement
- Efficient Unsupervised Shortcut Learning Detection and Mitigation in Transformers
- DASH 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering
:star:code - DIP Unsupervised Dense In-Context Post-training of Visual Representations
- Progressive Distribution Bridging Unsupervised Adaptation for Large-scale Pre-trained Models via Adaptive Auxiliary Data
- GloPER Unsupervised Animal Pattern Extraction from Local Reconstruction
- Gain-MLP Improving HDR Gain Map Encoding via a Lightweight MLP
- Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding
- AdaptiveAE An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes
- Robust Unfolding Network for HDR Imaging with Modulo Cameras
- DEPTHOR Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image
:star:code - GaSLight Gaussian Splats for Spatially-Varying Lighting in HDR
:house:project - Neural Compression for 3D Geometry Sets
:star:code - Wide2Long Learning Lens Compression and Perspective Adjustment for Wide-Angle to Telephoto Translation
- Dataset Distillation as Data Compression A Rate-Utility Perspective
- SIMS Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation
- Integrating Visual Interpretation and Linguistic Reasoning for Geometric Problem Solving
:star:code - VisRL Intention-Driven Visual Perception via Reinforced Reasoning
- MINERVA Evaluating Complex Video Reasoning
:star:code - From Easy to Hard The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning
- VideoSetDiff Identifying and Reasoning Similarities and Differences in Similar Videos
- Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
- DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
- A Unified Framework for Motion Reasoning and Generation in Human Interaction
- VEGGIE Instructional Editing and Reasoning Video Concepts with Grounded Generation
- 3DSRBench A Comprehensive 3D Spatial Reasoning Benchmark
- MMCR Benchmarking Cross-Source Reasoning in Scientific Papers
- Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA
:house:project - VRBench A Benchmark for Multi-Step Reasoning in Long Narrative Videos
- A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets
- From Linearity to Non-Linearity How Masked Autoencoders Capture Spatial Correlations
- X-Capture An Open-Source Portable Device for Multi-Sensory Learning
- MonoMobility Zero-Shot 3D Mobility Analysis from Monocular Videos
- Zero-Shot Vision Encoder Grafting via LLM Surrogates
- Few-Shot Pattern Detection via Template Matching and Regression
- Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation
- FIND Few-Shot Anomaly Inspection with Normal-Only Multi-Modal Data
- SpikeDiff Zero-shot High-Quality Video Reconstruction from Chromatic Spike Camera and Sub-millisecond Spike Streams
- TikZero Zero-Shot Text-Guided Graphics Program Synthesis
- FontAnimate High Quality Few-shot Font Generation via Animating Font Transfer Process
- MAVFlow Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
:star:code - Unknown Text Learning for CLIP-based Few-Shot Open-set Recognition
- Zero-Shot Compositional Video Learning with Coding Rate Reduction
- Deeply Supervised Flow-Based Generative Models
- X2I Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
- Continual Personalization for Diffusion Models
- Federated Continual Instruction Tuning
:star:code - Hybrid-TTA Continual Test-time Adaptation via Dynamic Domain Shift Detection
- SMoLoRA Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning
- A Framework for Double-Blind Federated Adaptation of Foundation Models
:star:code - Stable Score Distillation
:star:code - LoRA-FAIR Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement
- Class-Wise Federated Averaging for Efficient Personalization
- EFTViT Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Clients
- Tensor-aggregated LoRA in Federated Fine-tuning
- CMAD Correlation-Aware and Modalities-Aware Distillation for Multimodal Sentiment Analysis with Missing Modalities
:star:code - Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data
- GlassWizard Harvesting Diffusion Priors for Glass Surface Detection
- CLIPSym Delving into Symmetry Detection with CLIP
- Axis-level Symmetry Detection with Group-Equivariant Representation
- MistSense Versatile Online Detection of Procedural and Execution Mistakes
- Moderating the Generalization of Score-based Generative Model
:star:code - On the Generalization of Representation Uncertainty in Earth Observation
:star:code - Learning Few-Step Diffusion Models by Trajectory Distribution Matching
- Recovering Parametric Scenes from Very Few Time-of-Flight Pixels
- Can We Achieve Efficient Diffusion Without Self-Attention? Distilling Self-Attention into Convolutions
- Predict-Optimize-Distill A Self-Improving Cycle for 4D Object Understanding
- ArgoTweak Towards Self-Updating HD Maps through Structured Priors
:house:project - ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
- Iris Breaking GUI Complexity with Adaptive Focus and Self-Refining
- Bridging the Gap between Brain and Machine in Interpreting Visual Semantics: Towards Self-adaptive Brain-to-Text Decoding
:star:code - Learning Neural Scene Representation from iToF Imaging
- ACE-G Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
- Teleportraits Training-Free People Insertion into Any Scene
- From Image to Video An Empirical Study of Diffusion Representations
- REGEN Learning Compact Video Embedding with (Re-)Generative Decoder
- Generative Video Bi-flow
- VISION-XL High Definition Video Inverse Problem Solver using Latent Image Diffusion Models
:house:project - TACO Taming Diffusion for in-the-wild Video Amodal Completion
- Free-MoRef Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference
:star:code - MultiModal Action Conditioned Video Simulation
- Aligning Moments in Time using Video Queries
- Make Your Training Flexible Towards Deployment-Efficient Video Models
:star:code - Stereo Any Video Temporally Consistent Stereo Matching
- SAFT Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video
- Everything is a Video Unifying Modalities through Next-Frame Prediction
- FlowStyler Artistic Video Stylization via Transformation Fields Transports
- Long-Context State-Space Video World Models
- PVChat Personalized Video Chat with One-Shot Learning
- Instance-Level Video Depth in Groups Beyond Occlusions
- SKALD Learning-Based Shot Assembly for Coherent Multi-Shot Video Creation
- Preacher Paper-to-Video Agentic System
- TemCoCo Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration
:star:code - Spatial Alignment and Temporal Matching Adapter for Video-Radar Remote Physiological Measurement
- Light-A-Video Training-free Video Relighting via Progressive Light Fusion
- VoiceCraft-Dub Automated Video Dubbing with Neural Codec Language Models
- RnGCam High-speed video from rolling global shutter measurements
- AnnofreeOD Detecting All Classes at Low Frame Rates Without Human Annotations
- Expressive Talking Human from Single-Image with Imperfect Priors
- Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling
- UniPortrait A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
- DreamDance Animating Human Images by Enriching 3D Geometry Cues from 2D Poses
- ATLAS Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
- CompleteMe Reference-based Human Image Completion
- Visual Interestingness Decoded How GPT-4o Mirrors Human Interests
- 2HandedAfforder Learning Precise Actionable Bimanual Affordances from Human Videos
- Unraveling the Smoothness Properties of Diffusion Models A Gaussian Mixture Perspective
- GausSim Foreseeing Reality by Gaussian Simulator for Elastic Objects
- 3DGS-LM Faster Gaussian-Splatting Optimization with Levenberg-Marquardt
- RhythmGuassian Repurposing Generalizable Gaussian Model For Remote Physiological Measurement
:star:code - GaussianReg: Rapid 2D/3D Registration for Emergency Surgery via Explicit 3D Modeling with Gaussian Primitives
:star:code - Learning Efficient and Generalizable Human Representation with Human Gaussian Model
- VoluMe - Authentic 3D Video Calls from Live Gaussian Splat Prediction
- MorphoGen Efficient Unconditional Generation of Long-Range Projection Neuronal Morphology via a Global-to-Local Framework
:star:code - Less-to-More Generalization Unlocking More Controllability by In-Context Generation
:star:code - Hi3DGen High-fidelity 3D Geometry Generation from Images via Normal Bridging
- The Curse of Conditions Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation
- SynCity Training-Free Generation of 3D Worlds
- EvolvingGrasp Evolutionary Grasp Generation via Efficient Preference Alignment
- NuiScene Exploring Efficient Generation of Unbounded Outdoor Scenes
- RomanTex Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis
- DreamCube RGB-D Panorama Generation via Multi-plane Synchronization
- Aligning Constraint Generation with Design Intent in Parametric CAD
- Multidimensional Byte Pair Encoding Shortened Sequences for Improved Visual Data Generation
- Oasis One Image is All You Need for Multimodal Instruction Data Synthesis
- MetaMorph Multimodal Understanding and Generation via Instruction Tuning
- Training-free Generation of Temporally Consistent Rewards from VLMs
:house:project - Text2Outfit Controllable Outfit Generation with Multimodal Language Models
- Can Knowledge be Transferred from Unimodal to Multimodal? Investigating the Transitivity of Multimodal Knowledge Editing
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
:star:code - HERO Human Reaction Generation from Videos
- Diffuman4D 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
:house:project - PerLDiff Controllable Street View Synthesis Using Perspective-Layout Diffusion Model
- Stable Virtual Camera Generative View Synthesis with Diffusion Models
- What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?
- Latent Swap Joint Diffusion for 2D Long-Form Latent Generation
- LiON-LoRA Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion
- DreamLayer Simultaneous Multi-Layer Generation via Diffusion Model
- StreamDiffusion A Pipeline-level Solution for Real-Time Interactive Generation
- LayerTracer Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
- MatchDiffusion Training-free Generation of Match-Cuts
- SpinMeRound Consistent Multi-View Identity Generation Using Diffusion Models
- Omegance A Single Parameter for Various Granularities in Diffusion-Based Synthesis
:star:code - TaxaDiffusion Progressively Trained Diffusion Model for Fine-Grained Species Generation
- Ph-GAN Physics-Inspired GAN for Generating SAR Images Under Limited Data
- Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation
- Orchid Image Latent Diffusion for Joint Appearance and Geometry Generation
- RAGDiffusion Faithful Cloth Generation via External Knowledge Assimilation
- Controllable Weather Synthesis and Removal with Video Diffusion Models
- SG-LDM Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion
- ContraGS Codebook-Condensed and Trainable Gaussian Splatting for Fast Memory-Efficient Reconstruction
- StreamGS Online Generalizable Gaussian Splatting Reconstruction for Unposed Image Streams
- EvaGaussians Event Stream Assisted Gaussian Splatting from Blurry Images
- GeoSplatting Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering
- Splat-LOAM Gaussian Splatting LiDAR Odometry and Mapping
- 7DGS Unified Spatial-Temporal-Angular Gaussian Splatting
- STD-GS Exploring Frame-Event Interaction for SpatioTemporal-Disentangled Gaussian Splatting to Reconstruct High-Dynamic Scene
- GS-ID Illumination Decomposition on Gaussian Splatting via Adaptive Light Aggregation and Diffusion-Guided Material Priors
- Gaussian Splatting with Discretized SDF for Relightable Assets
- GS-LIVM Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting
- Self-Calibrating Gaussian Splatting for Large Field-of-View Reconstruction
:house:project :house:project - 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update
:house:project - EVER Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis
- X2-Gaussian 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction
:star:code - GaussianVideo Efficient Video Representation via Hierarchical Gaussian Splatting
- LUDVIG Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes
- RogSplat Robust Gaussian Splatting via Generative Priors
- Instant GaussianImage A Generalizable and Self-Adaptive Image Representation via 2D Gaussian Splatting
- Seam360GS Seamless 360° Gaussian Splatting from Real-World Omnidirectional Images
- Tile-wise vs Image-wise Random-Tile Loss and Training Paradigm for Gaussian Splatting
- Subjective Camera 1.0 Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion
- LeanVAE An Ultra-Efficient Reconstruction VAE for Video Diffusion Models
:star:code - ReassembleNet Learnable Keypoints and Diffusion for 2D Fresco Reconstruction
- Event-guided HDR Reconstruction with Diffusion Priors
- CryoFastAR Fast Cryo-EM Ab initio Reconstruction Made Easy
- Physical Degradation Model-Guided Interferometric Hyperspectral Reconstruction with Unfolding Transformer
:star:code - NGD Neural Gradient Based Deformation for Monocular Garment Reconstruction
- Discretized Gaussian Representation for Tomographic Reconstruction
:star:code - CO2-Net A Physics-Informed Spatio-Temporal Model for Global Surface CO2 Reconstruction
:star:code - Sparfels Fast Reconstruction from Sparse Unposed Imagery
- Teeth Reconstruction and Performance Capture Using a Phone Camera
- PRM Photometric Stereo based Large Reconstruction Model
- Dual-S3D Hierarchical Dual-Path Selective SSM-CNN for High-Fidelity Implicit Reconstruction
- Long-LRM Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
:house:project :house:project - Neurons Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction
:star:code - PhysTwin Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos
:house:project :house:project - Boundary Probing for Input Privacy Protection When Using LMM Services
- FG-OrIU Towards Better Forgetting via Feature-Gradient Orthogonality for Incremental Unlearning
- CopyrightShield Enhancing Diffusion Model Security Against Copyright Infringement Attacks
- StolenLoRA Exploring LoRA Extraction Attacks via Synthetic Data
- Membership Inference Attacks with False Discovery Rate Control
- FastJSMA Accelerating Jacobian-based Saliency Map Attacks through Gradient Decoupling
- On the Robustness Tradeoff in Fine-Tuning
- MOSAIC Generating Consistent Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments
- AutoComPose Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs
- REDUCIO Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents
:star:code - Video Motion Graphs
- DisCoRD Discrete Tokens to Continuous Motion via Rectified Flow Decoding
- Global Motion Corresponder for 3D Point-Based Scene Interpolation under Large Motion
- Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation
:house:project - Easi3R Estimating Disentangled Motion from DUSt3R Without Training
- Sequential Gaussian Avatars with Hierarchical Motion Context
:house:project :house:project - PoseSyn Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data
- PS-Mamba Spatial-Temporal Graph Mamba for Pose Sequence Refinement
- STaR Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints
:star:code - EMD Explicit Motion Modeling for High-Quality Street Gaussian Splatting
:house:project - CoMoGaussian Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images
- Less is More Improving Motion Diffusion Models with Sparse Keyframes
- Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video
- Uncalibrated Structure from Motion on a Sphere
:star:code - A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
- MikuDance Animating Character Art with Mixed Motion Dynamics
- What If Understanding Motion Through Sparse Interactions
- Disrupting Model Merging A Parameter-Level Defense Without Sacrificing Accuracy
:star:code - RadGPT Constructing 3D Image-Text Tumor Datasets
- DDB Diffusion Driven Balancing to Address Spurious Correlations
:star:code - Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond
:star:code - G2PDiffusion Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion
- Spatial-Temporal Aware Visuomotor Diffusion Policy Learning
- CHORDS Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers
- EDiT Efficient Diffusion Transformers with Linear Compressed Attention
- GenHancer Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
- Diffusion Image Prior
- Diffusion Curriculum Synthetic-to-Real Data Curriculum via Image-Guided Diffusion
- Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints
- IntroStyle Training-Free Introspective Style Attribution using Diffusion Features
- GeometryCrafter Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
- Towards Robust Defense against Customization via Protective Perturbation Resistant to Diffusion-based Purification
- Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
:star:code - Accelerating Diffusion Sampling via Exploiting Local Transition Coherence
:house:project - Flow to the Mode Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
- UniPhys Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control
- A Simple yet Mighty Hartley Diffusion Versatilist for Generalizable Dense Vision Tasks
- JointDiT Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
- REPA-E Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers
- Textured 3D Regenerative Morphing with 3D Diffusion Prior
- Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching
- DICE Staleness-Centric Optimizations for Parallel Diffusion MoE Inference
:star:code - GameFactory Creating New Games with Generative Interactive Videos
- SA-LUT Spatial Adaptive 4D Look-Up Table for Photorealistic Style Transfer
- SparseVILA Decoupling Visual Sparsity for Efficient VLM Inference
- Trust but Verify Programmatic VLM Evaluation in the Wild
:house:project - GTR Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
- Token-Efficient VLM High-Resolution Image Understanding via Dynamic Region Proposal
- TerraMind Large-Scale Generative Multimodality for Earth Observation
:star:code - MUSE-VL Modeling Unified VLM through Semantic Discrete Encoding
- Radiant Foam Real-Time Differentiable Ray Tracing
- Attention to the Burstiness in Visual Prompt Tuning
- Enhancing Transformers Through Conditioned Embedded Tokens
- RALoc Enhancing Outdoor LiDAR Localization via Rotation Awareness
- HouseTour A Virtual Real Estate A(I)gent
- Towards Performance Consistency in Multi-Level Model Collaboration
- Polarimetric Neural Field via Unified Complex-Valued Wave Representation
- Visual Intention Grounding for Egocentric Assistants
- TinyViM Frequency Decoupling for Tiny Hybrid Vision Mamba
:star:code - A Recipe for Generating 3D Worlds from a Single Image
- Social Debiasing for Fair Multi-modal LLMs
- DOGR Towards Versatile Visual Document Grounding and Referring
:star:code - Meta-Learning Dynamic Center Distance Hard Sample Mining for Learning with Noisy Labels
- AstroLoc Robust Space to Ground Image Localizer
- Chimera Improving Generalist Model with Domain-Specific Experts
- Less is More Empowering GUI Agent with Context-Aware Simplification
- Learning Hierarchical Line Buffer for Image Processing
- Find Any Part in 3D
:house:project :house:project - VA-MoE Variables-Adaptive Mixture of Experts for Incremental Weather Forecasting
:star:code - Aether Geometric-Aware Unified World Modeling
- Any2AnyTryon Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks
- Met2Net A Decoupled Two-Stage Spatio-Temporal Forecasting Model for Complex Meteorological Systems
- Memory-Efficient 4-bit Preconditioned Stochastic Optimization
- On the Recovery of Cameras from Fundamental Matrices
- Visual-RFT Visual Reinforcement Fine-Tuning
- E-SAM Training-Free Segment Every Entity Model
- SpatialSplat Efficient Semantic 3D from Sparse Unposed Images
- Is Less More Exploring Token Condensation as Training-free Test-time Adaptation
:star:code - CVPT Cross Visual Prompt Tuning
:star:code - Unfolding-Associative Encoder-Decoder Network with Progressive Alignment for Pansharpening
- CAD-Assistant Tool-Augmented VLLMs as Generic CAD Task Solvers
- HERMES temporal-coHERent long-forM understanding with Episodes and Semantics
:house:project - ImHead A Large-scale Implicit Morphable Model for Localized Head Modeling
- Sim-DETR Unlock DETR for Temporal Sentence Grounding
- Wavelet Policy Lifting Scheme for Policy Learning in Long-Horizon Tasks
- SeqGrowGraph Learning Lane Topology as a Chain of Graph Expansions
- CARP Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
- Transformer-based Tooth Alignment Prediction with Occlusion and Collision Constraints
:house:project - Multi-modal Identity Extraction
- Arti-PG A Toolbox for Procedurally Synthesizing Large-Scale and Diverse Articulated Objects with Rich Annotations
:star:code - FlowChef Steering of Rectified Flow Models for Controlled Generations
:house:project - Semantic Equitable Clustering A Simple and Effective Strategy for Clustering Vision Tokens
- Towards Safer and Understandable Driver Intention Prediction
- AffordDexGrasp Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance
- Beyond the Limits Overcoming Negative Correlation of Activation-Based Training-Free NAS
- Cross-Subject Mind Decoding from Inaccurate Representations
- UINavBench A Framework for Comprehensive Evaluation of Interactive Digital Agents
- Knowledge Transfer from Interaction Learning
- GEOPARD Geometric Pretraining for Articulation Prediction in 3D Shapes
- ARMO Autoregressive Rigging for Multi-Category Objects
- HyperGCT A Dynamic Hyper-GNN-Learned Geometric Constraint for 3D Registration
- Improving Rectified Flow with Boundary Conditions
- Clink Chop Thud - Learning Object Sounds from Real-World Interactions
- ISP2HRNet Learning to Reconstruct High Resolution Image from Irregularly Sampled Pixels via Hierarchical Gradient Learning
:star:code - Decoupled Multi-Predictor Optimization for Inference-Efficient Model Tuning
- GARF Learning Generalizable 3D Reassembly for Real-World Fractures
- Parameter-Efficient Adaptation of Geospatial Foundation Models through Embedding Deflection
:star:code - MOSCATO Predicting Multiple Object State Change Through Actions
- GCAV A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability
:star:code - ModSkill Physical Character Skill Modularization
- DAMap Distance-aware MapNet for High Quality HD Map Construction
- Text2VDM Text to Vector Displacement Maps for Expressive and Interactive 3D Sculpting
- SuperMat Physically Consistent PBR Material Estimation at Interactive Rates
:house:project - Deep Adaptive Unfolded Network via Spatial Morphology Stripping and Spectral Filtration for Pan-sharpening
:star:code - Learning Normal Flow Directly From Events
:star:code - FuXi-RTM A Physics-Guided Prediction Framework with Radiative Transfer Modeling
- Enhanced Pansharpening via Quaternion Spatial-Spectral Interactions
:star:code - From Holistic to Localized Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning
- Spectral Image Tokenizer
- ObjectRelator Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives
- Mitigating Object Hallucinations via Sentence-Level Early Intervention
:star:code - NETracer A Topology-Aware Iterative Tracing Approach for Tubular Structure Extraction
- SRefiner Soft-Braid Attention for Multi-Agent Trajectory Refinement
- Planar Affine Rectification from Local Change of Scale and Orientation
- Co-Painter Fine-Grained Controllable Image Stylization via Implicit Decoupling and Adaptive Injection
- TAR3D Creating High-Quality 3D Assets via Next-Part Prediction
- MeasureXpert Automatic Anthropometric Measurement Extraction from Two Unregistered Partial Posed and Dressed Body Scans
- AIRA Activation-Informed Low-Rank Adaptation for Large Models
:star:code - 3D Test-time Adaptation via Graph Spectral Driven Point Shift
- PlaceIt3D Language-Guided Object Placement in Real 3D Scenes
- TR-PTS Task-Relevant Parameter and Token Selection for Efficient Tuning
:star:code - Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
- Jigsaw Imagining Complete Shape Priors for Object Reassembly
- WildSeg3D Segment Any 3D Objects in the Wild from 2D Images
:star:code - INSTINCT Instance-Level Interaction Architecture for Query-Based Collaborative Perception
:star:code - AnimeGamer Infinite Anime Life Simulation with Next Game State Prediction
- UIPro Unleashing Superior Interaction Capability For GUI Agents
- O-MaMa Learning Object Mask Matching between Egocentric and Exocentric Views
- EvRT-DETR Latent Space Adaptation of Image Detectors for Event-based Vision
- LIFT Latent Implicit Functions for Task- and Data-Agnostic Encoding
- Manual-PA Learning 3D Part Assembly from Instruction Diagrams
- PLMP - Point-Line Minimal Problems for Projective SfM
- Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography
:house:project :house:project - Spatio-Spectral Pattern Illumination for Direct and Indirect Separation from a Single Hyperspectral Image
- Neural Solver of Dichromatic Reflection Model for Specular Highlight Removal
- Att-Adapter A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder
- CE-FAM Concept-Based Explanation via Fusion of Activation Maps
- Recover Biological Structure from Sparse-View Diffraction Images with Neural Volumetric Prior
- EgoAgent A Joint Predictive Agent Model in Egocentric Worlds
:star:code - Controllable Latent Space Augmentation for Digital Pathology
- Straighten Viscous Rectified Flow via Noise Optimization
- SL2A-INR Single-Layer Learnable Activation for Implicit Neural Representation
- LLaVA-3D A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities
- RESCUE Crowd Evacuation Simulation via Controlling SDM-United Characters
- WonderTurbo Generating Interactive 3D World in 0.72 Seconds
- Mastering Collaborative Multi-modal Data Selection A Focus on Informativeness Uniqueness and Representativeness
- PanoLlama Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
- Dense Policy Bidirectional Autoregressive Learning of Actions
- Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating
- Removing Out-of-Focus Reflective Flares via Color Alignment
- CuMPerLay Learning Cubical Multiparameter Persistence Vectorizations
- Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models
:star:code - Improved Noise Schedule for Diffusion Training
- Sculpting Memory Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization
:star:code - From Reusing to Forecasting Accelerating Diffusion Models with TaylorSeers
:star:code - Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography
:house:project - MamTiff-CAD Multi-Scale Latent Diffusion with Mamba for Complex Parametric Sequence
- Rethinking DPO-style Diffusion Aligning Frameworks
- Unified Multi-Agent Trajectory Modeling with Masked Trajectory Diffusion
- Understanding Flatness in Generative Models Its Role and Benefits
- SliderSpace Decomposing the Visual Capabilities of Diffusion Models
- FreeScale Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
- Representing 3D Shapes with 64 Latent Vectors for 3D Diffusion Models
- End-to-End Multi-Modal Diffusion Mamba
- Guiding Noisy Label Conditional Diffusion Models with Score-based Discriminator Correction
- Make Me Happier Evoking Emotions Through Image Diffusion Models
- Guiding Diffusion Models with Adaptive Negative Sampling Without External Resources
- Differentially Private Fine-Tuning of Diffusion Models
:star:code - One-Step Specular Highlight Removal with Adapted Diffusion Models
- Entropy-Adaptive Diffusion Policy Optimization with Dynamic Step Alignment
- Beyond Blur A Fluid Perspective on Generative Diffusion Models
- ConceptSplit Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement
:star:code - IMG Calibrating Diffusion Models via Implicit Multimodal Guidance
:star:code - PEFTDiff Diffusion-Guided Transferability Estimation for Parameter-Efficient Fine-Tuning
- HiGarment Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image
:star:code - TREAD Token Routing for Efficient Architecture-agnostic Diffusion Training
- DiMO Distilling Masked Diffusion Models into One-step Generator
- An Inversion-based Measure of Memorization for Diffusion Models
- Adding Additional Control to One-Step Diffusion with Joint Distribution Matching
- PLADIS Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity
:star:code :house:project - X-Prompt Generalizable Auto-Regressive Visual Learning with In-Context Prompting
- X-Fusion Introducing New Modality to Frozen Large Language Models
- PRIMAL Physically Reactive and Interactive Motor Model for Avatar Learning
- Toward Material-Agnostic System Identification from Videos
- TransiT Transient Transformer for Non-line-of-sight Videography
- An Empirical Study of Autoregressive Pre-training from Videos
- Open-World Skill Discovery from Unsegmented Demonstration Videos
:house:project - Error Recognition in Procedural Videos using Generalized Task Graph
- From Gallery to Wrist Realistic 3D Bracelet Insertion in Videos
- Synchronization of Multiple Videos
:house:project - Vamba Understanding Hour-Long Videos with Hybrid Mamba-Transformers
- VideoOrion Tokenizing Object Dynamics in Videos
- Snakes and Ladders Two Steps Up for VideoMamba
- ViSpeak Visual Instruction Feedback in Streaming Videos
- Layer-wise Vision Injection with Disentangled Attention for Efficient LVLMs
:house:project :house:project - Teaching VLMs to Localize Specific Objects from In-context Examples
- PS3 A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction
- Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
- Exploring The Visual Feature Space for Multimodal Neural Decoding
- InfoBridge Balanced Multimodal Integration through Conditional Dependency Modeling
:house:project :house:project - Switch-a-View View Selection Learned from Unlabeled In-the-wild Videos
- VideoMiner Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization
:star:code - Beyond the Frame Generating 360° Panoramic Videos from Perspective Videos
- YOLOE Real-Time Seeing Anything
:star:code - CAFA a Controllable Automatic Foley Artist
- Neuroverse3D Developing In-Context Learning Universal Model for Neuroimaging in 3D
- COSTARR Consolidated Open Set Technique with Attenuation for Robust Recognition
- Agreement aware and dissimilarity oriented GLOM
- Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer
- SP2T Sparse Proxy Attention for Dual-stream Point Transformer
- Hallucinatory Image Tokens A Training-free EAZY Approach to Detecting and Mitigating Object Hallucinations in LVLMs
- BabyVLM Data-Efficient Pretraining of VLMs Inspired by Infant Learning
- R1-Onevision Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
:star:code - EgoM2P Egocentric Multimodal Multitask Pretraining
- Scaling Laws for Native Multimodal Models
- Generate Transduct Adapt Iterative Transduction with VLMs
- LLaVA-PruMerge Adaptive Token Reduction for Efficient Large Multimodal Models
- HRScene How Far Are VLMs from Effective High-Resolution Image Understanding
- Unified Multimodal Understanding via Byte-Pair Visual Encoding
- FinMMR Make Financial Numerical Reasoning More Multimodal Comprehensive and Challenging
- Efficient Visual Place Recognition Through Multimodal Semantic Knowledge Integration
- RMultiplex200K Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications
- ERNet Efficient Non-Rigid Registration Network for Point Sequences
- Not all Views are Created Equal Analyzing Viewpoint Instabilities in Vision Foundation Models
- HumorDB Can AI understand graphical humor
- InfiniteYou Flexible Photo Recrafting While Preserving Your Identity
- Synchronizing Task Behavior Aligning Multiple Tasks during Test-Time Training
- Correspondence-Free Fast and Robust Spherical Point Pattern Registration
- Voyaging into Perpetual Dynamic Scenes from a Single View
:house:project - C4D 4D Made from 3D through Dual Correspondences
- Soft Local Completeness Rethinking Completeness in XAI
- When and Where do Data Poisons Attack Textual Inversion
:star:code :star:code2 - Quanta Neural Networks From Photons to Perception
- Auto-Regressive Transformation for Image Alignment
- A Unified Framework to BRIDGE Complete and Incomplete Deep Multi-View Clustering under Non-IID Missing Patterns
- CODA Repurposing Continuous VAEs for Discrete Tokenization
- AnyCalib On-Manifold Learning for Model-Agnostic Single-View Camera Calibration
:star:code - VAGUE Visual Contexts Clarify Ambiguous Expressions
- WIPES Wavelet-based Visual Primitives
- CanFields Consolidating Diffeomorphic Flows for Non-Rigid 4D Interpolation from Arbitrary-Length Sequences
:house:project - Towards Foundational Models for Single-Chip Radar
- VALLR Visual ASR Language Model for Lip Reading
- Empowering Your Pansharpening Models with Generalizability Unified Distribution is All You Need
:star:code - Laboring on less labors RPCA Paradigm for Pan-sharpening
- DiMPLe - Disentangled Multi-Modal Prompt Learning Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation
- From Panels to Prose Generating Literary Narratives from Comics
- SparseFlex High-Resolution and Arbitrary-Topology 3D Shape Modeling
- Lidar Waveforms are Worth 40x128x33 Words
- Balanced Sharpness-Aware Minimization for Imbalanced Regression
- PRVQL Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization
:star:code - Bayesian-Inspired Space-Time Superpixels
- Hypergraph Clustering Network with Partial Attribute Imputation
- GECO Geometrically Consistent Embedding with Lightspeed Inference
- LoftUp Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
:star:code - Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening
- LLaFEA Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
- Beyond cls Exploring the True Potential of Masked Image Modeling Representations
- SpikePack Enhanced Information Flow in Spiking Neural Networks with High Hardware Compatibility
- After the Party Navigating the Mapping From Color to Ambient Lighting
:star:code - A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization
:star:code - Efficient Fine-Tuning of Large Models via Nested Low-Rank Adaptation
:star:code - SCAN Bootstrapping Contrastive Pre-training for Data Efficiency
:star:code - Visual Surface Wave Elastography Revealing Subsurface Physical Properties via Visible Surface Waves
- Discontinuity-aware Normal Integration for Generic Central Camera Models
- LazyMAR Accelerating Masked Autoregressive Models via Feature Caching
:star:code - SAC-GNC SAmple Consensus for adaptive Graduated Non-Convexity
- Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations
- Granular Concept Circuits Toward a Fine-Grained Circuit Discovery for Concept Representations
:star:code - Efficient Event Camera Data Pretraining with Adaptive Prompt Fusion
- Augmented Mass-Spring Model for Real-Time Dense Hair Simulation
- EYE3 Turn Anything into Naked-eye 3D
- Boosting Class Representation via Semantically Related Instances for Robust Long-Tailed Learning with Noisy Labels
:star:code - Fast Globally Optimal and Geometrically Consistent 3D Shape Matching
- Is CLIP ideal No Can we fix it Yes
:star:code - SpiLiFormer Enhancing Spiking Transformers with Lateral Inhibition
:star:code - GroundingSuite Measuring Complex Multi-Granular Pixel Grounding
:star:code - TorchAdapt Towards Light-Agnostic Real-Time Visual Perception
- Edit360 2D Image Edits to 3D Assets from Any Angle
- RANKCLIP Ranking-Consistent Language-Image Pretraining
- Visual Test-time Scaling for GUI Agent Grounding
:star:code - Neural Shell Texture Splatting More Details and Fewer Primitives
- SILO Solving Inverse Problems with Latent Operators
- High-Precision 3D Measurement of Complex Textured Surfaces Using Multiple Filtering Approach
- DALIP Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data
- mmCooper A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework
- Scendi Score Prompt-Aware Diversity Evaluation via Schur Complement of CLIP Embeddings
:star:code - Referring to Any Person
- More Reliable Pseudo-labels Better Performance A Generalized Approach to Single Positive Multi-label Learning
- TopicGeo An Efficient Unified Framework for Geolocation
- PVMamba Parallelizing Vision Mamba via Dynamic State Aggregation
- Staining and Locking Computer Vision Models Without Retraining
- Time-Aware Auto White Balance in Mobile Photography
- Background Invariance Testing According to Semantic Proximity
- ViT-Split Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads
- VITAL More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow
- LATINO-PRO LAtent consisTency INverse sOlver with PRompt Optimization
- Towards a Universal Image Degradation Model via Content-Degradation Disentanglement
- Temperature in Cosine-based Softmax Loss
- GaussianProperty Integrating Physical Properties to 3D Gaussians with LMMs
- Auto-Regressively Generating Multi-View Consistent Images
:star:code - I Am Big You Are Little I Am Right You Are Wrong
- Spatially-Varying Autofocus
- LayerD Decomposing Raster Graphic Designs into Layers
- LangBridge Interpreting Image as a Combination of Language Embeddings
:house:project - LLaVA-CoT Let Vision Language Models Reason Step-by-Step
:star:code - ETCH Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
- FRET Feature Redundancy Elimination for Test Time Adaptation
- PlaneRAS Learning Planar Primitives for 3D Plane Recovery
- Certifiably Optimal Anisotropic Rotation Averaging
- LaCoOT Layer Collapse through Optimal Transport
:star:code - Completing 3D Partial Assemblies with View-Consistent 2D-3D Correspondence
- SteerX Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering
- Hierarchical Material Recognition from Local Appearance
- Mixture-of-Scores Robust Image-Text Data Valuation via Three Lines of Code
- DMesh An Efficient Differentiable Mesh for Complex Shapes
:house:project - Registration beyond Points General Affine Subspace Alignment via Geodesic Distance on Grassmann Manifold
:star:code - Online Language Splatting
- Lyra An Efficient and Speech-Centric Framework for Omni-Cognition
- Geometry Distributions
- Evading Data Provenance in Deep Neural Networks
- CA2C A Prior-Knowledge-Free Approach for Robust Label Noise Learning via Asymmetric Co-learning and Co-training
- StyleKeeper Prevent Content Leakage using Negative Visual Query Guidance
- Faster and Better 3D Splatting via Group Training
:house:project :house:project - Event-based Visual Vibrometry
- Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation
:star:code - VSSD Vision Mamba with Non-Causal State Space Duality
:star:code - Know No Better A Data-Driven Approach for Enhancing Negation Awareness in CLIP
- MAVias Mitigate any Visual Bias
- Generalizable Non-Line-of-Sight Imaging with Learnable Physical Priors
:star:code - Contrastive Flow Matching
:star:code - Trial-Oriented Visual Rearrangement
- FusionPhys A Flexible Framework for Fusing Complementary Sensing Modalities in Remote Physiological Measurement
:star:code - NeuFrameQ Neural Frame Fields for Scalable and Generalizable Anisotropic Quadrangulation
- Principal Components Enable A New Language of Images
- Always Skip Attention
- FREE-Merging Fourier Transform for Efficient Model Merging
- JPEG Processing Neural Operator for Backward-Compatible Coding
:star:code - FlowTok Flowing Seamlessly Across Text and Image Tokens
:star:code - Magic Insert Style-Aware Drag-and-Drop
- PseudoMapTrainer Learning Online Mapping without HD Maps
:star:code - Optical Model-Driven Sharpness Mapping for Autofocus in Small Depth-of-Field and Severe Defocus Scenarios
- RoCo-Sim Enhancing Roadside Collaborative Perception through Foreground Simulation
- Metric Convolutions A Unifying Theory to Adaptive Image Convolutions
- Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
- Transparent Vision A Theory of Hierarchical Invariant Representations
- PRO-VPT Distribution-Adaptive Visual Prompt Tuning via Prompt Relocation
:star:code - Do It Yourself Learning Semantic Correspondence from Pseudo-Labels
- BATCLIP Bimodal Online Test-Time Adaptation for CLIP
:star:code - Align Your Rhythm Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation
- Neuromanifold-Regularized KANs for Shape-fair Feature Representations
:star:code :star:code2 - Free-running vs Synchronous Single-Photon Lidar for High-flux 3D Imaging
- On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations
- DuoLoRA Cycle-consistent and Rank-disentangled Content-Style Personalization
- GFPack Attention-Driven Gradient Fields for Optimizing 2D Irregular Packing
:star:code - ShadowHack Hacking Shadows via Luminance-Color Divide and Conquer
- Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery
- Pi-GPS Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information
- Stochastic Interpolants for Revealing Stylistic Flows across the History of Art
:star:code - Thermal Polarimetric Multi-view Stereo
- Image Intrinsic Scale Assessment Bridging the Gap Between Quality and Resolution
- Relative Illumination Fields Learning Medium and Light Independent Underwater Scenes
- Color Matching Using Hypernetwork-Based Kolmogorov-Arnold Networks
- SITE towards Spatial Intelligence Thorough Evaluation
- FW-Merging Scaling Model Merging with Frank-Wolfe Optimization
- On the Provable Importance of Gradients for Autonomous Language-Assisted Image Clustering
- You Share Beliefs I Adapt Progressive Heterogeneous Collaborative Perception
- C2MIL Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis
:star:code - SAMO A Lightweight Sharpness-Aware Approach for Multi-Task Optimization with Joint Global-Local Perturbation
- ViT-Linearizer Distilling Quadratic Knowledge into Linear-Time Vision Models
- Generalized Deep Multi-view Clustering via Causal Learning with Partially Aligned Cross-view Correspondence
- Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
- Deep Incomplete Multi-view Clustering with Distribution Dual-Consistency Recovery Guidance
- Acknowledging Focus Ambiguity in Visual Questions
- Combinative Matching for Geometric Shape Assembly
Categorized paper collections for 2020:
↘️CVPR-2020-Papers ↘️ECCV-2020-Papers
Categorized paper collections for 2021:
↘️ICCV-2021-Papers ↘️CVPR-2021-Papers
Categorized paper collections for 2022:
↘️CVPR-2022-Papers ↘️WACV-2022-Papers ↘️ECCV-2022-Papers
Categorized paper collections for 2023:
↘️CVPR-2023-Papers ↘️WACV-2023-Papers ↘️ICCV-2023-Papers ↘️2023-CV-Surveys
