ECCV-2024-Papers
December 9, 2024

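Official website: https://eccv.ecva.net/
Main conference :bell:: September 29 (Sunday) to October 4
For the categorized collection of survey papers from past years, see here ↘️CV-Surveys (under construction)
For the categorized 2025 paper collections, see here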
↘️WACV-2025-Papers ↘️CVPR-2025-Papers
For the categorized 2024 paper collections, see here
↘️WACV-2024-Papers ↘️CVPR-2024-Papers ↘️ECCV-2024-Papers
For the categorized 2022 paper collections, see here
For the categorized 2021 paper collections, see here
For the categorized 2020 paper collections, see here
💥💥💥All papers have been categorized
:thumbsup:ECCV 2024 awards announced: Columbia University takes the Best Paper Award
🏆Best Paper Award
🏅Best Paper Honorable Mention
- Rasterized Edge Gradients: Handling Discontinuities Differentiably
- Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models
:house:project
Contents
58.All-in-One
- X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning
:star:code
57.Visual Relationship Detection
- Visual Relationship Transformation
- Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
56.Dense Prediction
- Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild (https://github.com/GitGyun/chameleon)
- Unsupervised Dense Prediction using Differentiable Normalized Cuts
- Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks
- Removing Rows and Columns of Tokens in Vision Transformer enables Faster Dense Prediction without Retraining
:star:code
55.Information Security
- Copyright Protection
- Image Watermarking
- Certifiably Robust Image Watermark
:star:code - A Secure Image Watermarking Framework with Statistical Guarantees via Adversarial Attacks on Secret Key Networks
- Not Just Change the Labels, Learn the Features: Watermarking Deep Neural Networks with Multi-View Data
:star:code - A Watermark-Conditioned Diffusion Model for IP Protection
:star:code - A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability
- LaWa: Using Latent Space for In-Generation Image Watermarking
54.Deepfake Detection
- Real Appearance Modeling for More General Deepfake Detection
- Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities
:star:code - Fake It till You Make It: Curricular Dynamic Forgery Augmentations towards General Deepfake Detection
- Common Sense Reasoning for Deep Fake Detection
:star:code - Image Forgery Detection and Localization
- Document Image Tampering Detection
- Synthetic Image Detection
53.Keypoint Detection
- OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection
:star:code - KeypointDETR: An End-to-End 3D Keypoint Detector
:star:code
52.Visual Entity Recognition
51.Feature Matching
50.Sketches
49.Light-Field
48.Computer Graphics
47.Animal
- Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos
:house:project - Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos
:house:project - Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification
:star:code
46.Rendering
- City-on-Web: Real-time Neural Rendering of Large-scale Scenes on the Web
:star:code
:house:project - A Probability-guided Sampler for Neural Implicit Surface Rendering
:house:project - TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
:house:project - AnyLens: A Generative Diffusion Model with Any Rendering Lens (https://anylens-diffusion.github.io/)
- CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians
:star:code
:house:project - METACAP: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering
:house:project - GAURA: Generalizable Approach for Unified Restoration and Rendering of Arbitrary Views
- MaRINeR: Enhancing Novel Views by Matching Rendered Images with Nearby References
:star:code - Learning Unsigned Distance Functions from Multi-view Images with Volume Rendering Priors
:star:code - CaesarNeRF: Calibrated Semantic Representation for Few-Shot Generalizable Neural Rendering
:house:project - IntrinsicAnything: Learning Diffusion Priors for Inverse Rendering Under Unknown Illumination
:star:code - Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering
:house:project - VersatileGaussian: Real-time Neural Rendering for Versatile Tasks using Gaussian Splatting
- UniVoxel: Fast Inverse Rendering by Unified Voxelization of Scene Representation
:star:code - GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering
:star:code - GMT: Enhancing Generalizable Neural Rendering via Geometry-Driven Multi-Reference Texture Transfer
:star:code - Boost Your NeRF: A Model-Agnostic Mixture of Experts Framework for High Quality and Efficient Rendering
45.Neural Radiance Fields
- Invertible Neural Warp for NeRF
:star:code - VF-NeRF: Viewshed Fields for Rigid NeRF Registration
- NeRF-XL: NeRF at Any Scale with Multi-GPU
:house:project - Regularizing Dynamic Radiance Fields with Kinematic Fields
- KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter
:star:code - Dynamic Neural Radiance Field From Defocused Monocular Video
- Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering
:house:project - Protecting NeRFs' Copyright via Plug-And-Play Watermarking Base Model
:house:project - GeometrySticker: Enabling Ownership Claim of Recolorized Neural Radiance Fields
:star:code
:house:project - Efficient NeRF Optimization - Not All Samples Remain Equally Hard
- MeshFeat: Multi-Resolution Features for Neural Fields on Meshes
:house:project - DecentNeRFs: Decentralized Neural Radiance Fields from Crowdsourced Images
:house:project - TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks
:star:code - BeNeRF: Neural Radiance Fields from a Single Blurry Image and Event Stream
:star:code - TriNeRFLet: A Wavelet Based Multiscale Triplane NeRF Representation
:house:project - RS-NeRF: Neural Radiance Fields from Rolling Shutter Images
:star:code - Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling
:star:code
:house:project - RaFE: Generative Radiance Fields Restoration
:house:project - Few-shot NeRF by Adaptive Rendering Loss Regularization
:star:code - Depth-guided NeRF Training via Earth Mover’s Distance
- DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields
:star:code - Flowed Time of Flight Radiance Fields
- Volumetric Rendering with Baked Quadrature Fields
- BeNeRF: Neural Radiance Fields from a Single Blurry Image and Event Stream
:star:code - Taming Latent Diffusion Model for Neural Radiance Field Inpainting
:house:project - Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation
:house:project
🤗huggingface - SlotLifter: Slot-guided Feature Lifting for Learning Object-Centric Radiance Fields
:house:project - FisherRF: Active View Selection and Mapping with Radiance Fields using Fisher Information
:star:code - DMiT: Deformable Mipmapped Tri-Plane Representation for Dynamic Scenes
- Single-Mask Inpainting for Voxel-based Neural Radiance Fields
- Content-Aware Radiance Fields: Aligning Model Complexity with Scene Intricacy Through Learned Bitwidth Quantization
:star:code - Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering
:house:project - Physically Plausible Color Correction for Neural Radiance Fields
- Leveraging Thermal Modality to Enhance Reconstruction in Low-Light Conditions
- PointNeRF++: A multi-scale, point-based Neural Radiance Field
:house:project - Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields
- High-Fidelity and Transferable NeRF Editing by Frequency Decomposition
:house:project - TriNeRFLet: A Wavelet Based Triplane NeRF Representation
:house:project - Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction
:house:project - G2fR: Frequency Regularization in Grid-based Feature Encoding Neural Radiance Fields
- NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields
:house:project - Novel View Synthesis
- Fast View Synthesis of Casual Videos
:house:project - PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis
:house:project - RING-NeRF: Rethinking Inductive Biases for Versatile and Efficient Neural Fields
- Structured-NeRF: Hierarchical Scene Graph with Neural Representation
- URS-NeRF: Unordered Rolling Shutter Bundle Adjustment for Neural Radiance Fields
- A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis
:star:code
:house:project - High-Resolution and Few-shot View Synthesis from Asymmetric Dual-lens Inputs
:star:code - Distractor-Free Novel View Synthesis via Exploiting Memorization Effect in Optimization
:star:code - NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image
:star:code - FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting
:star:code - Fast View Synthesis of Casual Videos with Soup-of-Planes
:house:project - CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians
:house:project - MegaScenes: Scene-Level View Synthesis at Scale
:star:code - Radiative Gaussian Splatting for Efficient X-ray Novel View Synthesis
:star:code - NGP-RT: Fusing Multi-Level Hash Features with Lightweight Attention for Real-Time Novel View Synthesis
- Efficient Depth-Guided Urban View Synthesis
:star:code - Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis
:star:code - Generalizable Human Gaussians for Sparse View Synthesis
:house:project - Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis
:star:code
44.Dataset/Benchmark
- FYI: Flip Your Images for Dataset Distillation
- Neural Spectral Decomposition for Dataset Distillation
:star:code - Teddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching
:star:code - Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation
:star:code - COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark
- Benchmarks
- MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
:star:code - DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition
:star:code - Urban Waterlogging Detection: A Challenging Benchmark and Large-Small Model Co-Adapter
:star:code - MSD: A Benchmark Dataset for Floor Plan Generation of Building Complexes
- BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events
:house:project
- SIMBA: Split Inference - Mechanisms, Benchmarks and Attacks
:star:code - A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis
:star:code - BAFFLE: A Baseline of Backpropagation-Free Federated Learning
:star:code - Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking
- Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations
:house:project - UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
:house:project - HyTAS: A Hyperspectral Image Transformer Architecture Search Benchmark and Analysis
- OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding
:house:project - PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines
:star:code - Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach
:star:code - R^2-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations
:star:code - m&m’s: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
:star:code
🤗huggingface - PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology
🤗huggingface - LayeredFlow: A Real-World Benchmark for Non-Lambertian Multi-Layer Optical Flow
:house:project - HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects
:star:code - When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset
:star:code
- Datasets
- VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
:star:code - HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
:star:code - OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
- COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark
:sunflower:dataset - Seeing Faces in Things: A Model and Dataset for Pareidolia
:sunflower:dataset - Towards Dual Transparent Liquid Level Estimation in Biomedical Lab: Dataset, Methods and Practice
:sunflower:dataset - GarmentCodeData: A Dataset of 3D Made-to-Measure Garments With Sewing Patterns
:house:project - SemTrack: A Large-scale Dataset for Semantic Tracking in the Wild
:sunflower:dataset - WiMANS: A Benchmark Dataset for WiFi-based Multi-user Activity Sensing
:star:code - BugNIST - a Large Volumetric Dataset for Detection under Domain Shift
- Defect Spectrum: A Granular Look of Large-scale Defect Datasets with Rich Semantics
:star:code
:house:project - Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal
:star:code - PartImageNet++ Dataset: Scaling up Part-based Models for Robust Recognition
:star:code - WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding
:star:code - MMVR: Millimeter-wave Multi-View Radar Dataset and Benchmark for Indoor Perception
- SkyScenes: A Synthetic Dataset for Aerial Scene Understanding
:house:project - Caltech Aerial RGB-Thermal Dataset in the Wild
:star:code - V2X-Real: a Large-Scale Dataset for Vehicle-to-Everything Cooperative Perception
- H-V2X: A Large Scale Highway Dataset for BEV Perception
- PetFace: A Large-Scale Dataset and Benchmark for Animal Identification
:star:code - Long-range Turbulence Mitigation: A Large-scale Dataset and A Coarse-to-fine Framework
- OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects
:star:code - SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark
:star:code
:house:project - Insect Identification in the Wild: The AMI Dataset
:star:code - RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception
:sunflower:dataset
- Data Augmentation
43.Sound
- Audio-Synchronized Visual Animation
:star:code
:house:project - Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation
:house:project - Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
- Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
:star:code - Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
- Self-Supervised Audio-Visual Soundscape Stylization
:house:project - CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
:star:code - Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers’ Opinion Scores
- Siamese Vision Transformers are Scalable Audio-visual Learners
:star:code - Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
:house:project - Audio-visual Generalized Zero-shot Learning the Easy Way
- Audio-Visual Segmentation
42.Optical Flow Estimation
41.Biometrics
40.Object Pose Estimation
- SCAPE: A Simple and Strong Category-Agnostic Pose Estimator
:star:code - SRPose: Two-view Relative Pose Estimation with Sparse Keypoints
:house:project - FAFA: Frequency-Aware Flow-Aided Self-Supervision for Underwater Object Pose Estimation
:star:code - A Graph-Based Approach for Category-Agnostic Pose Estimation
:house:project - GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence
- OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation
:star:code - FoundPose: Unseen Object Pose Estimation with Foundation Features
:house:project - LaPose: Laplacian Mixture Shape Modeling for RGB-Based Category-Level Object Pose Estimation
:star:code - U-COPE: Taking a Further Step to Universal 9D Category-level Object Pose Estimation
- PACE: Pose Annotations in Cluttered Environments
:star:code - 6-DoF
- An Economic Framework for 6-DoF Grasp Detection
:star:code - Pseudo-keypoint RKHS Learning for Self-supervised 6DoF Pose Estimation
- Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance
:star:code - Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation
:star:code - 6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model
:star:code - FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models
:house:project
- Camera Pose Estimation
- Counting
- AFreeCA: Annotation-Free Counting for All
- Zero-shot Object Counting with Good Exemplars
- ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting
:star:code
:house:project - Class-Agnostic Object Counting with Text-to-Image Diffusion Model
- Shifted Autoencoders for Point Annotation Restoration in Object Counting
39.Robots
- See and Think: Embodied Agent in Virtual Environment
:house:project - SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs
- V-IRL: Grounding Virtual Intelligence in Real Life
:star:code - Robotics
- Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation
:house:project - Learning Cross-hand Policies of High-DOF Reaching and Grasping
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control
:star:code - Real-time Holistic Robot Pose Estimation with Unknown States
:star:code - ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation
:star:code
:house:project - Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts
- GraspXL: Generating Grasping Motions for Diverse Objects at Scale
:star:code
:house:project - UGG: Unified Generative Grasping
:house:project - Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation
:star:code - Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation
:house:project
- Navigation
- VPR
- Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition
:star:code - Navigation Instruction Generation with BEV Perception and Large Language Models
:star:code - Revisit Anything: Visual Place Recognition via Image Segment Retrieval
:star:code - VLAD-BuFF: Burst-aware Fast Feature Aggregation for Visual Place Recognition
:star:code - MeshVPR: Citywide Visual Place Recognition Using 3D Meshes
:star:code
- SLAM
- Deep Patch Visual SLAM
:star:code - RGBD GS-ICP SLAM
:star:code - I2-SLAM: Inverting Imaging Process for Robust Photorealistic Dense SLAM
- Hyperion - A fast, versatile symbolic Gaussian Belief Propagation framework for Continuous-Time SLAM
:star:code - SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM
- LRSLAM: Low-rank Representation of Signed Distance Fields in Dense Visual SLAM System
- I-SLAM: Inverting Imaging Process for Robust Photorealistic Dense SLAM
- Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM
- Self-Supervised Underwater Caustics Removal and Descattering via Deep Monocular SLAM
- CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field
:star:code
- Try-On
- Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models
- Improving Virtual Try-On with Garment-focused Diffusion Models
:star:code - Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment
:star:code
:house:project - Improving Diffusion Models for Authentic Virtual Try-on in the Wild
:star:code - D4-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On
:star:code - WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models
:star:code
- Cross-view Geo-localization
- GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers
:star:code - Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network
:star:code - ConGeo: Robust Cross-view Geo-localization across Ground View Variations
:star:code
:house:project - Benchmarking the Robustness of Cross-view Geo-localization Models
- CityGuessr: City-Level Video Geo-Localization on a Global Scale
- Geo-localization
- Avatars
- CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images
:star:code - RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models
:star:code - MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos
:star:code - PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations
:house:project - iHuman: Instant Animatable Digital Humans From Monocular Videos
- PAV: Personalized Head Avatar from Unstructured Video Collection
:house:project - Disentangled Clothed Avatar Generation from Text Descriptions
:house:project - MagicMirror: Fast and High-Quality Avatar Generation with Constrained Search Space
:house:project - 3DFG-PIFu: 3D Feature Grids for Human Digitization from Sparse Views
- FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis
:star:code - Instant 3D Human Avatar Generation using Image Diffusion Models
:house:project - Let the Avatar Talk using Texts without Paired Training Data
- VR
38.Human-Object Interaction
- Controllable Human-Object Interaction Synthesis
:house:project - F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
- Interaction-centric Spatio-Temporal Context Reasoning for Multi-Person Video HOI Recognition
:star:code - Look Hear: Gaze Prediction for Speech-directed Human Attention
:star:code - Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model
:star:code - Revisit Human-Scene Interaction via Space Occupancy
:house:project - Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection
:star:code - AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation
- Hand-Object
- NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model
- Dense Hand-Object(HO) GraspNet with Full Grasping Taxonomy and Dynamics
:star:code - Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?
:star:code - Coarse-to-Fine Implicit Representation Learning for 3D Hand-Object Reconstruction from a Single RGB-D Image
37.Style Transfer
36.Gaze Estimation
- De-confounded Gaze Estimation
- 3DGazeNet: Generalizing Gaze Estimation with Weak Supervision from Synthetic Views
:star:code - LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation
- Gaze Target Detection Based on Head-Local-Global Coordination
35.Action Detection
- LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
:star:code
:house:project - ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos
- Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition
- Motion Keyframe Interpolation for Any Human Skeleton using Point Cloud-based Human Motion Data Homogenisation
- Skeleton-based Action Recognition
- SA-DVAE: Improving Zero-Shot Skeleton-Based Action Recognition by Disentangled Variational Autoencoders
:star:code - Towards Physical World Backdoor Attacks against Skeleton Action Recognition
:house:project - S-JEPA: A Joint Embedding Predictive Architecture for Skeletal Action Recognition
:house:project - Idempotent Unsupervised Representation Learning for Skeleton-Based Action Recognition
:star:code - CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner
- Few-shot Action Recognition
- Temporal Action Detection
- Temporal Action Localization
- HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization
:star:code - Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization
- Online Temporal Action Localization with Memory-Augmented Transformer
:house:project - Stepwise Multi-grained Boundary Detector for Point-supervised Temporal Action Localization
- Temporal Action Segmentation
- Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment
:star:code - Two-Stage Active Learning for Efficient Temporal Action Segmentation
- Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation
:star:code - Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs
:star:code
- Action Quality Assessment
- Semi-Supervised Teacher-Reference-Student Architecture for Action Quality Assessment
:star:code - RICA^2: Rubric-Informed, Calibrated Assessment of Actions
:house:project - Vision-Language Action Knowledge Learning for Semantic-Aware Action Quality Assessment
- MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment
:star:code
- Action Prediction
- Action Recognition
- Referring Atomic Video Action Recognition
:star:code - DEAR: Depth-Enhanced Action Recognition
- Bayesian Evidential Deep Learning for Online Action Detection
- C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition
:star:code - Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
- Classification Matters: Improving Video Action Detection with Class-Specific Attention
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition
:house:project - Context-Aware Action Recognition: Introducing a Comprehensive Dataset for Behavior Contrast
- Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition
:house:project - On the Utility of 3D Hand Poses for Action Recognition
:house:project - POET: Prompt Offset Tuning for Continual Human Action Adaptation
:star:code - Occluded Gait Recognition with Mixture of Experts: An Action Detection Perspective
:star:code - Leveraging temporal contextualization for video action recognition
:star:code - Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition
- SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition
:house:project
- Action Understanding
- Group Activity Recognition
- Epileptic Seizure Detection
34.Visual Question Answering
- DriveLM: Driving with Graph Visual Question Answering
:star:code - Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
- WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
:star:code - GRACE: Graph-Based Contextual Debiasing for Fair Visual Question Answering
- Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge
:star:code - Compositional Substitutivity of Visual Reasoning for Visual Question Answering
:star:code - Fully Authentic Visual Question Answering Dataset from Online Communities
:house:project - An Explainable Vision Question Answer Model via Diffusion Chain-of-Thought
- Audio-Video Question Answering
- Video Question Answering
- Video Question Answering with Procedural Programs
:house:project - ViLA: Efficient Video-Language Alignment for Video Question Answering
:star:code - TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering via Bi-directional Reasoning
- AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
:star:code
- Audio-Visual Question Answering
33.Motion Generation
- Event-Based Motion Magnification
:star:code - Learning-based Axial Video Motion Magnification
:house:project - SMooDi: Stylized Motion Diffusion Model
:star:code - Length-Aware Motion Synthesis via Latent Diffusion
:star:code - HUMOS: Human Motion Model Conditioned on Body Shape
:star:code - HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance
:star:code - Generating Physically Realistic and Directable Human Motions from Multi-Modal Inputs
:house:project - Generating Human Interaction Motions in Scenes with Text Control
:house:project - Motion Mamba: Efficient and Long Sequence Motion Generation
:star:code
:house:project - Large Motion Model for Unified Multi-Modal Motion Generation
:house:project - EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation
:star:code
:house:project - Bridging the Gap Between Human Motion and Action Semantics via Kinematics Phrases
:house:project - TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos
:house:project - Nymeria: A Massive Collection of Egocentric Multi-modal Human Motion in the Wild
- FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
- MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model
:star:code - Realistic Human Motion Generation with Cross-Diffusion Models
:house:project - CoMo: Controllable Motion Generation through Language Guided Pose Code Editing
:house:project - TLControl: Trajectory and Language Control for Human Motion Synthesis
:house:project - Retrieval Robust to Object Motion Blur
:star:[code](https://github.com/Rong-Zou/Retrieval-Robust-to-Object-Motion-Blur) - 3D Human Motion Synthesis
- Text-to-Motion Synthesis
- FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis
- Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation
:star:code - Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation
:house:project - ParCo: Part-Coordinating Text-to-Motion Synthesis
:star:code
- Human Motion Prediction
- Human Motion Estimation
- Motion Estimation
- Dance Generation
- Behavior Generation
- Motion Transfer
- Motion Prediction
32.Person Re-Identification
- Human-in-the-Loop Visual Re-ID for Population Size Estimation
:star:code - Pedestrian Re-Identification
- Keypoint Promptable Re-Identification
:star:code - Privacy-Preserving Adaptive Re-Identification without Image Transfer
- Rethinking Normalization Layers for Domain Generalizable Person Re-identification
:star:code - Domain Shifting: A Generalized Solution for Heterogeneous Cross-Modality Person Re-Identification
- VI-ReID
- Person Search
- Gait Recognition
- Counting
31.Point Clouds
- SEED: A Simple and Effective 3D DETR in Point Clouds
:star:code - PointLLM: Empowering Large Language Models to Understand Point Clouds
:star:code
:house:project - TransCAD: A Hierarchical Transformer for CAD Sequence Inference from Point Clouds
- Learning to Adapt SAM for Segmenting Cross-domain Point Clouds
- Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
- milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing
:star:code - Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement
- Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes
:star:code - T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning
:star:code - Progressive Classifier and Feature Extractor Adaptation for Unsupervised Domain Adaptation on Point Clouds
:star:code - PFGS: High Fidelity Point Cloud Rendering via Feature Splatting
:star:code - Masked Motion Prediction with Semantic Contrast for Point Cloud Sequence Learning
:star:code - To Supervise or Not to Supervise: Understanding and Addressing the Key Challenges of Point Cloud Transfer Learning
- Relightable 3D Gaussians: Realistic Point Cloud Relighting with BRDF Decomposition and Ray Tracing
:star:code - FastPCI: Motion-Structure Guided Fast Point Cloud Frame Interpolation
:star:code - Point cloud generation
- RangeLDM: Fast Realistic LiDAR Point Cloud Generation
:star:code - Text2LiDAR: Text-guided LiDAR Point Clouds Generation via Equirectangular Transformer
:star:code - Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation
:house:project - FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation
:house:project
- Point cloud completion
- Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion
:star:code - T-CorresNet: Template Guided 3D Point Cloud Completion with Correspondence Pooling Query Generation Strategy
:star:code - AEDNet: Adaptive Embedding and Multiview-Aware Disentanglement for Point Cloud Completion
- EINet: Point Cloud Completion via Extrapolation and Interpolation
:star:code - Syn-to-Real Domain Adaptation for Point Cloud Completion via Part-based Approach
:star:code - ProtoComp: Diverse Point Cloud Completion with Controllable Prototype
:star:code
- Point cloud reconstruction
- Point cloud understanding
- Point cloud registration
- ML-SemReg: Boosting Point Cloud Registration with Multi-level Semantic Consistency
:star:code - PointRegGPT: Boosting 3D Point Cloud Registration using Generative Point-Cloud Pairs for Training
:star:code - SemReg: Semantics Constrained Point Cloud Registration
:star:code - Correspondence-Free SE(3) Point Cloud Registration in RKHS via Unsupervised Equivariant Learning
:house:project - UMERegRobust – Universal Manifold Embedding Compatible Features for Robust Point Cloud Registration
:star:code - PARE-Net: Position-Aware Rotation-Equivariant Networks for Robust Point Cloud Registration
:star:code - Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration (point cloud registration)
- Point cloud segmentation
- Dual-level Adaptive Self-Labeling for Novel Class Discovery in Point Cloud Segmentation
- HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation
:star:code - SegPoint: Segment Any Point Cloud via Large Language Model
:star:code - Localization and Expansion: A Decoupled Framework for Point Cloud Few-shot Semantic Segmentation
- Pseudo-Embedding for Generalized Few-Shot Point Cloud Segmentation
:star:code - Subspace Prototype Guidance for Mitigating Class Imbalance in Point Cloud Semantic Segmentation
:star:code
- Point cloud understanding
- 3D point clouds
- Implicit Filtering for Learning Neural Signed Distance Functions from 3D Point Clouds
:star:code - CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation
:star:code - FLAT: Flux-aware Imperceptible Adversarial Attacks on 3D Point Clouds
- RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation
- P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising
:star:code - Heterogeneous Graph Learning for Scene Graph Prediction in 3D Point Clouds
- Hiding Imperceptible Noise in Curvature-Aware Patches for 3D Point Cloud Attack (3D point cloud attack)
- Continuous SO(3) Equivariant Convolution for 3D Point Cloud Analysis
:star:code - Frugal 3D Point Cloud Model Training via Progressive Near Point Filtering and Fused Aggregation
30.Anomaly Detection
- Continuous Memory Representation for Anomaly Detection
:star:code - Dissolving Is Amplifying: Towards Fine-Grained Anomaly Detection
:star:code - Few-Shot Anomaly-Driven Generation for Anomaly Classification and Segmentation
:star:code - GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features
:star:code - Learning Diffusion Models for Multi-View Anomaly Detection
- Hierarchical Gaussian Mixture Normalizing Flow Modeling for Unified Anomaly Detection
:star:code - TransFusion -- A Transparency-Based Diffusion Model for Anomaly Detection
:star:code - Unsupervised, Online and On-The-Fly Anomaly Detection For Non-Stationary Image Distributions
:star:code - MoEAD: A Parameter-efficient Model for Multi-class Anomaly Detection
:star:code - Defect detection
- Fault detection
- 3D anomaly detection
- Industrial anomaly detection
- Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection
- A Unified Anomaly Synthesis Strategy with Gradient Ascent for Industrial Anomaly Detection and Localization
:star:code - GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection
:star:code - AD3: Introducing a score for Anomaly Detection Dataset Difficulty assessment using VIADUCT dataset
- Learning to Detect Multi-class Anomalies with Just One Normal Image Prompt
- Zero-shot anomaly detection
- Multi-class anomaly detection
- OOD
- Gradient-Regularized Out-of-Distribution Detection
- SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning
- PixOOD: Pixel-Level Out-of-Distribution Detection
:star:code - An Information Theoretical View for Out-Of-Distribution Detection
- Learning Non-Linear Invariants for Unsupervised Out-of-Distribution Detection
- LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models
:star:code - ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection
:star:code - Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond
:star:code - Can Your Generative Model Detect Out-of-Distribution Covariate Shift?
- Gradient-based Out-of-Distribution Detection
- Vision-Language Dual-Pattern Matching for Out-of-Distribution Detection
- TAG: Text Prompt Augmentation for Zero-Shot Out-of-Distribution Detection
:star:code
- Outlier detection
- Zero-shot anomaly segmentation
29.Semi/self-supervised learning
- SweepNet: Unsupervised Learning Shape Abstraction via Neural Sweepers
:house:project - Region-aware Distribution Contrast: A Novel Approach to Multi-Task Partially Supervised Learning
:star:code - Self-supervised
- CroMo-Mixup: Augmenting Cross-Model Representations for Continual Self-Supervised Learning
:star:code - HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion
:star:code - SCPNet: Unsupervised Cross-modal Homography Estimation via Intra-modal Self-supervised Learning
:star:code - Efficient Unsupervised Visual Representation Learning with Explicit Cluster Balancing
- OmniSat: Self-Supervised Modality Fusion for Earth Observation
:star:code
:house:project
:sunflower:dataset - FroSSL: Frobenius Norm Minimization for Efficient Multiview Self-Supervised Learning
- Self-supervised visual learning from interactions with objects
- Exploiting Supervised Poison Vulnerability to Strengthen Self-Supervised Defense
- GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning
:star:code - On Pretraining Data Diversity for Self-Supervised Learning
:star:code - Decoupling Common and Unique Representations for Multimodal Self-supervised Learning
:star:code - POA: Pre-training Once for Models of All Sizes
:star:code - ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders (self-supervised representation learning)
- Pose-Aware Self-Supervised Learning with Viewpoint Trajectory Regularization
:house:project (self-supervised learning) - SSL-Cleanse: Trojan Detection and Mitigation in Self-Supervised Learning
:star:code
- Semi-supervised
- Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning
- Improving 3D Semi-supervised Learning by Effectively Utilizing All Unlabelled Data
:star:code - SCOMatch: Alleviating Overtrusting in Open-set Semi-supervised Learning
:star:code - ExMatch: Self-guided Exploitation for Semi-Supervised Learning with Scarce Labeled Samples
- Rebalancing Using Estimated Class Distribution for Imbalanced Semi-Supervised Learning under Class Distribution Mismatch (semi-supervised learning)
- Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning
:star:code - Flexible Distribution Alignment: Towards Long-tailed Semi-supervised Learning with Proper Calibration
:star:code
28.Novel Class Discovery
27.GNN/GCN
- GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition
:star:code (GNN) - Graph Neural Network Causal Explanation via Neural Causal Models
:star:code - On the Topology Awareness and Generalization Performance of Graph Neural Networks
- Causal Subgraphs and Information Bottlenecks: Redefining OOD Robustness in Graph Neural Networks
26.NAS
- Auto-GAS: Automated Proxy Discovery for Training-free Generative Architecture Search
:star:code - Auto-DAS: Automated Proxy Discovery for Training-free Distillation-aware Architecture Search
:star:code (distillation-aware) - SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference
- Dependency-aware Differentiable Neural Architecture Search
25.MC/KD/Pruning(Model Compression/Knowledge Distillation/Pruning)
- DεpS: Delayed ε-Shrinking for Faster Once-For-All Training
- Model compression
- Pruning
- Non-transferable Pruning
- Straightforward Layer-wise Pruning for More Efficient Visual Adaptation
- Isomorphic Pruning for Vision Models
:star:code - LPViT: Low-Power Semi-structured Pruning for Vision Transformers
- PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference
:star:code (pruning) - Enhanced Sparsification via Stimulative Training
:star:code - SNP: Structured Neuron-level Pruning to Preserve Attention Scores
:star:code
- Quantization
- GenQ: Quantization in Low Data Regimes with Generative Synthetic Data
:star:code - MetaAug: Meta-Data Augmentation for Post-Training Quantization
- Toward INT4 Fixed-Point Training via Exploring Quantization Error for Gradients
- CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs
:star:code - AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer
:star:code - POCA: Post-training Quantization with Temporal Alignment for Codec Avatars
:house:project (quantization)
- KD
- Simple Unsupervised Knowledge Distillation With Space Similarity (knowledge distillation)
- Direct Distillation between Different Domains (KD)
- Harmonizing knowledge Transfer in Neural Network with Unified Distillation
- Good Teachers Explain: Explanation-Enhanced Knowledge Distillation
- The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers
- Improving Knowledge Distillation via Regularizing Feature Direction and Norm
- Adversarially Robust Distillation by Reducing the Student-Teacher Variance Gap (distillation)
- Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation
:star:code - UNIKD: UNcertainty-Filtered Incremental Knowledge Distillation for Neural Implicit Representation
:star:code - BKDSNN: Enhancing the Performance of Learning-based Spiking Neural Networks Training with Blurred Knowledge Distillation
- Nickel and Diming Your GAN: A Dual-Method Approach to Enhancing GAN Efficiency via Knowledge Distillation
- How to Train the Teacher Model for Effective Knowledge Distillation
- Markov Knowledge Distillation: Make Nasty Teachers trained by Self-undermining Knowledge Distillation Fully Distillable
24.Vision Transformer
- Spline-based Transformers
- Denoising Vision Transformers
- FairViT: Fair Vision Transformer via Adaptive Masking
- Rotary Position Embedding for Vision Transformer
:star:code - Bidirectional Progressive Transformer for Interaction Intention Anticipation
- Robustness Tokens: Towards Adversarial Robustness of Transformers
- SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization
:star:code - PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
- OAT: Object-Level Attention Transformer for Gaze Scanpath Prediction
:star:code - AugDETR: Improving Multi-scale Learning for Detection Transformer (Transformer)
- AttnZero: Efficient Attention Discovery for Vision Transformers
:star:code - SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding
:star:code - Efficient Vision Transformers with Partial Attention
- SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
:star:code - Stitched ViTs are Flexible Vision Backbones
:star:code - Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning
- Uncertainty-Driven Spectral Compressive Imaging with Spatial-Frequency Transformer
:star:code - GiT: Towards Generalist Vision Transformer through Universal Language Interface
:star:code - An Optimal Control View of LoRA and Binary Controller Design for Vision Transformers
- Fairness-aware Vision Transformer via Debiased Self-Attention
:star:code - ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention
:star:code - LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors
:house:project - Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach
:house:project - LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer
:star:code - Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators
:star:code - BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos
:star:code - An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
:star:code
23.Machine Learning
- Learning to Unlearn for Robust Machine Unlearning
- Is Retain Set All You Need in Machine Unlearning? Restoring Performance of Unlearned Models with Out-Of-Distribution Images
:star:code (machine learning) - Machine unlearning
- Adversarial
- Improving Adversarial Transferability via Model Alignment
:star:code - Event Trojan: Asynchronous Event-based Backdoor Attacks
:star:code - Data Poisoning Quantization Backdoor Attack
- Flatness-aware Sequential Learning Generates Resilient Backdoors
- WBP: Training-time Backdoor Attacks through Hardware-based Weight Bit Poisoning
:star:code - Cocktail Universal Adversarial Attack on Deep Neural Networks
- TrojVLM: Backdoor Attack Against Vision Language Models
- CatchBackdoor: Backdoor Detection via Critical Trojan Neural Path Fuzzing
- Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks
:star:code - Self-Supervised Representation Learning for Adversarial Attack Detection
- Prediction Exposes Your Face: Black-box Model Inversion via Prediction Alignment
- CLIP-Guided Networks for Transferable Targeted Attacks
- CLIP-Guided Generative Networks for Transferable Targeted Adversarial Attacks
- Exploring Vulnerabilities in Spiking Neural Networks: Direct Adversarial Attacks on Raw Event Data
- UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening
:star:code - Inter-Class Topology Alignment for Efficient Black-Box Substitute Attacks (black-box)
- Any Target Can be Offense: Adversarial Example Generation via Generalized Latent Infection
:star:code - AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models
:star:code - Enhancing Tracking Robustness with Auxiliary Adversarial Defense Networks
- DIFFender: Diffusion-Based Adversarial Defense against Patch Attacks
:star:code
- Continual learning
- CLEO: Continual Learning of Evolving Ontologies
- One-stage Prompt-based Continual Learning
- Exemplar-free Continual Representation Learning via Learnable Drift Compensation
:star:code - Diffusion-Driven Data Replay: A Novel Approach to Combat Forgetting in Federated Class Continual Learning
:star:code - Semantic Residual Prompts for Continual Learning
:star:code - Pick-a-back: Selective Device-to-Device Knowledge Transfer in Federated Continual Learning
:star:code - RCS-Prompt: Learning Prompt to Rearrange Class Space for Prompt-based Continual Learning
:star:code - PromptFusion: Decoupling Stability and Plasticity for Continual Learning
:star:code - Information Bottleneck Based Data Correction in Continual Learning
- Revisiting Supervision for Continual Representation Learning
:star:code (continual learning) - Anytime Continual Learning for Open Vocabulary Classification
:star:code - MagMax: Leveraging Model Merging for Seamless Continual Learning
- Beyond Prompt Learning: Continual Adapter for Efficient Rehearsal-Free Continual Learning
- Transfer learning
- Active learning
- Dataset Quantization with Active Learning based Adaptive Sampling
- Generalized Coverage for More Robust Low-Budget Active Learning
- Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding (active learning)
- Bidirectional Uncertainty-Based Active Learning for Open-Set Annotation (active learning)
- Exploring Active Learning in Meta-Learning: Enhancing Context Set Labeling
- Reinforcement learning
- Reinforcement Learning Meets Visual Odometry
- Large-scale Reinforcement Learning for Diffusion Models
- Reinforcement Learning via Auxiliary Task Distillation
- Reinforcement Learning Friendly Vision-Language Model for Minecraft
:star:code - Multimodal Label Relevance Ranking via Reinforcement Learning
:star:code - Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
:star:code - Diffusion Models as Optimizers for Efficient Planning in Offline RL
:star:code - Unified Local-Cloud Decision-Making via Reinforcement Learning
:house:project (reinforcement learning)
- Federated learning
- Towards Multi-modal Transformers in Federated Learning
:star:code - FedHide: Federated Learning by Hiding in the Neighbors
- FedHARM: Harmonizing Model Architectural Diversity in Federated Learning
:star:code - FedTSA: A Cluster-based Two-Stage Aggregation Method for Model-heterogeneous Federated Learning
- Unlocking the Potential of Federated Learning: The Symphony of Dataset Distillation via Deep Generative Latents
:star:code - PFedEdit: Personalized Federated Learning via Automated Model Editing
:star:code - Overcome Modal Bias in Multi-modal Federated Learning via Balanced Modality Selection
- Fisher Calibration for Backdoor-Robust Heterogeneous Federated Learning
:star:code - Federated Learning with Local Openset Noisy Labels
:star:code - SkyMask: Attack-agnostic Robust Federated Learning with Fine-grained Learnable Masks
:star:code
- Contrastive learning
- FlowCon: Out-of-Distribution Detection using Flow-based Contrastive Learning
:star:code - Improving Medical Multi-modal Contrastive Learning with Expert Annotations
- Contrastive Learning with Synthetic Positives (contrastive learning)
- Understanding and Mitigating Human-Labelling Errors in Supervised Contrastive Learning
- Adaptive Multi-head Contrastive Learning
- CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts
:star:code (contrastive learning) - Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning
:star:code
- Class-incremental
- Rethinking Few-shot Class-incremental Learning: Learning from Yourself
:star:code - Few-shot Class Incremental Learning with Attention-Aware Self-Adaptive Prompt
- Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion
:star:code - Versatile Incremental Learning: Towards Class and Domain-Agnostic Incremental Learning
:star:code - Confidence Self-Calibration for Multi-Label Class-Incremental Learning
:star:[code](https://github.com/Kaile-Du/CSC) - Canonical Shape Projection is All You Need for 3D Few-shot Class Incremental Learning
:star:code - Personalized Federated Domain-Incremental Learning based on Adaptive Knowledge Matching
- PILoRA: Prototype Guided Incremental LoRA for Federated Class-Incremental Learning
:star:code - CLOSER: Towards Better Representation Learning for Few-Shot Class-Incremental Learning
:star:code - Non-Exemplar Domain Incremental Learning via Cross-Domain Concept Integration
:star:code - On the Approximation Risk of Few-Shot Class-Incremental Learning
:star:code - iNeMo: Incremental Neural Mesh Models for Robust Class-Incremental Learning
:star:code - DiffClass: Diffusion-Based Class Incremental Learning
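- In-context learning
- Multi-task learning
- Multiple-instance learning
- Multimodal learning
22.Few/Zero-Shot Learning/DG/A(Few/Zero-Shot/Domain Generalization/Domain Adaptation)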
- Source-Free Domain-Invariant Performance Prediction
- The Devil is in the Few Shots: Iterative Visual Knowledge Completion for Few-shot Learning
:star:code - DG
- Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision
:star:code - Feature Diversification and Adaptation for Federated Domain Generalization
- Soft Prompt Generation for Domain Generalization
:star:code - Integrating Markov Blanket Discovery into Causal Representation Learning for Domain Generalization
- Rethinking LiDAR Domain Generalization: Single Source as Multiple Density Domains
:star:code - Improving Zero-Shot Generalization for CLIP with Variational Adapter
- Representation Enhancement-Stabilization: Reducing Bias-Variance of Domain Generalization
:star:code - Local and Global Flatness for Federated Domain Generalization
:star:code - Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization
- Disentangling Masked Autoencoders for Unsupervised Domain Generalization
:star:code
- DA
- Training-Free Model Merging for Multi-target Domain Adaptation
:star:code - MC-PanDA: Mask Confidence for Panoptic Domain Adaptation
:star:code - Is user feedback always informative? Retrieval Latent Defending for Semi-Supervised Domain Adaptation without Source Data
:house:project - De-Confusing Pseudo-Labels in Source-Free Domain Adaptation
- Open-set Domain Adaptation via Joint Error based Multi-class Positive and Unlabeled Learning
- Robust Nearest Neighbors for Source-Free Domain Adaptation under Class Distribution Shift
:star:code - HVCLIP: High-dimensional Vector in CLIP for Unsupervised Domain Adaptation
- Hierarchical Unsupervised Relation Distillation for Source Free Domain Adaptation
- Learn from the Learnt: Source-Free Active Domain Adaptation via Contrastive Sampling and Visual Persistence
:star:code - Improving Unsupervised Domain Adaptation: A Pseudo-Candidate Set Approach
- UDA-Bench: Revisiting Common Assumptions in Unsupervised Domain Adaptation Using a Standardized Framework
:star:code - Forget More to Learn More: Domain-specific Feature Unlearning for Semi-supervised and Unsupervised Domain Adaptation
- Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation
- Get Your Embedding Space in Order: Domain-Adaptive Regression for Forest Monitoring
:house:project - CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning
:star:code - COD: Learning Conditional Invariant Representation for Domain Adaptation Regression
- Plug and Play: A Representation Enhanced Domain Adapter for Collaborative Perception
:star:code - DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception
- Zero-shot
21.Vision-Language
- Sapiens: Foundation for Human Vision Models
- Conceptual Codebook Learning for Vision-Language Models
- DEAL: Disentangle and Localize Concept-level Explanations for VLMs
- FlexAttention for Efficient High-Resolution Vision-Language Models
:house:project - QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
- Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models
- REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
:house:project - Octopus: Embodied Vision-Language Programmer from Environmental Feedback
:house:project - GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
- Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning
:star:code - Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory
:star:code - Cascade Prompt Learning for Vision-Language Model Adaptation
:star:code - The Hard Positive Truth about Vision-Language Compositionality
- Improving 2D Feature Representations by 3D-Aware Fine-Tuning
:star:code - Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Models
:star:code - Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
- ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
:star:code - FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
:house:project - Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models
:star:code - GalLoP: Learning Global and Local Prompts for Vision-Language Models
- Quantized Prompt for Efficient Generalization of Vision-Language Models
:star:code - AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization
:star:code
:thumbsup:AddressCLIP: street-level localization from a single image, an end-to-end large model for image geo-localization - SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
- Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
:house:project - Cascade Prompt Learning for Visual-Language Model Adaptation
:star:code - Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models
:house:project - Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model
- Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks
- Take A Step Back: Rethinking the Two Stages in Visual Reasoning
:star:code - HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
:star:code (visual reasoning) - Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models
:star:code - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
:star:code - Improving Vision and Language Concepts Understanding with Multimodal Counterfactual Samples
:star:code - Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models
:star:code - SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models
:star:code - Robust Calibration of Large Vision-Language Adapters
:star:code - BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models
:star:code - CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
- MyVLM: Personalizing VLMs for User-Specific Queries
:star:code - BRAVE: Broadening the visual encoding of vision-language models
:house:project - IVTP: Instruction-guided Visual Token Pruning for Large Vision-Language Models
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
:star:code
:house:project - The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?
:star:code - Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
:star:code - Adapt without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models
:star:code - Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
- uCAP: An Unsupervised Prompting Method for Vision-Language Models
- Training A Small Emotional Vision Language Model for Visual Art Comprehension
:star:code - Understanding Multi-compositional learning in Vision and Language models via Category Theory
:star:code - Adversarial Prompt Tuning for Vision-Language Models
:star:code - Language-Image Pre-training with Long Captions
:star:code - CoReS: Orchestrating the Dance of Reasoning and Segmentation
:star:code
:house:project - Attention Prompting on Image for Large Vision-Language Models
:star:code - SILC: Improving Vision Language Pretraining with Self-Distillation
- SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
:star:code - AdaGlimpse: Active Visual Exploration with Arbitrary Glimpse Position and Scale
:star:code - Video-Language
- VLN
- LLM
- BLINK: Multimodal Large Language Models Can See but Not Perceive
:house:project - Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
- X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
:star:code
:house:project - X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
- Instruction Tuning-free Visual Token Complement for Multimodal LLMs
- Merlin: Empowering Multimodal LLMs with Foresight Minds
:house:project - Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
:house:project - MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
:star:code - UniCode: Learning a Unified Codebook for Multimodal Large Language Models
- When Do We Not Need Larger Vision Models?
:star:code - ControlLLM: Augment Language Models with Tools by Searching on Graphs
:star:code - Towards Open-Ended Visual Recognition with Large Language Models
:star:code - SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models
:star:code - ST-LLM: Large Language Models Are Effective Temporal Learners
:star:code - Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions
:house:project - How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs
:star:code - BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
:star:code - MoAI: Mixture of All Intelligence for Large Language and Vision Models
:star:code
🤗huggingface - Paying More Attention to Images: A Training-Free Method for Alleviating Hallucination in LVLMs
:house:project - LLaVA-UHD: an LMM Perceiving any Aspect Ratio and High-Resolution Images
:star:code - Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
:star:code - LLMGA: Multimodal Large Language Model based Generation Assistant
:star:code
:house:project - Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
:star:code - LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
:star:code - LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model
:house:project - Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
:star:code - ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
:star:code - Making Large Language Models Better Planners with Reasoning-Decision Alignment
- Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities
- Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
:house:project - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting
:star:code - DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
:star:code - GENIXER: Empowering Multimodal Large Language Models as a Powerful Data Generator
:star:code - Elysium: Exploring Object-level Perception in Videos through Semantic Integration Using MLLMs
:star:code
- BLINK: Multimodal Large Language Models Can See but Not Perceive
- Visual Localization
- Visual Grounding
- Visual Intention Understanding
- Referring Expression Comprehension
- Vision-Language Understanding
20.Scene
- LatentEditor: Text Driven Local Editing of 3D Scenes
:house:project - RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting
:star:code (indoor scenes) - Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation
:star:code - Compact 3D Scene Representation via Self-Organizing Gaussian Grids
:star:code - CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting
:house:project - Scene Synthesis
- Pyramid Diffusion for Fine 3D Large Scene Generation
:star:code
:house:project
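:thumbsup:Southwest Jiaotong University, the University of Leeds, and others jointly propose the Pyramid Discrete Diffusion model (PDD), which realizes a coarse-to-fine strategy for 3D outdoor scene generation - External Knowledge Enhanced 3D Scene Generation from Sketch (3D scene generation)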
- SceneTeller: Language-to-3D Scene Generation
:star:code - Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis
- Gaussian Grouping: Segment and Edit Anything in 3D Scenes
:star:code - EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion
:star:code - AnyHome: Open-Vocabulary Large-Scale Indoor Scene Generation with First-Person View Exploration (indoor scene generation)
- BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion
:house:project - The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation
:star:code - Language-Driven Physics-Based Scene Synthesis and Editing via Feature Splatting
:house:project (scene synthesis and editing) - WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation
:star:code (driving scene generation)
- Scene Understanding
- N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields
- Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding
- SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
:house:project - Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data
:star:code - nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding
- R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding
:house:project - Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation
:star:code - Agent3D-Zero: An Agent for Zero-shot 3D Understanding
:house:project - MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
:star:code (dense scene understanding)
- Semantic Scene Completion
- Scene Graph Generation
- OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
:star:code - Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation
:star:code - Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction
:star:code - Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
:star:code
:thumbsup:Expanding the boundaries of scene graph generation: OvSGTR achieves fully open-vocabulary scene graph generation - A Fair Ranking and New Model for Panoptic Scene Graph Generation
:house:project - Multi-Granularity Sparse Relationship Matrix Prediction Network for End-to-End Scene Graph Generation
:star:code - Towards Scene Graph Anticipation
:star:code
19.UAV/Remote Sensing/Satellite Image
- Masked Angle-Aware Autoencoder for Remote Sensing Images
:star:code - Radiance Field Learners As UAV First-Person Viewers
- Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views
:house:project (satellite views) - Probabilistic Image-Driven Traffic Modeling via Remote Sensing
- UAV First-Person Viewers Are Radiance Field Learners
:house:project - MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection
:star:code - Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
- Free-Viewpoint Video of Outdoor Sports Using a Drone
- Learning Representations of Satellite Images From Metadata Supervision(https://github.com/preligens-lab/satmip) (satellite imagery)
- Multi-scale Cross Distillation for Object Detection in Aerial Images
- LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model
:star:code - PDT Uav Target Detection Dataset for Pests and Diseases Tree
:star:code - Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery
🤗huggingface (remote sensing) - Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning
:star:code
18.Automated Driving
- Online Vectorized HD Map Construction using Geometry
:star:code - MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty
:star:code - HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras
:star:code - Continuity Preserving Online CenterLine Graph Learning
- Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention
:star:code - RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception
:star:code - MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation
- Generative End-to-End Autonomous Driving
:star:code - CARB-Net: Camera-Assisted Radar-Based Network for Vulnerable Road User Detection
:star:code (driving) - FipTR: A Simple yet Effective Transformer Framework for Future Instance Prediction in Autonomous Driving
:star:code - Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks
:star:code (driving) - Gated Temporal Diffusion for Stochastic Long-Term Dense Anticipation
:star:code - CarFormer: Self-Driving with Learned Object-Centric Representations
:star:code - Image-to-Lidar Relational Distillation for Autonomous Driving Data
- Modelling Competitive Behaviors in Autonomous Driving Under Generative World Model
:star:code - LingoQA: Video Question Answering for Autonomous Driving
:star:code - PreSight: Enhancing Autonomous Vehicle Perception with City-Scale NeRF Priors
:star:code - VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving
- TCLC-GS: Tightly Coupled LiDAR-Camera Gaussian Splatting for Autonomous Driving
- Learning to Drive via Asymmetric Self-Play
:house:project - Embodied Understanding of Driving Scenarios
:star:code - Early Anticipation of Driving Maneuvers
:house:project - RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios
:house:project - Event-Aided Time-To-Collision Estimation for Autonomous Driving
:house:project - Dolphins: Multimodal Language Model for Driving
:house:project - PPAD: Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving
:star:code - Asynchronous Large Language Model Enhanced Planner for Autonomous Driving
:star:code - Neural Volumetric World Models for Autonomous Driving
- SeFlow: A Self-Supervised Scene Flow Method in Autonomous Driving
:star:code - Random Walk on Pixel Manifolds for Anomaly Segmentation of Complex Driving Scenes
:star:code (autonomous driving) - SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic
- I Can't Believe It's Not Scene Flow!
:star:code (scene flow) - Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries
:house:project (traffic) - UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving
:star:code - DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving
:house:project - Think2Drive: Efficient Reinforcement Learning by Thinking with Latent World Model for Autonomous Driving (in CARLA-v2)
:house:project - Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
:star:code - Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving
- Lane Graph as Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction
:star:code - Beyond the Data Imbalance: Employing the Heterogeneous Datasets for Vehicle Maneuver Prediction
:star:code - Trajectory Prediction
- Learning Semantic Latent Directions for Accurate and Controllable Human Motion Prediction
:star:code - NeRMo: Learning Implicit Neural Representations for 3D Human Motion Prediction
- CoMusion: Towards Consistent Stochastic Human Motion Prediction via Motion Diffusion (human motion prediction)
- Risk-Aware Self-Consistent Imitation Learning for Trajectory Planning in Autonomous Driving
- Progressive Pretext Task Learning for Human Trajectory Prediction
:star:code - DySeT: a Dynamic Masked Self-distillation Approach for Robust Trajectory Prediction
- VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions
:star:code - Optimizing Diffusion Models for Joint Trajectory Prediction and Controllable Generation
:star:code - Adaptive Human Trajectory Prediction via Latent Corridors
:house:project - NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving
:star:code - MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction
:star:code - Local Occupancy-Enhanced Object Grasping with Multiple Triplanar Projection
- Vehicle Trajectory Prediction
- Occupancy Prediction
- VEON: Vocabulary-Enhanced Occupancy Prediction
- OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving
:star:code - OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
:house:project - Fully Sparse 3D Occupancy Prediction
:star:code - Monocular Occupancy Prediction for Scalable Indoor Scenes
:star:code - ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers
:star:code - CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction
:star:code - GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction
:star:code (3D semantic occupancy prediction)
- Lane Detection
- Vehicle Surveillance
17.Video
- Stable Video Portraits
:house:project - Text-Guided Video Masked Autoencoder
- Multi-Modal Video Dialog State Tracking in the Wild
- Training-free Video Temporal Grounding using Large-scale Pre-trained Models
:star:code - Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment
- E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation
:star:code - Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective
- Fast Encoding and Decoding for Implicit Video Representation
:star:code
:house:project - DEVIAS: Learning Disentangled Video Representations of Action and Scene
:star:code - VideoStudio: Generating Consistent-Content and Multi-Scene Videos
:house:project - VAD
- Cross-Domain Learning for Video Anomaly Detection with Limited Supervision
- Learning Anomalies with Normality Prior for Unsupervised Video Anomaly Detection
:star:code - Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models
:star:code - FedVAD: Enhancing Federated Video Anomaly Detection with GPT-Driven Semantic Distillation
:star:code - Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection
:star:code (video anomaly detection)
- Video Summarization
- Video Understanding
- VideoMamba: Spatio-Temporal Selective State Space Model
:star:code - VideoMamba: State Space Model for Efficient Video Understanding
:star:code - Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
:star:code - Learning Video Context as Interleaved Multimodal Sequences
:star:code - FunQA: Towards Surprising Video Comprehension
:house:project - Vamos: Versatile Action Models for Video Understanding
:star:code
:house:project - Towards Neuro-Symbolic Video Understanding
:star:code - LongVLM: Efficient Long Video Understanding via Large Language Models
:star:code - VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
:star:code
:house:project - Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding
- VideoAgent: Long-form Video Understanding with Large Language Model as Agent
🤗huggingface - InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
:star:code - Text-Conditioned Resampler For Long Form Video Understanding
- Video Classification
- Video Parsing
- Video Frame Interpolation
- Video Class-Incremental Learning
- Video Copy Segment Localization
16.Medical Image Processing
- Adaptive Compressed Sensing with Diffusion-Based Posterior Sampling
:star:code - Co-synthesis of Histopathology Nuclei Image-Label Pairs using a Context-Conditioned Joint Diffusion Model
- Identity-Consistent Diffusion Network for Grading Knee Osteoarthritis Progression in Radiographic Imaging
- Multistain Pretraining for Slide Representation Learning in Pathology
:star:code - Energy-induced Explicit quantification for Multi-modality MRI fusion
:star:code - Brain-ID: Learning Contrast-agnostic Anatomical Representations for Brain Imaging
:star:code - CardiacNet: Learning to Reconstruct Abnormalities for Cardiac Disease Assessment from Echocardiogram Videos
:star:code (cardiac disease assessment) - Knowledge-enhanced Visual-Language Pretraining for Computational Pathology
:star:code - Effective Lymph Nodes Detection in CT Scans Using Location Debiased Query Selection and Contrastive Query Representation in Transformer (CT)
- Bridging the Pathology Domain Gap: Efficiently Adapting CLIP for Pathology Image Analysis with Limited Labeled Data (pathology image analysis)
- Unified Medical Image Pre-training in Language-Guided Common Semantic Space
- Rethinking Deep Unrolled Model for Accelerated MRI Reconstruction
:star:code - Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation
:star:code (semi-supervised histopathology segmentation) - Histopathology Image Classification
- Whole Slide Image Classification
- DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification
:star:code - Pathology-knowledge Enhanced Multi-instance Prompt Learning for Few-shot Whole Slide Image Classification
- Snuffy: Efficient Whole Slide Image Classifier
- Norma: A Noise Robust Memory-Augmented Framework for Whole Slide Image Classification
:star:code - Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification
:star:code
- Medical Image Segmentation
- FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification
:house:project
:star:code - PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation
:star:code - The Devil is in the Statistics: Mitigating and Exploiting Statistics Difference for Generalizable Semi-supervised Medical Image Segmentation
:star:code - Domesticating SAM for Breast Ultrasound Image Segmentation via Spatial-frequency Fusion and Uncertainty Correction
:star:code - Gradient-Aware for Class-Imbalanced Semi-supervised Medical Image Segmentation
:star:code - AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking
:star:code - Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation
:star:code - I-MedSAM: Implicit Medical Image Segmentation with Segment Anything
:star:code - VP-SAM: Taming Segment Anything Model for Video Polyp Segmentation via Disentanglement and Spatio-temporal Side Network
:star:code (polyp segmentation)
- Medical Image Registration
- Medical Report Generation
- X-ray
- Medical Robotics
- Biomedical Imaging
- CT
15.GAN/Image Synthesis
- Diffusion Models as Data Mining Tools
:star:code - ProCreate, Don't Reproduce! Propulsive Energy Diffusion for Creative Generation
:house:project - Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation
:house:project - HiEI: A Universal Framework for Generating High-quality Emerging Images from Natural Images
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models
:star:code - Score Distillation Sampling with Learned Manifold Corrective
- CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models
:house:project - EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
- Inf-DiT: Upsampling any-resolution image with memory-efficient diffusion transformer
:star:code - The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization
:house:project - MONTAGE: Monitoring Training for Attribution of Generative Diffusion Models
- TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling
- Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images
- Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation
:house:project - OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models
:star:code - DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment
- V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation
- GAN
- CLR-GAN: Improving GANs Stability and Quality via Consistent Latent Representation and Reconstruction
:star:code - A Closer Look at GAN Priors: Exploiting Intermediate Features for Enhanced Model Inversion Attacks
:star:code - Distilling Diffusion Models into Conditional GANs
- Exploring Guided Sampling of Conditional GANs
:star:code - Learning 3D-aware GANs from Unposed Images with Template Feature Field
:house:project
- Diffusion
- Measuring Style Similarity in Diffusion Models
:star:code - Do text-free diffusion models learn discriminative visual representations?
- Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models
:star:code - ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model
- HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models
:star:code - Chains of Diffusion Models
- To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now
:star:code - FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation
:house:project - Beta-Tuned Timestep Diffusion Model
- SMooDi: Stylized Motion Diffusion Model
:house:project - Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation
:star:code - Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models
:star:code
:house:project - Implicit Concept Removal of Diffusion Models
:house:project - ZigMa: A DiT-style Zigzag Mamba Diffusion Model
:star:code - ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement
:star:code
:house:project - Timestep-Aware Correction for Quantized Diffusion Models
- Shapefusion: 3D localized human diffusion models
:house:project - MVDD: Multi-View Depth Diffusion Models
- SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
:star:code
:house:project
:tv:video - Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model
:star:code - Compensation Sampling for Improved Convergence in Diffusion Models
:star:code - ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction
:star:code
:star:code - Self-Guided Generation of Minority Samples Using Diffusion Models
:star:code
- Zero-Shot Adaptation for Approximate Posterior Sampling of Diffusion Models in Inverse Problems
:star:code - LogoSticker: Inserting Logos into Diffusion Models for Customized Generation
:star:code - Texture Synthesis
- Image Synthesis
- Editable Image Elements for Controllable Synthesis
:house:project - Assessing Sample Quality via the Latent Space of Generative Models
:star:code - SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior
:house:project - Zero-shot Text-guided Infinite Image Synthesis with LLM guidance
- ∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions
:star:code
:star:code - EpipolarGAN: Omnidirectional Image Synthesis with Explicit Camera Control
- LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model
- Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis
- Label-free Neural Semantic Image Synthesis
- Improving image synthesis with diffusion-negative sampling
- SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis
:house:project - 2S-ODIS: Two-Stage Omni-Directional Image Synthesis by Geometric Distortion Correction
:star:code - Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis
- FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
:star:code
- Image Generation
- MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
:star:code
:house:project - Few-Shot Image Generation by Conditional Relaxing Diffusion Inversion
- Context Diffusion: In-Context Aware Image Generation
:house:project - Few-shot Defect Image Generation based on Consistency Modeling
:star:code - Linearly Controllable GAN: Unsupervised Feature Categorization and Decomposition for Image Generation and Manipulation
- PanoFree: Tuning-Free Holistic Multi-view Image Generation with Cross-view Self-Guidance
- AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation
:star:code - AccDiffusion: An Accurate Method for Higher-Resolution Image Generation
:house:project
:thumbsup:Achieves high-resolution image generation without repetition - Towards Reliable Advertising Image Generation Using Human Feedback
:star:code - StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion
:star:code - Model-agnostic Origin Attribution of Generated Images with Few-shot Examples
- Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance
:star:code - Tuning-Free Image Customization with Image and Text Guidance
:house:project - Collaborative Control for Geometry-Conditioned PBR Image Generation
:house:project - DiffiT: Diffusion Vision Transformers for Image Generation
:star:code - MultiGen: Zero-shot Image Generation from Multi-modal Prompts
- Accelerating Image Generation with Sub-path Linear Approximation Model
:star:code
- Video Generation
- FreeInit: Bridging Initialization Gap in Video Diffusion Models
:star:code
:house:project - HARIVO: Harnessing Text-to-Image Models for Video Generation
:house:project - SignGen: End-to-End Sign Language Video Generation with Latent Diffusion
:star:code - DragAnything: Motion Control for Anything using Entity Representation
:star:code
:house:project - Physics-Based Interaction with 3D Objects via Video Generation
:star:code
:house:project - DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model
:house:project - Photorealistic Video Generation with Diffusion Models
:house:project - DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
:star:code - PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control
:house:project - MoVideo: Motion-Aware Video Generation with Diffusion Models
:house:project - IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
:star:code - MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing
- Text-to-Video Quality Assessment
- Video Editing
- DragVideo: Interactive Drag-style Video Editing
:house:project - Video Editing via Factorized Diffusion Distillation
- SAVE: Protagonist Diversification with Structure Agnostic Video Editing
:house:project - DNI: Dilutional Noise Initialization for Diffusion Video Editing
- DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency
- DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing
:house:project - Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
:star:code - Object-Centric Diffusion for Efficient Video Editing
:house:project
- Image Editing
- ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion
- Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing
🤗huggingface - COMPOSE: Comprehensive Portrait Shadow Editing
- Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation (editing)
- FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models
:star:code - ByteEdit: Boost, Comply and Accelerate Generative Image Editing
- RegionDrag: Fast Region-Based Image Editing with Diffusion Models
:star:code - 3DEgo: 3D Editing on the Go!
:star:code - View-Consistent 3D Editing with Gaussian Splatting
:house:project - Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors
:house:project - Watch Your Steps: Local Image and Scene Editing by Text Instructions
:house:project - Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts
:house:project - Free-Editor: Zero-shot Text-driven 3D Scene Editing
:house:project - InstructGIE: Towards Generalizable Image Editing
- Lazy Diffusion Transformer for Interactive Image Editing
:house:project - DATENeRF: Depth-Aware Text-based Editing of NeRFs
:star:code
:house:project - TurboEdit: Real-time text-based disentangled real image editing
- DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing
:star:code - StableDrag: Stable Dragging for Point-based Image Editing
- ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images (image editing)
- SwapAnything: Enabling Arbitrary Object Swapping in Personalized Image Editing
:house:project - Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models
:star:code - Robust-Wide: Robust Watermarking against Instruction-driven Image Editing
:star:code - FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing
- Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing
:star:code - RadEdit: stress-testing biomedical vision models via diffusion image editing
- Responsible Visual Editing
:star:code - 3D-GOI: 3D GAN Omni-Inversion for Multifaceted and Multi-object Editing
:house:project - Thinking Outside the BBox: Unconstrained Generative Object Compositing (object compositing)
- EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models
- Image-to-Video
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective
- R2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
:star:code - PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
:star:code - ZeroI2V: Zero-Cost Adaptation of Pre-Trained Transformers from Image to Video
- Text-to-Video
- E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness
:house:project - WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing
:house:project - MotionDirector: Motion Customization of Text-to-Video Diffusion Models
:star:code
:house:project - Factorizing Text-to-Video Generation by Explicit Image Conditioning
🤗huggingface - SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
:house:project - MEVG: Multi-event Video Generation with Text-to-Video Models
:house:project - Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models
:house:project - xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
- Text-to-3D
- Diverse Text-to-3D Synthesis with Augmented Text Embedding
:star:code
:house:project - LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis
:house:project - DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation
:star:code - DreamReward: Aligning Human Preference in Text-to-3D Generation
:house:project - DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors
:star:code - DreamReward: Text-to-3D Generation with Human Preference
:house:project - GVGEN: Text-to-3D Generation with Volumetric Representation
:star:code - WordRobe: Text-Guided Generation of Textured 3D Garments
:house:project - UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation
:star:code
:house:project - ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation
:star:code - CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model
:house:project - Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable Repainting
:house:project - VividDreamer: Invariant Score Distillation for Hyper-Realistic Text-to-3D Generation
- DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling
:star:code - HiFi-123: Towards High-fidelity One Image to 3D Content Generation
:house:project - JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation
- Connecting Consistency Distillation to Score Distillation for Text-to-3D Generation
:star:code - TPA3D: Triplane Attention for Fast Text-to-3D Generation
:house:project - DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation
:star:code
- Text-to-Image
- Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation
:star:code
:house:project - MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices
- PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control
:house:project - PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation
:star:code - Diffusion Soup: Model Merging for Text-to-Image Diffusion Models
- IMMA: Immunizing text-to-image Models against Malicious Adaptation
:star:code - Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers
:house:project - ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems
- Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation
:star:code - Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models
:star:code - Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models
- Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation
:star:code - Navigating Text-to-Image Generative Bias across Indic Languages
:house:project - Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion
:star:code - Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers
- DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators
:house:project - Harnessing Text-to-Image Diffusion Models for Category-Agnostic Pose Estimation
- MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models
:star:code
:house:project - Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models
- R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
:star:code - MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization
:star:code
:house:project - ReCON: Training-Free Acceleration for Text-to-Image Synthesis with Retrieval of Concept Prompt Trajectories
:house:project - LCM-Lookahead for Encoder-based Text-to-Image Personalization
:star:code
:house:project - Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models
- Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
:star:code - Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
:star:code
:thumbsup:DiffPNG achieves the best performance, demonstrating that T2I diffusion models can understand visual content at the phrase level - T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models
:star:code - Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation
:house:project - Latent Guard: a Safety Framework for Text-to-image Generation
:star:code - Getting it Right: Improving Spatial Consistency in Text-to-Image Models
:star:code
:house:project - Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention
:star:code - Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning
:star:code - PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion
:house:project - Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models
:star:code - MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation
:star:code - Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation
- ProTIP: Probabilistic Robustness Verification on Text-to-Image Diffusion Models against Stochastic Perturbation
- PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
- AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation
:star:code
:house:project - CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion
:star:code - Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models
- SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models
:house:project - TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models
:house:project - An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
- Adversarial Robustification via Text-to-Image Diffusion Models
:star:code - Stable Preference: Redefining training paradigm of human preference model for Text-to-Image Synthesis
- Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models
:house:project - Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression
- Image-to-Text
- Text-Video Alignment
- Image-Text Alignment
- 3D (Content) Generation
- Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation
:house:project - LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation
:house:project - LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation
:house:project - SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion
:house:project - VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
:house:project - Compress3D: a Compressed Latent Space for 3D Generation from a Single Image
- AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation
:house:project - Learn to Optimize Denoising Scores: A Unified and Improved Diffusion Prior for 3D Generation
:house:project
- Visual Text Rendering
- GIF Generation
- Layout Generation
- Layout-to-Image
- Image-to-Image Translation
- Image Translation
- Text-to-4D
- Video-to-4D
- Web Design
- Text-to-Garment
- Image Stylization
- StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models
:star:code - ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
:star:code
:house:project (style) - InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser
:star:code (stylization) - StyleCity: Large-Scale 3D Urban Scenes Stylization
:house:project (urban scene stylization) - Scene-Conditional 3D Object Stylization and Composition
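- Image Vectorization
- Video Stitching
- Text-to-Camera-Trajectory Generation
- Text-to-3D Scene
- Identity-Preserving Personalization
- Subject-Driven Generation
- Style-Content Disentanglement
- Text-to-Multi-Motion Generation
- Text-Driven 3D Editing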
- GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing
:star:code (text-driven 3D Gaussian Splatting editing)
- Image Interpolation
- Image Synthesis
- Image Animation
- LivePhoto: Real Image Animation with Text-guided Motion Control
:star:code (real image animation with text-guided motion control) - MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
:star:code
:house:project - ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model
:house:project - Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance
:star:code (human image animation) - TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models
:house:project
- Group Photo Synthesis
- Image Cropping
14.Image/Video Captioning
- DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism
- ControlCap: Controllable Region-level Captioning
:star:code (captioning) - MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description
:star:code (visual description) - Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
:star:code - CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation
:star:code - Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights
:star:code - BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
:star:code - View Selection for 3D Captioning via Diffusion Ranking
:star:code - Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
:house:project - HiFi-Score: Fine-grained Image Description Evaluation with Hierarchical Parsing Graphs (fine-grained image description)
- Video Captioning
- Dense Captioning
13.Image/Video Compression
- SAH-SCI: Self-Supervised Adapter for Efficient Hyperspectral Snapshot Compressive Imaging
- Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing
- Image Compression for Machine and Human Vision with Spatial-Frequency Adaptation
:star:code - Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model
- Rate-Distortion-Cognition Controllable Versatile Neural Image Compression
- BaSIC: BayesNet Structure Learning for Computational Scalable Neural Image Compression
:star:code - Region-Adaptive Transform with Segmentation Prior for Image Compression
:star:code - EGIC: Enhanced Low-Bit-Rate Generative Image Compression Guided by Semantic Segmentation
- Lagrangian Hashing for Compressed Neural Field Representations
:house:project - Latent Diffusion Prior Enhanced Deep Unfolding for Snapshot Spectral Compressive Imaging
:star:code (snapshot spectral compression) - Image Compression for Machine and Human Vision With Spatial-Frequency Adaptation
:star:code - Learned HDR Image Compression for Perceptually Optimal Storage and Display
:star:code - WeConvene: Learned Image Compression with Wavelet-Domain Convolution and Entropy Model
- Lossy Image Compression with Foundation Diffusion Models
- A Unified Image Compression Method for Human Perception and Multiple Vision Tasks
- Video Compression
- Hierarchical Separable Video Transformer for Snapshot Compressive Imaging
:star:code - A Simple Low-bit Quantization Framework for Video Snapshot Compressive Imaging
:star:code - Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression
- Long-term Temporal Context Gathering for Neural Video Compression
- Learned Rate Control for Frame-Level Adaptive Neural Video Compression via Dynamic Neural Network
- Video Decoding
- Snapshot Spectral Imaging
- Motion Estimation
12.Image Retrieval
- RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
:star:code
:house:project - AMES: Asymmetric and Memory-Efficient Similarity Estimation for Instance-level Retrieval
:star:code - IRGen: Generative Modeling for Image Retrieval
:star:code - FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos
- Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval
:star:code - FreestyleRet: Retrieving Images from Style-Diversified Queries
:star:code - Sketch-Based Image Retrieval
- Video-Text Retrieval
- Image-Text Retrieval
- Video Retrieval
- Nearest Neighbor Search
11.Image Segmentation
- Occlusion-Aware Seamless Segmentation
:star:code - SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding
:star:code
:thumbsup:New SOTA for visual grounding! SegVG: converting object bounding boxes for visual grounding into segmentation signals (open-sourced) - Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts
- Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
:star:code
:house:project - Segment and Recognize Anything at Any Granularity
:star:code - Enriching Information and Preserving Semantic Consistency in Expanding Curvilinear Object Segmentation Datasets
:star:code - Semi-supervised Segmentation of Histopathology Images with Noise-Aware Topological Consistency
:star:code - CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings
:star:code - From Pixels to Objects: A Hierarchical Approach for Part and Object Segmentation Using Local and Global Aggregation
- CC-SAM: Enhancing SAM with Cross-feature Attention and Context for Ultrasound Image Segmentation
- Unsupervised Moving Object Segmentation with Atmospheric Turbulence
- Lite-SAM Is Actually What You Need for Segment Everything
- Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
:star:code - Can Textual Semantics Mitigate Sounding Object Segmentation Preference?
:star:code - RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation
:star:code - SPIN: Hierarchical Segmentation with Subpart Granularity in Natural Images
:star:code - CC-SAM: SAM with Cross-feature Attention and Context for Ultrasound Image Segmentation
- FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally
:star:code - SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis
:house:project - PQ-SAM: Post-training Quantization for Segment Anything Model
- Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images
:house:project - Lite-SAM Is Actually What You Need for Segment Everything
- SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation
:star:code - A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties
:star:code - Placing Objects in Context via Inpainting for Out-of-distribution Segmentation
:star:code - Rethinking and Improving Visual Prompt Selection for In-Context Learning Segmentation Framework
:star:code - Better Call SAL: Towards Learning to Segment Anything in Lidar
:star:code - Matting
- 3D Segmentation
- Bayesian Self-Training for Semi-Supervised 3D Segmentation
- Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels
:house:project - EgoLifter: Open-world 3D Segmentation for Egocentric Perception
:house:project
🤗huggingface - View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields
:star:code
- Video Segmentation
- Instance Segmentation
- Unleashing the Power of Prompt-driven Nucleus Instance Segmentation
:star:code - 3D Instance Segmentation
- Part2Object: Hierarchical Unsupervised 3D Instance Segmentation
:star:code - SAM-guided Graph Cut for 3D Instance Segmentation
:star:code
:house:project - Continual Learning and Unknown Object Discovery in 3D Scenes via Self-Distillation
:star:code (instance segmentation) - OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
:star:code
- Unsupervised Instance Segmentation
- Open-World Instance Segmentation
- Panoptic Segmentation
- Open Panoramic Segmentation
:star:code - A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting
:star:code - Point-supervised Panoptic Segmentation via Estimating Pseudo Labels from Learnable Distance
- Strike a Balance in Continual Panoptic Segmentation
:star:code - 3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation
- Semantic Segmentation
- Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather
:star:code - Exploring Reliable Matching with Phase Enhancement for Night-time Semantic Segmentation
- MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment
- Sparse Refinement for Efficient High-Resolution Semantic Segmentation
:house:project - Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation
:star:code - Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation
:star:code - Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation
:star:code - On-the-fly Category Discovery for LiDAR Semantic Segmentation
:star:code - On the Viability of Monocular Depth Pre-training for Semantic Segmentation
- Efficient Active Domain Adaptation for Semantic Segmentation by Selecting Information-rich Superpixels
:star:code - Distributed Semantic Segmentation with Efficient Joint Source and Task Decoding
- FREST: Feature RESToration for Semantic Segmentation under Multiple Adverse Conditions
- Make a Strong Teacher with Label Assistance: A Novel Knowledge Distillation Approach for Semantic Segmentation
:star:code - MeshSegmenter: Zero-Shot Mesh Segmentation via Texture Synthesis
:star:code - Towards Reliable Evaluation and Fast Training of Robust Semantic Segmentation Models
:star:code - ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation
:star:code - Evaluating the Adversarial Robustness of Semantic Segmentation: Trying Harder Pays Off
:star:code - Open-Vocabulary RGB-Thermal Semantic Segmentation
:star:code - Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation
- Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities
- SFPNet: Sparse Focal Point Network for Semantic Segmentation on General LiDAR Point Clouds
:house:project
:star:code - Reliability in Semantic Segmentation: Can We Use Synthetic Data?
:star:code - Cs2K: Class-specific and Class-shared Knowledge Guidance for Incremental Semantic Segmentation
- MeshSegmenter: Zero-Shot Mesh Semantic Segmentation via Texture Synthesis
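:star:code - 3D Semantic Segmentation
- Cross-Domain Semantic Segmentation
- Unsupervised Semantic Segmentation
- Semi-Supervised Semantic Segmentation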
- Beyond Pixels: Semi-Supervised Semantic Segmentation with a Multi-scale Patch-based Multi-Label Classifier
:star:code - Weighting Pseudo-Labels via High-Activation Feature Index Similarity and Object Detection for Semi-Supervised Segmentation
:star:code - SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance
:star:code
- Weakly Supervised Semantic Segmentation
- Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation
:star:code
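Knowledge transfer via simulated inter-image erasing: weakly supervised semantic segmentation no longer suffers from over-expansion, enabling precise object localization! - Tendency-driven Mutual Exclusivity for Weakly Supervised Incremental Semantic Segmentation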
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation
- Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras
:star:code - 3D weakly supervised semantic segmentation with 2D vision-language guidance
- Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic Segmentation
- Phase Concentration and Shortcut Suppression for Weakly Supervised Semantic Segmentation
:star:code - DHR: Dual Features-Driven Hierarchical Rebalancing in Inter- and Intra-Class Regions for Weakly-Supervised Semantic Segmentation
:star:code - 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance
:star:code - Diffusion-Guided Weakly Supervised Semantic Segmentation
:star:code
- Domain-Adaptive Semantic Segmentation
- Domain-Generalized Semantic Segmentation
- Class-Incremental Semantic Segmentation
- Background Adaptation with Residual Modeling for Exemplar-Free Class-Incremental Semantic Segmentation
:star:code - Mitigating Background Shift in Class-Incremental Semantic Segmentation
:star:code - Early Preparation Pays Off: New Classifier Pre-tuning for Class Incremental Semantic Segmentation
:star:code
- Zero-Shot Semantic Segmentation
- Open-Vocabulary Semantic Segmentation
- CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation
:star:code
:house:project - Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
:star:code - In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
- ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation
:star:code
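- Part Segmentation
- Motion Segmentation
- Smoke Segmentation
- Line Segmentation
- Scene Parsing
- Interactive Segmentation
- Few-Shot Segmentation
- Camouflaged Object Segmentation
- Reference Image Segmentation
- Referring Image Segmentation
- Scene Text Segmentation
- Open-Vocabulary Segmentation
- Referring Expression Segmentation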
- VIS
- VOS
- ActionVOS: Actions as Prompts for Video Object Segmentation
:star:code - VISA: Reasoning Video Object Segmentation via Large Language Model
:star:code - PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation
:star:code - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation
:star:code - Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation
:star:code - OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework
- Spatial-Temporal Multi-level Association for Video Object Segmentation
10.Image Classification
- Labeled Data Selection for Category Discovery
- Active Generation for Image Classification
:star:code - Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
:house:project - Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition
- Wavelet Convolutions for Large Receptive Fields
:star:code - Momentum Auxiliary Network for Supervised Local Learning
:star:code - An accurate detection is not all you need to combat label noise in web-noisy datasets
:star:code - Dual-stage Hyperspectral Image Classification Model with Spectral Supertoken
:star:code - DEPICT: Diffusion-Enabled Permutation Importance for Image Classification Tasks
- NOVUM: Neural Object Volumes for Robust Object Classification
:star:code - EntAugment: Entropy-Driven Adaptive Data Augmentation Framework for Image Classification
:star:code - Distribution-Aware Robust Learning from Long-Tailed Data with Noisy Labels
:star:code - Discovering Unwritten Visual Classifiers with Large Language Models
- Generalized Category Discovery
- SelEx: Self-Expertise in Fine-Grained Generalized Category Discovery
:star:code - Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery
:star:code (Generalized Category Discovery, GCD) - Learning to Distinguish Samples for Generalized Category Discovery
:star:code - PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery
:star:code - Online Continuous Generalized Category Discovery
:star:code (generalized category discovery) - Category Adaptation Meets Projected Distillation in Generalized Continual Category Discovery
:star:code
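- Multi-Label Image Classification
- Few-Shot Classification
- Zero-Shot Classification
- Multi-Label Recognition
- Long-Tailed Recognition
- Fine-Grained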
- On Learning Discriminative Features from Synthesized Data for Self-Supervised Fine-Grained Visual Recognition
- A Rotation-invariant Texture ViT for Fine-Grained Recognition of Esophageal Cancer Endoscopic Ultrasound Images
:star:code - Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth
9.Image/Video Processing
- ReNoise: Real Image Inversion Through Iterative Noising
:star:code
:house:project - UniProcessor: A Text-induced Unified Low-level Image Processor
- Restoration
- MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration
:star:code - Panel-Specific Degradation Representation for Raw Under-Display Camera Image Restoration
:star:code - Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks
- A Comparative Study of Image Restoration Networks for General Backbone Network Design
- GRIDS: Grouped Multiple-Degradation Restoration with Image Degradation Similarity
- Restoring Images in Adverse Weather Conditions via Histogram Transformer
- InstructIR: High-Quality Image Restoration Following Human Instructions
:star:code - Diffusion Prior-Based Amortized Variational Inference for Noisy Inverse Problems
:star:code - Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models
- Teaching Tailored to Talent: Adverse Weather Restoration via Prompt Pool and Depth-Anything Constraint
- Seeing the Unseen: A Frequency Prompt Guided Transformer for Image Restoration
:star:code - MambaIR: A Simple Baseline for Image Restoration with State-Space Model
:star:code
:thumbsup:MambaIR: a Mamba-based baseline model for image restoration - AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion
:star:code - SPIRE: Semantic Prompt-Driven Image Restoration
:house:project - Efficient Cascaded Multiscale Adaptive Network for Image Restoration
- Restore Anything with Masks: Leveraging Mask Image Modeling for Blind All-in-One Image Restoration
:star:code - When Fast Fourier Transform Meets Transformer for Image Restoration
:star:code - Osmosis: RGBD Diffusion Prior for Underwater Image Restoration
:house:project - Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration
:house:project - DiffBIR: Toward Blind Image Restoration with Generative Diffusion Prior
:star:code - MetaWeather: Few-Shot Weather-Degraded Image Restoration
:star:code - Depth-Aware Blind Image Decomposition for Real-World Adverse Weather Recovery
- Inpainting
- A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting
:star:code
:house:project - Improving Text-guided Object Inpainting with Semantic Pre-inpainting
:star:code - BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion
:house:project
- Deraining
- Denoising
- TTT-MIM: Test-Time Training with Masked Image Modeling for Denoising Distribution Shifts
:star:code - Region-Aware Sequence-to-Sequence Learning for Hyperspectral Denoising
:star:code - DualDn: Dual-domain Denoising via Differentiable ISP
:star:code - Exploiting Dual-Correlation for Multi-frame Time-of-Flight Denoising
:star:code - EDformer: Transformer-Based Event Denoising Across Varied Noise Levels
- denoiSplit: a method for joint microscopy image splitting and unsupervised denoising (denoising)
- Asymmetric Mask Scheme for Self-Supervised Real Image Denoising
:star:code - Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts
:house:project - Enhancing Plausibility Evaluation for Generated Designs with Denoising Autoencoder
:star:code
- Dehazing
- Deblurring
- Deblur e-NeRF: NeRF from Motion-Blurred Events under High-speed or Low-light Conditions
:star:code - UniINR: Event-guided Unified Rolling Shutter Correction, Deblurring, and Interpolation
:star:code - Blind image deblurring with noise-robust kernel estimation
:star:code - Motion Aware Event Representation-driven Image Deblurring(https://github.com/ZhijingS/DA_event_deblur)
- BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting
:star:code
- Deconvolution
- Reflection Removal
- Artifact Removal
- Demoiréing
- Demosaicing
- Object Removal
- Outpainting
- Image Retouching
- Image Enhancement
- LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models
:star:code - LMT-GP: Combined Latent Mean-Teacher and Gaussian Process for Semi-supervised Low-light Image Enhancement
:star:code - RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement
:star:code - Image-adaptive 3D Lookup Tables for Real-time Image Enhancement with Bilateral Grids
:star:code - NamedCurves: Learned Image Enhancement via Color Naming
- Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography
:star:code - GLARE: Low Light Image Enhancement via Generative Latent Feature based Codebook Retrieval
:star:code
:thumbsup:GLARE leverages an external normal-light prior to achieve realistic low-light enhancement! - Fast Context-Based Low-Light Image Enhancement via Neural Implicit Representations
:star:code - Unveiling Advanced Frequency Disentanglement Paradigm for Low-Light Image Enhancement
:star:code
- Image Quality Assessment
- DSMix: Distortion-Induced Sensitivity Map Based Pre-training for No-Reference Image Quality Assessment
:star:code - A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
- Towards Open-ended Visual Quality Comparison
:star:code
:thumbsup:Co-Instruct: teaching general-purpose large multimodal models to compare visual quality - PromptIQA: Boosting the Performance and Generalization for No-Reference Image Quality Assessment via Prompts (no-reference image quality assessment)
- Assessing UHD Image Quality from Aesthetics, Distortions, and Saliency
:star:code - Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models
:house:project - DSMix: Distortion-Induced Sensitivity Map Based Pre-training for No-Reference Image Quality Assessment
:star:code
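- Image Aesthetic Quality Assessment
- Video Restoration
- Video Colorization
- Video Enhancement
- Video Deraining
- Video Denoising
- Video Desnowing
- Video Deblurring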
- Domain-adaptive Video Deblurring via Test-time Blurring
:star:code - CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring
:star:code - Cross-Modal Temporal Alignment for Event-guided Video Deblurring
:star:code - Towards Real-world Event-guided Low-light Video Enhancement and Deblurring
:star:code - Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model
:star:code
- Video Deflickering
- Video Demosaicing
- Video Quality Enhancement
- Relighting
8.Super-Resolution
- Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design
:star:code - SMFANet: A Lightweight Self-Modulation Feature Aggregation Network for Efficient Image Super-Resolution
:star:code - Towards Robust Full Low-bit Quantization of Super Resolution Networks
- BurstM: Deep Burst Multi-scale SR using Fourier Space with Optical Flow
:star:code - HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution
- Pairwise Distance Distillation for Unsupervised Real-World Image Super-Resolution
:star:code - UCIP: A Universal Framework for Compressed Image Super-Resolution using Dynamic Prompt
:star:code - Accelerating Image Super-Resolution Networks with Pixel-Level Classification
:star:code - Rethinking Image Super-Resolution from Training Data Perspectives
- Spatially-Variant Degradation Model for Dataset-free Super-resolution
:star:code - Learning Exhaustive Correlation for Spectral Super-Resolution: Where Spatial-Spectral Attention Meets Linear Dependence
- Contourlet Residual for Prompt Learning Enhanced Infrared Image Super-Resolution
:star:code - Confidence-Based Iterative Generation for Real-World Image Super-Resolution
:star:code - Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution
- Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization
:star:code - XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution
:star:code - AdaDiffSR: Adaptive Region-aware Dynamic acceleration Diffusion Model for Real-World Image Super-Resolution
- Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks
:star:code - Rethinking Image Super Resolution from Training Data Perspectives
:star:code - MTKD: Multi-Teacher Knowledge Distillation for Image Super-Resolution
:star:code - OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model
- Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution
:star:code - You Only Need One Step: Fast Super-Resolution with Stable Diffusion via Scale Distillation
:star:code - A New Dataset and Framework for Real-World Blurred Images Super-Resolution
:star:code - Scene Text Image Super-Resolution
- VSR
- Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors
:star:code - RealViformer: Investigating Attention for Real-World Video Super-Resolution
:star:code - Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models
:star:code - SuperGaussian: Repurposing Video Models for 3D Super Resolution
:house:project - Event-Adapted Video Super-Resolution
- Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution
:star:code
7.Object Detection
- Can OOD Object Detectors Learn from Foundation Models?
:star:code - Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes
:star:code - Distilling Knowledge from Large-Scale Image Models for Object Detection
- DeTra: A Unified Model for Object Detection and Trajectory Forecasting
- Modality Translation for Object Detection Adaptation without forgetting prior knowledge
- OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection
- LEROjD: Lidar Extended Radar-Only Object Detection
:star:code - Bucketed Ranking-based Losses for Efficient Training of Object Detectors
:star:code - Plain-Det: A Plain Multi-Dataset Object Detector
:star:code - On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines
:star:code - Towards Open-World Object-based Anomaly Detection via Self-Supervised Outlier Synthesis
:star:code - Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
- PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
:star:code - Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes
:star:code - Relation DETR: Exploring Explicit Position Relation Prior for Object Detection
:star:code - Bridge Past and Future: Overcoming Information Asymmetry in Incremental Object Detection
:star:code - Modality Translation for Object Detection Adaptation Without Forgetting Prior Knowledge
:star:code - T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
:star:code - Fine-grained Dynamic Network for Generic Event Boundary Detection
:star:code - CSOT: Cross-Scan Object Transfer for Semi-Supervised LiDAR Object Detection
- Bayesian Detector Combination for Object Detection with Crowdsourced Annotations
:star:code - Simplifying Source-Free Domain Adaptation for Object Detection: Effective Self-Training Strategies and Performance Insights
:star:code - Out-of-Bounding-Box Triggers: A Stealthy Approach to Cheat Object Detectors
:star:code - A Simple Background Augmentation Method for Object Detection with Diffusion Model
- Look Around and Learn: Self-Training Object Detection by Exploration
:star:code - Co-Student: Collaborating Strong and Weak Students for Sparsely Annotated Object Detection
:star:code - Benchmarking Object Detectors with COCO: A New Path Forward
:sunflower:dataset - DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion
:star:code - Integer-Valued Training and Spike-driven Inference Spiking Neural Network for High-performance and Energy-efficient Object Detection
:star:code - YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
:star:code - Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection
- Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation
- GRA: Detecting Oriented Objects through Group-wise Rotating and Attention
- Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection
:star:code - Dynamic Retraining-Updating Mean Teacher for Source-Free Object Detection
:star:code - Zero-Shot Detection of AI-Generated Images
:star:code
:house:project - MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos
:star:code - Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images
:house:project - Rethinking Features-Fused-Pyramid-Neck for Object Detection
:star:code - 3D Object Detection
- Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression
:star:code - Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene
- SparseLIF: High-Performance Sparse LiDAR-Camera Fusion for 3D Object Detection
- MonoWAD: Weather-Adaptive Diffusion Model for Robust Monocular 3D Object Detection
:star:code - Transfer Learning from Simulated to Real Scenes for Monocular 3D Object Detection
:star:code - Domain Generalization of 3D Object Detection by Density-Resampling
:star:code - Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection
:star:code - LiDAR-based All-weather 3D Object Detection via Prompting and Distilling 4D Radar
- Towards Stable 3D Object Detection
- RecurrentBEV: A Long-term Temporal Fusion Framework for Multi-view 3D Detection
:star:code - LISO: Lidar-only Self-Supervised 3D Object Detection
:star:code - Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation
:star:code - SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image
- OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation
:star:code - Better Regression Makes Better Test-time Adaptive 3D Object Detection
- Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments
:star:code - Diff3DETR: Agent-based Diffusion Model for Semi-supervised 3D Object Detection
- Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance
:star:code - SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras
:star:code - CMD: A Cross Mechanism Domain Adaptation Dataset for 3D Object Detection
:star:code
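:thumbsup:DIG mitigates the point-cloud discrepancies introduced by different sensor mechanisms in terms of density, intensity, and geometry, significantly improving the performance of domain adaptation algorithms. - Ray Denoising: Depth-aware Hard Negative Sampling for Multi-view 3D Object Detection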
:star:code - OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection
:star:code - LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection
:star:code - MonoTTA: Fully Test-Time Adaptation for Monocular 3D Object Detection
:star:code - FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection
:star:code - General Geometry-aware Weakly Supervised 3D Object Detection
:star:code - Interactive 3D Object Detection with Prompts
- Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation
- Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection
:star:code - TCC-Det: Temporarily consistent cues for weakly-supervised 3D detection
:star:code - GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection
:star:code
- Small Object Detection
- IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection
- Visible and Clear: Finding Tiny Objects in Difference Map
- DQ-DETR: DETR with Dynamic Query for Tiny Object Detection
:star:code - 3D Small Object Detection with Dynamic Spatial Pruning
:star:code
:thumbsup:DSPDet3D: efficient 3D small object detection based on dynamic spatial pruning
- Camouflaged Object Detection
- CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection
:thumbsup:Effectively reduces pixel-level and instance-level noise - Learning Camouflaged Object Detection from Noisy Pseudo Label
:star:code - Just a Hint: Point-Supervised Camouflaged Object Detection
- SAM-COD: SAM-guided Unified Framework for Weakly-Supervised Camouflaged Object Detection
- Frequency-Spatial Entanglement Learning for Camouflaged Object Detection
:star:code - FocusDiffuser: Perceiving Local Disparities for Camouflaged Object Detection
:star:code
- Long-Tailed Object Detection
- Salient Object Detection
- Domain-Adaptive Object Detection
- Few-Shot Object Detection
- SMILe: Leveraging Submodular Mutual Information For Robust Few-Shot Object Detection
:house:project - Adaptive Multi-task Learning for Few-shot Object Detection
:star:code - Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector
:house:project
:thumbsup:New CD-FSOD dataset and new CD-ViTO method for cross-domain few-shot object detection (data and code are both open-sourced)
- Co-Salient Object Detection
- Open-Vocabulary Object Detection
- Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection
:star:code - LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction
- MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection
:star:code - Region-centric Image-Language Pretraining for Open-Vocabulary Detection
:star:code - CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection
:star:code
- Watermark Detection
- Shadow Detection
- Open-Set Recognition
- Object Localization
6.Object Tracking
- Local All-Pair Correspondence for Point Tracking
:star:code
:star:code - Track Everything Everywhere Fast and Robustly
:star:code
:house:project - CoTracker: It is Better to Track Together
:house:project - DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video
:star:code
:house:project - Decomposition Betters Tracking Everything Everywhere
:star:code - Self-Supervised Any-Point Tracking by Contrastive Random Walks
:house:project
:star:code - TAPTR: Tracking Any Point with Transformers as Detection
:star:code - MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping
:house:project - SPAMming Labels: Efficient Annotations for the Trackers of Tomorrow
- SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking
:star:code - OneTrack: Demystifying the Conflict Between Detection and Tracking in End-to-End 3D Trackers
- Exploring the Feature Extraction and Relation Modeling For Light-Weight Transformer Tracking
:star:code - Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL
- 3D Object Tracking
- Multi-Object Tracking
- Lost and Found: Overcoming Detector Failures in Online Multi-Object Tracking
:star:code - Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Object Appearance Graphs
- Beyond MOT: Semantic Multi-Object Tracking
:star:code - PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking
- VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking
:house:project
- Cell Tracking
5.OCR
- Parrot Captions Teach CLIP to Spot Text
:star:code - WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting
- FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
:house:project - Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors
:star:code - Handwritten Text Detection
- Align, Minimize and Diversify: A Source-Free Unsupervised Domain Adaptation Method for Handwritten Text Recognition
- PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer
:star:code
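:thumbsup:Shanghai Jiao Tong University releases PosFormer, which optimizes a position recognition task to assist expression recognition and sets a new SOTA for complex formula recognition - Elegantly Written: Disentangling Writer and Character Styles for Enhancing Online Chinese Handwriting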
- NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition
- Handwritten Text Synthesis
- Scene Text Removal
- Document Understanding
- Text Segmentation
- Text Synthesis
- Text Restoration
4.Pose(Pose Estimation)
- X-Pose: Detecting Any Keypoints
:star:code - VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space
:house:project - Expressive Whole-Body 3D Gaussian Avatar
:star:code - GTPT: Group-based Token Pruning Transformer for Efficient Human Pose Estimation
- PoseSOR: Human Pose Can Guide Our Attention
:star:code - COSMU: Complete 3D human shape from monocular unconstrained images
- Modeling and Driving Human Body Soundfields through Acoustic Primitives
:house:project - Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions
:star:code - SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views
- PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation
- PoseAugment: Generative Human Pose Data Augmentation with Physical Plausibility for IMU-based Motion Capture
:star:code - HPE-Li: WiFi-enabled Lightweight Dual Selective Kernel Convolution for Human Pose Estimation
- EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere
:house:project - You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception
:star:code - Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence
- Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects
- Human Pose Recognition via Occlusion-Preserving Abstract Images
- Text-Driven Human Generation
- Multi-Person Pose Prediction
- 3D Human Pose Estimation
- MPL: Lifting 3D Human Pose from Multi-view 2D Poses
:star:code - RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark
:house:project - AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos
:star:code
:house:project - Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation
:star:code - 3D Human Pose Estimation via Non-Causal Retentive Networks
:star:code - UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues
- Occlusion Handling in 3D Human Pose Estimation with Perturbed Positional Encoding
- RePOSE: 3D Human Pose Estimation via Spatio-Temporal Depth Relational Consistency
:star:code - RT-Pose: A 4D Radar-Tensor based 3D Human Pose Estimation and Localization Benchmark
🤗huggingface - EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation
:star:code - 3DSA: Multi-View 3D Human Pose Estimation With 3D Space Attention Mechanisms
- WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation
- Rotated Orthographic Projection for Self-Supervised 3D Human Pose Estimation
- NICP: Neural ICP for 3D Human Registration at Scale
:house:project
- Human Mesh Recovery
- Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot
:star:code - Divide and Fuse: Body Part Mesh Recovery from Partially Visible Human Images
- Global-to-Pixel Regression for Human Mesh Recovery
- WindPoly: Polygonal Mesh Reconstruction via Winding Numbers
:house:project - Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses
:star:code
- 3D Human Texture Generation
- 3D Human Generation
- StructLDM: Structured Latent Diffusion for 3D Human Generation
:star:code
:house:project
:thumbsup:A new paradigm for 3D digital human generation from Nanyang Technological University: structured diffusion models - Text Motion Translator: A Bi-Directional Model for Enhanced 3D Human Motion Generation from Open-Vocabulary Descriptions
- Text to Layer-wise 3D Clothed Human Generation
:house:project - SemanticHuman-HD: High Resolution Semantic disentangled 3D Human Generation
:house:project
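- Human Reconstruction
- Motion Capture
- Sign Language Recognition
- Hand Mesh
- 3D Hand Sequence Recovery
- 3D Hand Reconstruction
- Hand Pose Estimation
- Hand Motion Prediction
- Head Pose Estimation
- Hand-Held Object Reconstruction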
- 4D Head Capture
- Sign Language Video Generation
3.Face
- Task-adaptive Q-Face
- Faceptor: A Generalist Model for Face Perception
:star:code - A Light Stage on Every Desk
:house:project - Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control
:star:code
:house:project - ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer
:star:code - Facial Affective Behavior Analysis with Instruction Tuning
:star:code
:house:project - Arc2Face: A Foundation Model for ID-Consistent Human Faces
:star:code
:house:project - GAMMA-FACE: GAussian Mixture Models Amend Diffusion Models for Bias Mitigation in Face Images
- GRAPE: Generalizable and Robust Multi-view Facial Capture
- High-Quality Mesh Blendshape Generation from Face Videos via Neural Inverse Rendering
:star:code - Face Swapping
- Face Blur
- Face Recognition
- Towards Certifiably Robust Face Recognition
- AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition
:star:code - ARoFace: Alignment Robustness to Improve Low-Quality Face Recognition
:star:code - Personalized Privacy Protection Mask Against Unauthorized Facial Recognition
- MST-KD: Multiple Specialized Teachers Knowledge Distillation for Fair Face Recognition
- AdversariaLeak: External Information Leakage Attack Using Adversarial Samples on Face Recognition Systems
- Face Clustering
- Face Reconstruction
- Facial Expression
- Face Editing
- Face Animation
- Talking Head Synthesis
- ScanTalk: 3D Talking Heads from Unregistered Scans
:star:code - EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis
:house:project - EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head
:star:code - All You Need is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation
:star:code - Audio-driven Talking Face Generation with Stabilized Synchronization Loss
- Head360: Learning a Parametric 3D Full-Head for Free-View Synthesis in 360°
:star:code - S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis
- Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing
:house:project - Tri2-plane: Thinking Head Avatar via Feature Pyramid
:house:project - Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos
:house:project - TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting
:star:code
- Animated Head Avatars
- Face Super-Resolution
- Face Anti-Spoofing
- TF-FAS: Twofold-Element Fine-Grained Semantic Guidance for Generalizable Face Anti-Spoofing
:star:code
:thumbsup:Improves generalization through twofold-element fine-grained semantic guidance - DiffFAS: Face Anti-Spoofing via Generative Diffusion Models
:star:code - Towards Unified Representation of Invariant-Specific Features in Missing Modality Face Anti-Spoofing
- Bottom-Up Domain Prompt Tuning for Generalized Face Anti-Spoofing
- Head Synthesis
- Emotion Recognition
- Facial Action Unit Detection
- Fake Face Detection
2.3D Visual
- GroundUp: Rapid Sketch-Based 3D City Massing
:house:project - Ray-Distance Volume Rendering for Neural Scene Reconstruction
- HSR: Holistic 3D Human-Scene Reconstruction from Monocular Videos
:house:project - Decomposition of Neural Discrete Representations for Large-Scale 3D Mapping
:star:code - BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
:house:project - Temporal Event Stereo via Joint Learning with Stereoscopic Flow
:star:code - GenRC: Generative 3D Room Completion from Sparse Image Collections
:star:code - Single-Photon 3D Imaging with Equi-Depth Photon Histograms
:house:project - Viewpoint textual inversion: discovering scene representations and 3D view control in 2D diffusion models
:star:code - SparseCraft: Few-Shot Neural Reconstruction through Stereopsis Guided Geometric Linearization
:star:code - 3D Congealing: 3D-Aware Image Alignment in the Wild
:house:project - ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition
:house:project - BAGS: Blur Agnostic Gaussian Splatting through Multi-Scale Kernel Modeling
:star:code - Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy
- DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly
- An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes
:house:project - Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
:house:project - Diffusion Model is a Good Pose Estimator from 3D RF-Vision
:house:project - Nuvo: Neural UV Mapping for Unruly 3D Representations
:house:project - MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps
:house:project - MinD-3D: Reconstruct High-quality 3D objects in Human Brain
:house:project - UpFusion: Novel View Diffusion from Unposed Sparse View Observations
:star:code - MVS
- 3D Visual Grounding
- Empowering 3D Visual Grounding with Reasoning Capabilities
:house:project - Multi-branch Collaborative Learning Network for 3D Visual Grounding
:star:code
:thumbsup:Acc@0.5 on 3DREC improves by 3.27% and mIoU on 3DRES improves by 5.22% - ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
:star:code - Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding
:house:project
- Stereo Matching
- 3DGS
- GaussReg: Fast 3D Registration with Gaussian Splatting
- 3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting
- Texture-GS: Disentangle the Geometry and Texture for 3D Gaussian Splatting Editing
:star:code - Compact3D: Smaller and Faster Gaussian Splatting with Vector Quantization
:star:code
:house:project - CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization
:star:code
:house:project - End-to-End Rate-Distortion Optimized 3D Gaussian Representation
- Deblurring 3D Gaussian Splatting
:star:code
:house:project - Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting
:house:project - HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression
:house:project - On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy
:star:code - Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration
:star:code - MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo
:star:code
:house:project - Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting
:star:code - DGD: Dynamic 3D Gaussians Distillation
:house:project - EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS
:star:code - Revising Densification in Gaussian Splatting
- HO-Gaussian: Hybrid Optimization of 3D Gaussian Splatting for Urban Scenes
- RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes
- SWinGS: Sliding Windows for Dynamic 3D Gaussian Splatting
- VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors
:star:code - MesonGS: Post-training Compression of 3D Gaussians via Efficient Attribute Transformation
- MIGS: Multi-Identity Gaussian Splatting via Tensor Decomposition
:star:code - SAGS: Structure-Aware 3D Gaussian Splatting
:house:project - GGRt: Towards Generalizable 3D Gaussians without Pose Priors in Real-Time
:house:project - Gaussian in the wild: 3D Gaussian Splatting for Unconstrained Image Collections
:star:code - Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting
:star:code - WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians
:house:project - MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images
:star:code - GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
:house:project - DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting
:star:code
- Depth Estimation
- Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos
:star:code
:house:project - PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation
:star:code - Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry
- Revisit Self-supervision with Local Structure-from-Motion
- DoubleTake: Geometry Guided Depth Estimation
:star:code - Physics-informed Knowledge Transfer for Underwater Monocular Depth Estimation
- FutureDepth: Learning to Predict the Future Improves Video Depth Estimation
- ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion
:star:code - Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation
:star:code - Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions
:star:code
:star:code - High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior
- DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation
:star:code - SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models
- Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation
:house:project - GroCo: Ground Constraint for Metric Self-Supervised Monocular Depth
:star:code - M2Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation
:house:project - Improving Domain Generalization in Self-Supervised Monocular Depth Estimation via Stabilized Adversarial Training
- Depth Completion
- Deep Cost Ray Fusion for Sparse Depth Video Completion
- OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations
:star:code - AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation
:star:code - Sparse Beats Dense: Rethinking Supervision in Radar-Camera Depth Completion
:star:code
- Surface Reconstruction
- Surface Reconstruction from Gaussian Splatting via Novel Stereo Views
:house:project - SG-NeRF: Neural Surface Reconstruction with Scene Graph Optimization
:star:code - Sur2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images
- Improving Neural Surface Reconstruction with Feature Priors from Multi-View Image
- DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose
- Rethinking Directional Parameterization in Neural Implicit Surface Reconstruction
- PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects
:star:code - Surface Reconstruction for 3D Gaussian Splatting via Local Structural Hints
:house:project - EMIE-MAP: Large-Scale Road Surface Reconstruction Based on Explicit Mesh and Implicit Encoding
- GS2Mesh: Surface Reconstruction from Gaussian Splatting via Novel Stereo Views
:house:project - Improving Neural Surface Reconstruction with Feature Priors from Multi-View Images
:star:code - DynoSurf: Neural Deformation-based Temporally Consistent Dynamic Surface Reconstruction
:star:code - Surface-Centric Modeling for High-Fidelity Generalizable Neural Surface Reconstruction
:star:code - Parameterization-driven Neural Surface Reconstruction for Object-oriented Editing in Neural Rendering
:house:project
- 3D Reconstruction
- GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction
:house:project - InfoNorm: Mutual Information Shaping of Normals for Sparse-View Reconstruction
:star:code - fMRI-3D: A Comprehensive Dataset for Enhancing fMRI-based 3D Reconstruction
:star:code
:house:project
:house:project - Reconstruction and Simulation of Elastic Objects with Spring-Mass 3D Gaussians
:star:code - GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation
:star:code
:house:project - latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction
:star:code - MirrorGaussian: Reflecting 3D Gaussians for Reconstructing Mirror Reflections
:house:project - Resolving Scale Ambiguity in Multi-view 3D Reconstruction using Dual-Pixel Sensors
:star:code - 3D Reconstruction of Objects in Hands without Real World 3D Supervision
- Human Hair Reconstruction with Strand-Aligned 3D Gaussians
:house:project - SUP-NeRF: A Streamlined Unification of Pose Estimation and NeRF for Monocular 3D Object Reconstruction
:star:code - MVDiffHD: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction(https://github.com/Tangshitao/MVDiffusion_plusplus)
- NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation
:house:project - Sketch2Vox: Learning 3D Reconstruction from a Single Monocular Sketch Image
:sunflower:dataset - Analysis-by-Synthesis Transformer for Single-View 3D Reconstruction
:star:code
- 3D Shape
- Synchronous Diffusion for Unsupervised Smooth Non-Rigid 3D Shape Matching
- Transferable 3D Adversarial Shape Completion using Diffusion Models
- Self-supervised Shape Completion via Involution and Implicit Correspondences
- TetraDiffusion: Tetrahedral Diffusion Models for 3D Shape Generation
:star:code
:house:project - Learning Neural Deformation Representation for 4D Dynamic Shape Generation
- AWOL: Analysis WithOut synthesis using Language
- DiscoMatch: Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching
- Video Reconstruction
- 4D Reconstruction
- 3D Textured Shape
1.Other
- Dataset Growth
:star:code - Adaptive Parametric Activation
:star:code - Nonverbal Interaction Detection
:star:code - Situated Instruction Following
:house:project
:house:project - Optimizing Illuminant Estimation in Dual-Exposure HDR Imaging
- Unsupervised Exposure Correction
:star:code - Global Structure-from-Motion Revisited
:star:code - Fast Sprite Decomposition from Animated Graphics
:house:project - Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets
:house:project - MERLiN: Single-Shot Material Estimation and Relighting for Photometric Stereo
- Enhancing Vectorized Map Perception with Historical Rasterized Maps
:star:code - Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception
- DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation
:star:code - Weight Conditioning for Smooth Optimization of Neural Networks
- Bones Can't Be Triangles: Accurate and Efficient Vertebrae Keypoint Estimation through Collaborative Error Revision
:star:code - SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation
:house:project - VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks
:star:code - Global Counterfactual Directions
- Which Model Generated This Image? A Model-Agnostic Approach for Origin Attribution
- Computing the Lipschitz constant needed for fast scene recovery from CASSI measurements
- Pseudo-Labelling Should Be Aware of Disguising Channel Activations
- FMBoost: Boosting Latent Diffusion with Flow Matching
:star:code - Holodepth: Programmable Depth-Varying Projection via Computer-Generated Holography
:house:project - Adversarial Diffusion Distillation
- When and How do negative prompts take effect
- Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection
- SCOD: From Heuristics to Theory
- Unsupervised Representation Learning by Balanced Self Attention Matching
:star:code - DualBEV: Unifying Dual View Transformation with Probabilistic Correspondences
:star:code - Linking in Style: Understanding learned features in deep learning models
:star:code - CliffPhys: Camera-based Respiratory Measurement using Clifford Neural Networks
- Synthesizing Environment-Specific People in Photographs
:house:project - Implicit Steganography Beyond the Constraints of Modality
- Energy-Calibrated VAE with Test Time Free Lunch
- Debiasing surgeon: fantastic weights and how to find them
- SparseRadNet: Sparse Perception Neural Network on Subsampled Radar Data
- Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information
- Using My Artistic Style? You Must Obtain My Authorization
:star:code - IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception
:star:code - Convex Relaxations for Manifold-Valued Markov Random Fields with Approximation Guarantees
- Adapting to Shifting Correlations with Unlabeled Data Calibration
- On Spectral Properties of Gradient-based Explanation Methods
- O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation
- A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation
- Non-Line-of-Sight Estimation of Fast Human Motion with Slow Scanning Imagers
- Image Manipulation Detection With Implicit Neural Representation and Limited Supervision
- Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction
- GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning
:house:project - AddBiomechanics Dataset: Capturing the Physics of Human Motion at Scale
:house:project - Tight and Efficient Upper Bound on Spectral Norm of Convolutional Layers
- Learning Multimodal Latent Generative Models with Energy-Based Prior
- Hierarchical Conditioning of Diffusion Models Using Tree-of-Life for Studying Species Evolution
- Learning to Build by Building Your Own Instructions
:star:code - LNL+K: Enhancing Learning with Noisy Labels Through Noise Source Knowledge Integration
:star:code - Deep Online Probability Aggregation Clustering
- Camera Calibration using a Collimator System
:star:code - Asynchronous Bioplausible Neuron for Spiking Neural Networks for Event-Based Vision
- LITA: Language Instructed Temporal-Localization Assistant
:star:code - INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding
:house:project - Elucidating the Hierarchical Nature of Behavior with Masked Autoencoders
:star:code - MetaAT: Active Testing for Label-Efficient Evaluation of Dense Recognition Tasks
- Generalizable Symbolic Optimizer Learning
:star:code - Training A Secure Model against Data-Free Model Extraction
- EraseDraw: Learning to Insert Objects by Erasing Them from Images
- AdaDiff: Accelerating Diffusion Models through Step-Wise Adaptive Computation
- Imaging with Confidence: Uncertainty Quantification for High-dimensional Undersampled MR Images
:star:code - Learning to Make Keypoints Sub-Pixel Accurate
:star:code - Explorative Inbetweening of Time and Space
:house:project - Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training
:star:code - Instant Uncertainty Calibration of NeRFs Using a Meta-Calibrator
:star:code - Improving Robustness to Model Inversion Attacks via Sparse Coding Architectures
- Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort
- Neural Poisson Solver: A Universal and Continuous Framework for Natural Signal Blending
- Augmented Neural Fine-tuning for Efficient Backdoor Purification
- REDIR: Refocus-free Event-based De-occlusion Image Reconstruction
- Comprehensive Attribution: Inherently Explainable Vision Model with Feature Detector
:star:code - Pre-trained Visual Dynamics Representations for Efficient Policy Learning
- MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning
:house:project - MARs: Multi-view Attention Regularizations for Patch-based Feature Recognition of Space Terrain
:house:project - Diff-Reg: Diffusion Model in Doubly Stochastic Matrix Space for Registration Problem
:star:code - Hetecooper: Feature Collaboration Graph for Heterogeneous Collaborative Perception
- FARSE-CNN: Fully Asynchronous, Recurrent and Sparse Event-Based CNN
:star:code - Unmasking Bias in Diffusion Model Training
:star:code - Cross-Input Certified Training for Universal Perturbations
- Investigating Style Similarity in Diffusion Models
- Delving into Adversarial Robustness on Document Tampering Localization
:star:code - AMD: Automatic Multi-step Distillation of Large-scale Vision Models
- Learning Scalable Model Soup on a Single GPU: An Efficient Subspace Training Strategy
:star:code - JDT3D: Addressing the Gaps in LiDAR-Based Tracking-by-Attention
:star:code - SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
- Adaptive Annealing for Robust Averaging
- Lost in Translation: Modern Neural Networks Still Struggle With Small Realistic Image Transformations
- Generalizing to Unseen Domains via Text-guided Augmentation
- MO-EMT-NAS: Multi-Objective Continuous Transfer of Architectural Knowledge Between Tasks from Different Datasets
- Learning a Dynamic Privacy-preserving Camera Robust to Inversion Attacks
- MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery
- CadVLM: Bridging Language and Vision in the Generation of Parametric CAD Sketches
- Towards Image Ambient Lighting Normalization
- Synthesizing Time-varying BRDFs via Latent Space
- HoloADMM: High-Quality Holographic Complex Field Recovery
- Fundamental Matrix Estimation Using Relative Depths
- MTaDCS: Moving Trace and Feature Density-based Confidence Sample Selection under Label Noise
:star:code - CipherDM: Secure Three-Party Inference for Diffusion Model Sampling
:star:code - Weighted Ensemble Models Are Strong Continual Learners
:star:code - Learning Equilibrium Transformation for Gamut Expansion and Color Restoration
:star:code - Implicit Neural Models to Extract Heart Rate from Video
:house:project - Learning Quantized Adaptive Conditions for Diffusion Models
- High-Fidelity Modeling of Generalizable Wrinkle Deformation
- Efficient Learning of Event-based Dense Representation using Hierarchical Memories with Adaptive Update
- SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow
:star:code - DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation
:star:code - PosterLlama: Bridging Design Ability of Langauge Model to Content-Aware Layout Generation
- Integration of Global and Local Representations for Fine-grained Cross-modal Alignment
- Veil Privacy on Visual Data: Concealing Privacy for Humans, Unveiling for DNNs
- A high-quality robust diffusion framework for corrupted dataset
:star:code - FRDiff: Feature Reuse for Universal Training-free Acceleration of Diffusion Models
:star:code - Leveraging Imperfect Restoration for Data Availability Attack
:star:code - Oulu Remote-photoplethysmography Physical Domain Attacks Database (ORPDAD)
:star:code - Spiking Wavelet Transformer
:star:code - Hypernetworks for Generalizable BRDF Representation
:house:project - Solving the inverse problem of microscopy deconvolution with a residual Beylkin-Coifman-Rokhlin neural network
- Photon Inhibition for Energy-Efficient Single-Photon Imaging
:house:project - RANRAC: Robust Neural Scene Representations via Random Ray Consensus
:house:project - Characterizing Model Robustness via Natural Input Gradients
- Emerging Property of Masked Token for Effective Pre-training
- SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians
- Curved Diffusion: A Generative Model With Optical Geometry Control
:house:project - Mini-Splatting: Representing Scenes with a Constrained Number of Gaussians
:star:code - Skeleton Recall Loss for Connectivity Conserving and Resource Efficient Segmentation of Thin Tubular Structures
:star:code - RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
- Rethinking Fast Adversarial Training: A Splitting Technique To Overcome Catastrophic Overfitting
- Optimization-based Uncertainty Attribution Via Learning Informative Perturbations
- CPT-VR: Improving Surface Rendering via Closest Point Transform with View-Reflection Appearance
- Think before Placement: Common Sense Enhanced Transformer for Object Placement
- Efficient Bias Mitigation Without Privileged Information
- Region-Native Visual Tokenization
:star:code - DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting
:star:code - Efficient Neural Video Representation with Temporally Coherent Modulation
- Made to Order: Discovering monotonic temporal changes via self-supervised video ordering
:star:code - Concise Plane Arrangements for Low-Poly Surface and Volume Modelling
:star:code - ViPer: Visual Personalization of Generative Models via Individual Preference Learning
:star:code - How Far Can a 1-Pixel Camera Go? Solving Vision Tasks using Photoreceptors and Computationally Designed Visual Morphology
- Watching it in Dark: A Target-aware Representation Learning Framework for High-Level Vision Tasks in Low Illumination
:star:code - 3R-INN: How to be climate friendly while consuming/delivering videos
- Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks
:star:code - Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge
:star:code - Idling Neurons, Appropriately Lenient Workload During Fine-tuning Leads to Better Generalization
- ConDense: Consistent 2D-3D Pre-training for Dense and Sparse Features from Multi-View Images
- Tokenize Anything via Prompting
:star:code
🤗huggingface - Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation
:star:code - Long-CLIP: Unlocking the Long-Text Capability of CLIP
:star:code - Dolfin: Diffusion Layout Transformers without Autoencoder
- Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models
:star:code - Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation
:star:code - Zero-Shot Image Feature Consensus with Deep Functional Maps
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
:house:project - Scissorhands: Scrub Data Influence via Connection Sensitivity in Networks
:star:code - FuseTeacher: Modality-fused Encoders are Strong Vision Supervisors
:star:code - SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
:house:project - CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems
:star:code - FedRA: A Random Allocation Strategy for Federated Tuning to Unleash the Power of Heterogeneous Clients
:star:code - Uncertainty Calibration with Energy Based Instance-wise Scaling in the Wild Dataset
:star:code - Learning to Enhance Aperture Phasor Field for Non-Line-of-Sight Imaging
:star:code - UniFS: Universal Few-shot Instance Perception with Point Representations
:star:code - Combining Generative and Geometry Priors for Wide-Angle Portrait Correction
:star:code - FlashTex: Fast Relightable Mesh Texturing with LightControlNet
:house:project (relighting) - Consistent 3D Line Mapping
:star:code - RSL-BA: Rolling Shutter Line Bundle Adjustment
- Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning
- EAS-SNN: End-to-End Adaptive Sampling and Representation for Event-based Detection with Recurrent Spiking Neural Networks
:star:code - PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments
- Distributed Active Client Selection With Noisy Clients Using Model Association Scores
- Towards a Density Preserving Objective Function for Learning on Point Sets
- Task-Driven Uncertainty Quantification in Inverse Problems via Conformal Prediction
:star:code - SIGMA: Sinkhorn-Guided Masked Video Modeling
:house:project - LiDAR-Event Stereo Fusion with Hallucinations
:star:code
:house:project - Dual-Camera Smooth Zoom on Mobile Phones
:star:code
:house:project - Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion
- Agent Attention: On the Integration of Softmax and Linear Attention
:star:code - Deep Feature Surgery: Towards Accurate and Efficient Multi-Exit Networks
:star:code - Grid-Attention: Enhancing Computational Efficiency of Large Vision Models without Fine-Tuning
:star:code - Customized Generation Reimagined: Fidelity and Editability Harmonized
:star:code - Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction (video reconstruction)
- Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views
:house:project - Controlling the World by Sleight of Hand
- Probabilistic Weather Forecasting with Deterministic Guidance-based Diffusion Model
:star:code (probabilistic weather forecasting) - Representing Topological Self-Similarity Using Fractal Feature Maps for Accurate Segmentation of Tubular Structures
:star:code - Functional Transform-Based Low-Rank Tensor Factorization for Multi-Dimensional Data Recovery
- G3R: Gradient Guided Generalizable Reconstruction
:house:project - SAIR: Learning Semantic-aware Implicit Representation
- Spectral Subsurface Scattering for Material Classification
- Instance-dependent Noisy-label Learning with Graphical Model Based Noise-rate Estimation
- Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance
:star:code - A Direct Approach to Viewing Graph Solvability
- Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training
:star:code - Towards Multimodal Sentiment Analysis Debiasing via Bias Purification
- Improving Feature Stability during Upsampling -- Spectral Artifacts and the Importance of Spatial Context
- From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition
:star:code - Quantization-Friendly Winograd Transformations for Convolutional Neural Networks
- LetsMap: Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping
- M3DBench: Towards Omni 3D Assistant with Interleaved Multi-modal Instructions
- Leveraging Enhanced Queries of Point Sets for Vectorized Map Construction
:star:code - StereoGlue: Joint Feature Matching and Robust Estimation
:star:code - Factorized Diffusion: Perceptual Illusions by Noise Decomposition
:house:project - GIVT: Generative Infinite-Vocabulary Transformers
:star:code - Tiny Models are the Computational Saver for Large Models
- Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy
:star:code - SNeRV: Spectra-preserving Neural Representation for Video
:star:code - COMO: Compact Mapping and Odometry
:star:code - Multi-Sentence Grounding for Long-term Instructional Video
- Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer
:house:project - Exact Diffusion Inversion via Bidirectional Integration Approximation
:star:code - McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction
- Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density
:star:code - Dynamic Data Selection for Efficient SSL via Coarse-to-Fine Refinement
- ZeST: Zero-Shot Material Transfer from a Single Image
:star:code - PCF-Lift: Panoptic Lifting by Probabilistic Contrastive Fusion
- SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
:house:project - DragAPart: Learning a Part-Level Motion Prior for Articulated Objects
:house:project - Superpixel-informed Implicit Neural Representation for Multi-Dimensional Data
- Physics-Free Spectrally Multiplexed Photometric Stereo under Unknown Spectral Composition
- Robust Fitting on a Gate Quantum Computer
- Edge-Guided Fusion and Motion Augmentation for Event-Image Stereo
- Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization
:house:project - On the Vulnerability of Skip Connections to Model Inversion Attacks
:star:code - Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits
:star:code - GMM-IKRS: Gaussian Mixture Models for Interpretable Keypoint Refinement and Scoring
- Does Data-Efficient Generalization Exacerbate Bias in Foundation Models?
- InfMAE: A Foundation Model in The Infrared Modality (infrared)
- Teach CLIP to Develop a Number Sense for Ordinal Regression
:star:code - GlobalPointer: Large-Scale Plane Adjustment with Bi-Convex Relaxation
:star:code - ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders
- Scalar Function Topology Divergence: Comparing Topology of 3D Objects
- OneRestore: A Universal Restoration Framework for Composite Degradation
:star:code - RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion
- Binomial Self-compensation for Motion Error in Dynamic 3D Scanning
- Encapsulating Knowledge in One Prompt
:star:code - iMatching: Imperative Correspondence Learning
- An Adaptive Screen-Space Meshing Approach for Normal Integration
- Efficient Pre-training for Localized Instruction Generation of Procedural Videos
:star:code - Shape from Heat Conduction
- Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos
- Optimal Transport of Diverse Unsupervised Tasks for Robust Learning from Noisy Few-Shot Data
- Finding Visual Task Vectors
:star:code - Occupancy as Set of Points
:star:code - Learning to Robustly Reconstruct Dynamic Scenes from Low-light Spike Streams
- AID-AppEAL: Automatic Image Dataset and Algorithm for Content Appeal Enhancement and Assessment Labeling
:star:code - Retargeting Visual Data with Deformation Fields
- Delving Deep into Engagement Prediction of Short Videos
:star:code - Temporal-Mapping Photography for Event Cameras
:star:code - Six-Point Method for Multi-Camera Systems with Reduced Solution Space
:star:code - BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion
:star:code - Physical-Based Event Camera Simulator
:star:code - REFRAME: Reflective Surface Real-Time Rendering for Mobile Devices
:star:code - Self-Training Room Layout via Geometry-aware Ray-casting
- Closed-Loop Unsupervised Representation Disentanglement with β-VAE Distillation and Diffusion Probabilistic Feedback
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
:star:code - EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
:star:code - Where am I? Scene Retrieval with Language
- Event Camera Data Dense Pre-training
- Unsqueeze [CLS] Bottleneck to Learn Rich Representations
:star:code - VeCLIP: Improving CLIP Training via Visual-enriched Captions
:star:code - Spike-Temporal Latent Representation for Energy-Efficient Event-to-Video Reconstruction
- Catastrophic Overfitting: A Potential Blessing in Disguise
- Diffusion Reward: Learning Rewards via Conditional Video Diffusion
:house:project - Data-to-Model Distillation: Data-Efficient Learning Framework
- Neural graphics texture compression supporting random access
- ReMatching: Low-Resolution Representations for Scalable Shape Correspondence
- EgoPet: Egomotion and Interaction Data from an Animal's Perspective
:house:project - This Probably Looks Exactly Like That: An Invertible Prototypical Network
:star:code - Revisiting Feature Disentanglement Strategy in Diffusion Training and Breaking Conditional Independence Assumption in Sampling
- ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
:star:code (attribute recognition) - Stream Query Denoising for Vectorized HD-Map Construction
- PartCraft: Crafting Creative Objects by Parts
:star:code - ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
:star:code - Dropout Mixture Low-Rank Adaptation for Visual Parameters-Efficient Fine-Tuning
- UNIC: Universal Classification Models via Multi-teacher Distillation
- Efficient Training of Spiking Neural Networks with Multi-Parallel Implicit Stream Architecture
:star:code (spiking neural networks) - Visual Prompting via Partial Optimal Transport
- E3V-K5: An Authentic Benchmark for Redefining Video-Based Energy Expenditure Estimation
:star:code - Understanding Physical Dynamics with Counterfactual World Modeling
:house:project - 4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation
:house:project - Revisiting Calibration of Wide-Angle Radially Symmetric Cameras
:star:code (camera calibration) - STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians
:star:code - Synchronization of Projective Transformations
- UniCal: Unified Neural Sensor Calibration
:house:project - Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs
:star:code - Robust Incremental Structure-from-Motion with Hybrid Features
- Any2Point: Empowering Any-modality Transformers for Efficient 3D Understanding
:star:code - CompGS: Smaller and Faster Gaussian Splatting with Vector Quantization
:star:code - Multiscale Graph Texture Network
:star:code - Enhancing Optimization Robustness in 1-bit Neural Networks through Stochastic Sign Descent
:star:code - Domain Reduction Strategy for Non-Line-of-Sight Imaging
:star:code - BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation
:star:code - Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data
:star:code - Model Stock: All we need is just a few fine-tuned models
:star:code - DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation
- DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration
:star:code - SLIM: Spuriousness Mitigation with Minimal Human Annotations
:star:code - Scaling Backwards: Minimal Synthetic Pre-training?
:star:code - On the Evaluation Consistency of Attribution-based Explanations
:star:code - GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation
:star:code - OvSW: Overcoming Silent Weights for Accurate Binary Neural Networks
:star:code - SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding
:star:code - ReGround: Improving Textual and Spatial Grounding at No Cost
:house:project - ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance
:house:project - WHAC: World-grounded Humans and Cameras
:house:project - Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy
:star:code - Neural Metamorphosis
:house:project - Light-in-Flight for a World-in-Motion
- Learning with Unmasked Tokens Drives Stronger Vision Learners
:star:code - PSALM: Pixelwise Segmentation with Large Multi-modal Model
:star:code - InsMapper: Exploring Inner-instance Information for Vectorized HD Mapping
:star:code - The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
:star:code - Refine, Discriminate and Align: Stealing Encoders via Sample-Wise Prototypes and Multi-Relational Extraction
- Multi-Task Domain Adaptation for Language Grounding with 3D Objects
:house:project - QueryCDR: Query-based Controllable Distortion Rectification Network for Fisheye Images
:star:code (fisheye images) - BAMM: Bidirectional Autoregressive Motion Model
:house:project - Handling The Non-Smooth Challenge in Tensor SVD: A Multi-Objective Tensor Recovery Framework
- Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics
- RPBG: Towards Robust Neural Point-based Graphics in the Wild
:star:code - Memory-Efficient Fine-Tuning for Quantized Diffusion Model
:star:code - Towards Architecture-Agnostic Untrained Network Priors for Image Reconstruction with Frequency Regularization
:star:code - Similarity of Neural Architectures using Adversarial Attack Transferability
- NeuroPictor: Refining fMRI-to-Image Reconstruction via Multi-individual Pretraining and Multi-level Modulation
:house:project - Robustness Preserving Fine-tuning using Neuron Importance
- A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures
- Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation
:star:code - FTBC: Forward Temporal Bias Correction for Optimizing ANN-SNN Conversion
- Test-time Model Adaptation for Image Reconstruction Using Self-supervised Adaptive Layers (image reconstruction)
- Unveiling Privacy Risks in Stochastic Neural Networks Training: Effective Image Reconstruction from Gradients (image reconstruction)
- CrossScore: A Multi-View Approach to Image Evaluation and Scoring
- ADMap: Anti-disturbance Framework for Vectorized HD Map Construction
- GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting
:star:code - PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation
:star:code - ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
:star:code - DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment
:star:code - UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation (visual-inertial odometry)
- Real-data-driven 2000 FPS Color Video from Mosaicked Chromatic Spikes
🤗huggingface - RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF
:house:project - LaRa: Efficient Large-Baseline Radiance Fields
:star:code - Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement
:house:project - ELSE: Efficient Deep Neural Network Inference through Line-based Sparsity Exploration
- Open-World Dynamic Prompt and Continual Visual Representation Learning
- GeoCalib: Learning Single-image Calibration with Geometric Optimization
:star:code - LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation
:star:code - Alignist: CAD-Informed Orientation Distribution Estimation by Fusing Shape and Correspondences
- Weakly-supervised Camera Localization by Ground-to-satellite Image Registration
- Learning Neural Volumetric Pose Features for Camera Localization
:house:project - SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments
:star:code - DECOLLAGE: 3D Detailization by Controllable, Localized, and Learned Geometry Enhancement
:star:code - Event-based Mosaicing Bundle Adjustment
:star:code - Reprojection Errors as Prompts for Efficient Scene Coordinate Regression
- Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor
- AMEGO: Active Memory from long EGOcentric videos
:star:code - Vista3D: Unravel the 3D Darkside of a Single Image
:star:code - Agglomerative Token Clustering
:house:project - Formula-Supervised Visual-Geometric Pre-training
:star:code - Interpretability-Guided Test-Time Adversarial Defense
:star:code - Towards Model-Agnostic Dataset Condensation by Heterogeneous Models
:star:code - MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views
:star:code - Intrinsic Single-Image HDR Reconstruction
- Disentangled Generation and Aggregation for Robust Radiance Fields
:star:code - Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection
- Commonly Interesting Images
- Sequential Representation Learning via Static-Dynamic Conditional Disentanglement
- QuasiSim: Parameterized Quasi-Physical Simulators for Dexterous Manipulations Transfer
:star:code
:house:project - Dataset Distillation by Automatic Training Trajectories
:star:code - Neural Graphics Texture Compression Supporting Random Access
- LookupViT: Compressing visual information to a limited number of tokens
- Bridging the Gap: Studio-like Avatar Creation from a Monocular Phone Capture
:house:project - Generating 3D House Wireframes with Semantics
:star:code
:house:project - Flying with Photons: Rendering Novel Views of Propagating Light
:star:code - Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time
:star:code - MobileNetV4: Universal Models for the Mobile Ecosystem
- Gravity-aligned Rotation Averaging with Circular Regression
:star:code - Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
:house:project - HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts
- DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects
:house:project - TrajPrompt: Aligning Color Trajectory with Vision-Language Representations
:star:code - DomainFusion: Generalizing To Unseen Domains with Latent Diffusion Models
- Toward Tiny and High-quality Facial Makeup with Data Amplify Learning
:star:code - Multi-Label Cluster Discrimination for Visual Representation Learning
- Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective
- MemBN: Robust Test-Time Adaptation via Batch Norm with Statistics Memory
- SeiT++: Masked Token Modeling Improves Storage-efficient Training
:star:code - MagicEraser: Erasing Any Objects via Semantics-Aware Control
- Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation
:house:project - A Cephalometric Landmark Regression Method based on Dual-encoder for High-resolution X-ray Image
:star:code - Resilience of Entropy Model in Distributed Neural Networks
:star:code - GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image
:house:project - MotionChain: Conversational Motion Controllers via Multimodal Prompts
- MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion
:house:project - Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models
:star:code - Brain Netflix: Scaling Data to Reconstruct Videos from Brain Signals
- Tensorial template matching for fast cross-correlation with rotations and its application for tomography
- SelfGeo: Self-supervised and Geodesic-consistent Estimation of Keypoints on Deformable Shapes
:star:code - Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts
- Motion and Structure from Event-based Normal Flow
:house:project - SENC: Handling Self-collision in Neural Cloth Simulation
- Distribution Alignment for Fully Test-Time Adaptation with Dynamic Online Data Streams
- Animate Your Motion: Turning Still Images into Dynamic Videos
:house:project - Gaussian Splatting on the Move: Blur and Rolling Shutter Compensation for Natural Camera Motion
:star:code
:house:project - Relightable Neural Actor with Intrinsic Decomposition and Pose Control
:house:project - Layer-Wise Relevance Propagation with Conservation Property for ResNet
:house:project - Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance
:house:project - SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images
- ViG-Bias: Visually Grounded Bias Discovery and Mitigation
- DOCCI: Descriptions of Connected and Contrasting Images
:house:project - Geometry Fidelity for Spherical Images
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache
:star:code - Mew: Multiplexed Immunofluorescence Image Analysis through an Efficient Multiplex Network
:star:code - Topology-Preserving Downsampling of Binary Images
- Quality Assured: Rethinking Annotation Strategies in Imaging AI
- Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models
:house:project - Data Collection-free Masked Video Modeling
- Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders
:star:code - Möbius Transform for Mitigating Perspective Distortions in Representation Learning
:house:project - Foster Adaptivity and Balance in Learning with Noisy Labels
:star:code
Addresses noisy labels in deep learning efficiently without prior knowledge, improving model performance and robustness. - Solving Motion Planning Tasks with a Scalable Generative Model
:star:code - 4D Contrastive Superflows are Dense 3D Representation Learners
:star:code - Learning to Complement and to Defer to Multiple Users
:star:code - Shedding More Light on Robust Classifiers under the lens of Energy-based Models
:star:code - TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data
:star:code - UMBRAE: Unified Multimodal Brain Decoding
:star:code
:house:project - Trainable Highly-expressive Activation Functions
:star:code - Controllable Navigation Instruction Generation with Chain of Thought Prompting
- Recursive Visual Programming
:star:code - Reshaping the Online Data Buffering and Organizing Mechanism for Continual Test-Time Adaptation
:star:code - Imaging Interiors: An Implicit Solution to Electromagnetic Inverse Scattering Problems
:star:code - The Gaussian Discriminant Variational Autoencoder (GdVAE): A Self-Explainable Model with Counterfactual Explanations
:star:code
:house:project - HyperSpaceX: Radial and Angular Exploration of HyperSpherical Dimensions
- DataDream: Few-shot Guided Dataset Generation
:star:code - Aligning Neuronal Coding of Dynamic Visual Scenes with Foundation Vision Models
:star:code
:house:project - Towards Robust Event-based Networks for Nighttime via Unpaired Day-to-Night Event Translation
:star:code - FRI-Net: Floorplan Reconstruction via Room-wise Implicit Representation
- Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos
:star:code - Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems
:star:code - Pathformer3D: A 3D Scanpath Transformer for 360° Images
:star:code - Kinetic Typography Diffusion Model
:star:code - PolyRoom: Room-aware Transformer for Floorplan Reconstruction
:star:code - Tree-D Fusion: Simulation-Ready Tree Dataset from Single Images with Diffusion Priors
- Multiscale Sliced Wasserstein Distances as Perceptual Color Difference Measures
:star:code - Augmented Neural Fine-Tuning for Efficient Backdoor Purification
- Improving Hyperbolic Representations via Gromov-Wasserstein Regularization
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion
:star:code - Efficient Training with Denoised Neural Weights
:star:code - SpaceJAM: a Lightweight and Regularization-free Method for Fast Joint Alignment of Images
:star:code - Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery
:star:code - SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model
:house:project - Multi-modal Relation Distillation for Unified 3D Representation Learning
- Continual Learning for Remote Physiological Measurement: Minimize Forgetting and Simplify Inference
:star:code - TreeSBA: Tree-Transformer for Self-Supervised Sequential Brick Assembly
- SIGMA: Sinkhorn-Guided Masked Video Modeling
:star:code - Attention Beats Linear for Fast Implicit Neural Representation Generation
:star:code - Text2Place: Affordance-aware Text Guided Human Placement
:star:code - RoadPainter: Points Are Ideal Navigators for Topology transformER
- STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay
:star:code - Differentiable Convex Polyhedra Optimization from Multi-view Images
:star:code - A Diffusion Model for Simulation Ready Coronary Anatomy with Morpho-skeletal Control
:star:code - Power Variable Projection for Initialization-Free Large-Scale Bundle Adjustment
- SINDER: Repairing the Singular Defects of DINOv2
:star:code - SHIC: Shape-Image Correspondences with no Keypoint Supervision
:house:project - Semicalibrated Relative Pose from an Affine Correspondence and Monodepth (semi-calibrated relative pose)
- Scalable Group Choreography via Variational Phase Manifold Learning
- Deep Companion Learning: Enhancing Generalization Through Historical Consistency
- Revisit Event Generation Model: Self-Supervised Learning of Event-to-Video Reconstruction with Implicit Neural Representations
:star:code - Neural Surface Detection for Unsigned Distance Fields
- Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas
:star:code - Platypus: A Generalized Specialist Model for Reading Text in Various Forms
:star:code - RAW-Adapter: Adapting Pre-trained Visual Model to Camera RAW Images
:star:code - Learning Differentially Private Diffusion Models via Stochastic Adversarial Distillation
- Affine steerers for structured keypoint description
- SeA: Semantic Adversarial Augmentation for Last Layer Features from Unsupervised Representation Learning
:star:code - MMBench: Is Your Multi-modal Model an All-around Player?
:star:code
:house:project - DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs
:star:code - PreLAR: World Model Pre-training with Learnable Action Representation
:star:code - Dataset Enhancement with Instance-Level Augmentations
:star:code - Non-parametric Sensor Noise Modeling and Synthesis
- Stripe Observation Guided Inference Cost-free Attention Mechanism
:star:code - Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation
- Object-Aware NIR-to-Visible Translation
:star:code
:sunflower:dataset (low-level vision)
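2020 paper collections by category:
↘️CVPR-2020-Papers ↘️ECCV-2020-Papers
2021 paper collections by category:
↘️ICCV-2021-Papers ↘️CVPR-2021-Papers
2022 paper collections by category:
↘️CVPR-2022-Papers ↘️WACV-2022-Papers ↘️ECCV-2022-Papers
2023 paper collections by category:
↘️CVPR-2023-Papers ↘️WACV-2023-Papers ↘️ICCV-2023-Papers
Scan the QR code to add CV君 on WeChat (mention: CVPR) and join the WeChat discussion group: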