Computer Vision and Pattern Recognition 87
♻ ☆ ActiveGAMER: Active GAussian Mapping through Efficient Rendering CVPR2025
We introduce ActiveGAMER, an active mapping system that utilizes 3D Gaussian
Splatting (3DGS) to achieve high-quality, real-time scene mapping and
exploration. Unlike traditional NeRF-based methods, which are computationally
demanding and restrict active mapping performance, our approach leverages the
efficient rendering capabilities of 3DGS, allowing effective and efficient
exploration in complex environments. The core of our system is a
rendering-based information gain module that dynamically identifies the most
informative viewpoints for next-best-view planning, enhancing both geometric
and photometric reconstruction accuracy. ActiveGAMER also integrates a
carefully balanced framework, combining coarse-to-fine exploration,
post-refinement, and a global-local keyframe selection strategy to maximize
reconstruction completeness and fidelity. Our system autonomously explores and
reconstructs environments with state-of-the-art geometric and photometric
accuracy and completeness, significantly surpassing existing approaches in both
aspects. Extensive evaluations on benchmark datasets such as Replica and MP3D
highlight ActiveGAMER's effectiveness in active mapping tasks.
comment: Accepted to CVPR2025
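For intuition, here is a minimal sketch of next-best-view selection driven by a
rendering-based information-gain score, in the spirit of the module described
above. The renderer interface (`render_opacity_fn`) and the gain definition
(fraction of weakly covered pixels) are illustrative assumptions, not
ActiveGAMER's actual implementation.

```python
# Illustrative next-best-view (NBV) selection by a rendering-based
# information-gain score. The renderer and the exact gain definition are
# placeholders, not ActiveGAMER's released code.
import numpy as np

def information_gain(rendered_opacity: np.ndarray) -> float:
    # Treat pixels with low accumulated opacity as "unobserved": candidate
    # views that would reveal more unobserved area score higher.
    return float((rendered_opacity < 0.5).mean())

def select_next_best_view(candidate_poses, render_opacity_fn):
    # render_opacity_fn(pose) -> HxW accumulated-opacity map (hypothetical)
    gains = [information_gain(render_opacity_fn(p)) for p in candidate_poses]
    best = int(np.argmax(gains))
    return candidate_poses[best], gains[best]
```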
♻ ☆ NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models
Cognitive textual and visual reasoning tasks, including puzzles, series, and
analogies, demand the ability to quickly reason, decipher, and evaluate
patterns both textually and spatially. Due to extensive training on vast
amounts of human-curated data, LLMs and VLMs excel at common-sense reasoning
tasks; however, they still struggle with more complex reasoning that demands deeper
cognitive understanding. We introduce NTSEBench, a new dataset designed to
evaluate cognitive multi-modal reasoning and problem-solving skills of large
models. The dataset contains 2,728 multiple-choice questions, accompanied by a
total of 4,642 images, categorized into 26 different types. These questions are
drawn from the nationwide NTSE examination in India and feature a mix of visual
and textual general aptitude challenges, designed to assess intelligence and
critical thinking skills beyond mere rote learning. We establish baselines on
the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison
between open-source and proprietary models, we propose four distinct modeling
strategies to handle different modalities -- text and images -- in the dataset
instances.
comment: 28 pages, 3 figures, 12 tables
♻ ☆ Rehearsal-free Federated Domain-incremental Learning IEEE
We introduce a rehearsal-free federated domain-incremental learning
framework, RefFiL, based on a global prompt-sharing paradigm to alleviate
catastrophic forgetting challenges in federated domain-incremental learning,
where unseen domains are continually learned. Typical methods for mitigating
forgetting, such as the use of additional datasets and the retention of private
data from earlier tasks, are not viable in federated learning (FL) due to
devices' limited resources. Our method, RefFiL, addresses this by learning
domain-invariant knowledge and incorporating various domain-specific prompts
from the domains represented by different FL participants. A key feature of
RefFiL is the generation of local fine-grained prompts by our domain adaptive
prompt generator, which effectively learns from local domain knowledge while
maintaining distinctive boundaries on a global scale. We also introduce a
domain-specific prompt contrastive learning loss that differentiates between
locally generated prompts and those from other domains, enhancing RefFiL's
precision and effectiveness. Compared to existing methods, RefFiL significantly
alleviates catastrophic forgetting without requiring extra memory space, making
it ideal for privacy-sensitive and resource-constrained devices.
comment: Camera ready version. Accepted by the IEEE ICDCS, 2025
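As a rough illustration of the domain-specific prompt contrastive idea, the
sketch below pulls locally generated prompts toward a local domain anchor and
pushes them away from prompts shared by other participants. All names, shapes,
and the InfoNCE-style form are assumptions, not RefFiL's released loss.

```python
# A minimal sketch of a domain-specific prompt contrastive loss in the spirit
# of RefFiL: locally generated prompts are pulled toward this client's domain
# anchor and pushed away from prompts received from other FL participants.
import torch
import torch.nn.functional as F

def prompt_contrastive_loss(local_prompts, own_anchor, other_prompts, tau=0.1):
    # local_prompts: (N, D) fine-grained prompts generated on this client
    # own_anchor:    (D,)   this client's domain-level prompt
    # other_prompts: (M, D) prompts received from other domains
    z = F.normalize(local_prompts, dim=-1)
    pos = F.normalize(own_anchor, dim=-1).unsqueeze(0)            # (1, D)
    neg = F.normalize(other_prompts, dim=-1)                      # (M, D)
    logits = torch.cat([z @ pos.t(), z @ neg.t()], dim=1) / tau   # (N, 1+M)
    labels = torch.zeros(len(z), dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, labels)                        # positive at index 0
```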
♻ ☆ DetailGen3D: Generative 3D Geometry Enhancement via Data-Dependent Flow
Ken Deng, Yuan-Chen Guo, Jingxiang Sun, Zi-Xin Zou, Yangguang Li, Xin Cai, Yan-Pei Cao, Yebin Liu, Ding Liang
Modern 3D generation methods can rapidly create shapes from sparse or single
views, but their outputs often lack geometric detail due to computational
constraints. We present DetailGen3D, a generative approach specifically
designed to enhance these generated 3D shapes. Our key insight is to model the
coarse-to-fine transformation directly through data-dependent flows in latent
space, avoiding the computational overhead of large-scale 3D generative models.
We introduce a token matching strategy that ensures accurate spatial
correspondence during refinement, enabling local detail synthesis while
preserving global structure. By carefully designing our training data to match
the characteristics of synthesized coarse shapes, our method can effectively
enhance shapes produced by various 3D generation and reconstruction approaches,
from single-view to sparse multi-view inputs. Extensive experiments demonstrate
that DetailGen3D achieves high-fidelity geometric detail synthesis while
maintaining efficiency in training.
♻ ☆ IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations ICLR 2025
Capturing geometric and material information from images remains a
fundamental challenge in computer vision and graphics. Traditional
optimization-based methods often require hours of computational time to
reconstruct geometry, material properties, and environmental lighting from
dense multi-view inputs, while still struggling with inherent ambiguities
between lighting and material. On the other hand, learning-based approaches
leverage rich material priors from existing 3D object datasets but face
challenges with maintaining multi-view consistency. In this paper, we introduce
IDArb, a diffusion-based model designed to perform intrinsic decomposition on
an arbitrary number of images under varying illuminations. Our method achieves
accurate and multi-view consistent estimation on surface normals and material
properties. This is made possible through a novel cross-view, cross-domain
attention module and an illumination-augmented, view-adaptive training
strategy. Additionally, we introduce ARB-Objaverse, a new dataset that provides
large-scale multi-view intrinsic data and renderings under diverse lighting
conditions, supporting robust training. Extensive experiments demonstrate that
IDArb outperforms state-of-the-art methods both qualitatively and
quantitatively. Moreover, our approach facilitates a range of downstream tasks,
including single-image relighting, photometric stereo, and 3D reconstruction,
highlighting its broad applications in realistic 3D content creation.
comment: ICLR 2025. Project Page: https://lizb6626.github.io/IDArb/
♻ ☆ Oriented Object Detection in Optical Remote Sensing Images using Deep Learning: A Survey
Oriented object detection is one of the most fundamental and challenging
tasks in remote sensing, aiming to locate and classify objects with arbitrary
orientations. Recent advancements in deep learning have significantly enhanced
the capabilities of oriented object detection. Given the rapid development of
this field, this paper presents a comprehensive survey of recent advances in
oriented object detection. To be specific, we begin by tracing the technical
evolution from horizontal object detection to oriented object detection and
highlighting the specific challenges, including feature misalignment, spatial
misalignment, and oriented bounding box (OBB) regression problems.
Subsequently, we further categorize existing methods by detection framework,
OBB regression, and feature representation, and provide an in-depth discussion
on how these approaches address the above challenges. In addition, we cover
several publicly available datasets and evaluation protocols. Furthermore, we
provide a comprehensive comparison and analysis of state-of-the-art methods.
Toward the end of this paper, we identify several future directions for
oriented object detection.
♻ ☆ Mixture of Experts Made Personalized: Federated Prompt Learning for Vision-Language Models ICLR 2025
Federated prompt learning brings the robust representation learning ability of
CLIP-like Vision-Language Models (VLMs) to federated learning through prompt
learning. However, current federated prompt learning methods are
habitually restricted to the traditional FL paradigm, where the participating
clients are generally only allowed to download a single globally aggregated
model from the server. While justifiable for training full-sized models under
federated settings, in this work, we argue that this paradigm is ill-suited for
lightweight prompts. By facilitating the clients to download multiple
pre-aggregated prompts as fixed non-local experts, we propose Personalized
Federated Mixture of Adaptive Prompts (pFedMoAP), a novel FL framework that
personalizes the prompt learning process through the lens of Mixture of Experts
(MoE). pFedMoAP implements a local attention-based gating network that learns
to generate enhanced text features for better alignment with local image data,
benefiting from both local and downloaded non-local adaptive prompt experts.
Extensive experiments on 9 datasets under various federated settings
demonstrate the efficacy of the proposed pFedMoAP algorithm. The code is
available at https://github.com/ljaiverson/pFedMoAP.
comment: ICLR 2025
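A minimal sketch of the local attention-based gating described above: the
local image feature attends over text features produced by the local and
downloaded non-local prompt experts. The module name, head count, and tensor
shapes are assumptions for illustration, not pFedMoAP's released code.

```python
# Sketch of attention-based gating over prompt experts (MoE view): the local
# image feature is the query; expert text features are keys and values.
import torch
import torch.nn as nn

class PromptExpertGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # dim must be divisible by num_heads; both are illustrative choices.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, image_feat, expert_text_feats):
        # image_feat:        (B, D)    local image features (query)
        # expert_text_feats: (B, E, D) text features from local + non-local experts
        q = image_feat.unsqueeze(1)                                # (B, 1, D)
        mixed, weights = self.attn(q, expert_text_feats, expert_text_feats)
        return mixed.squeeze(1), weights   # gated text feature and expert weights
```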
♻ ☆ HCMA-UNet: A Hybrid CNN-Mamba UNet with Axial Self-Attention for Efficient Breast Cancer Segmentation
Breast cancer lesion segmentation in DCE-MRI remains challenging due to
heterogeneous tumor morphology and indistinct boundaries. To address these
challenges, this study proposes a novel hybrid segmentation network, HCMA-UNet,
for lesion segmentation of breast cancer. Our network consists of a lightweight
CNN backbone and a Multi-view Axial Self-Attention Mamba (MISM) module. The
MISM module integrates a Visual State Space Block (VSSB) with an Axial
Self-Attention (ASA) mechanism, reducing parameters through an Asymmetric Split
Channel (ASC) strategy to achieve efficient tri-directional feature extraction.
Our lightweight model achieves superior performance with 2.87M parameters and
126.44 GFLOPs. A Feature-guided Region-aware loss function (FRLoss) is proposed
to enhance segmentation accuracy. Extensive experiments on one private and two
public DCE-MRI breast cancer datasets demonstrate that our approach achieves
state-of-the-art performance while maintaining computational efficiency. FRLoss
also exhibits good cross-architecture generalization capabilities. The source
code is available at https://github.com/Haoxuanli-Thu/HCMA-UNet.
♻ ☆ HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model IEEE
Di Wang, Meiqi Hu, Yao Jin, Yuchun Miao, Jiaqi Yang, Yichu Xu, Xiaolei Qin, Jiaqi Ma, Lingyu Sun, Chenxing Li, Chuan Fu, Hongruixuan Chen, Chengxi Han, Naoto Yokoya, Jing Zhang, Minqiang Xu, Lin Liu, Lefei Zhang, Chen Wu, Bo Du, Dacheng Tao, Liangpei Zhang
Accurate hyperspectral image (HSI) interpretation is critical for providing
valuable insights into various earth observation-related applications such as
urban planning, precision agriculture, and environmental monitoring. However,
existing HSI processing methods are predominantly task-specific and
scene-dependent, which severely limits their ability to transfer knowledge
across tasks and scenes, thereby reducing the practicality in real-world
applications. To address these challenges, we present HyperSIGMA, a vision
transformer-based foundation model that unifies HSI interpretation across tasks
and scenes, scalable to over one billion parameters. To overcome the spectral
and spatial redundancy inherent in HSIs, we introduce a novel sparse sampling
attention (SSA) mechanism, which effectively promotes the learning of diverse
contextual features and serves as the basic block of HyperSIGMA. HyperSIGMA
integrates spatial and spectral features using a specially designed spectral
enhancement module. In addition, we construct a large-scale hyperspectral
dataset, HyperGlobal-450K, for pre-training, which contains about 450K
hyperspectral images, significantly surpassing existing datasets in scale.
Extensive experiments on various high-level and low-level HSI tasks demonstrate
HyperSIGMA's versatility and superior representational capability compared to
current state-of-the-art methods. Moreover, HyperSIGMA shows significant
advantages in scalability, robustness, cross-modal transferring capability,
real-world applicability, and computational efficiency. The code and models
will be released at https://github.com/WHU-Sigma/HyperSIGMA.
comment: Accepted by IEEE TPAMI. Project website:
https://whu-sigma.github.io/HyperSIGMA
♻ ☆ Mind the GAP: Glimpse-based Active Perception improves generalization and sample efficiency of visual reasoning
Human capabilities in understanding visual relations are far superior to
those of AI systems, especially for previously unseen objects. For example,
while AI systems struggle to determine whether two such objects are visually
the same or different, humans can do so with ease. Active vision theories
postulate that the learning of visual relations is grounded in actions that we
take to fixate objects and their parts by moving our eyes. In particular, the
low-dimensional spatial information about the corresponding eye movements is
hypothesized to facilitate the representation of relations between different
image parts. Inspired by these theories, we develop a system equipped with a
novel Glimpse-based Active Perception (GAP) that sequentially glimpses at the
most salient regions of the input image and processes them at high resolution.
Importantly, our system leverages the locations stemming from the glimpsing
actions, along with the visual content around them, to represent relations
between different parts of the image. The results suggest that the GAP is
essential for extracting visual relations that go beyond the immediate visual
content. Our approach reaches state-of-the-art performance on several visual
reasoning tasks while being more sample-efficient and generalizing better to
out-of-distribution visual inputs than prior models.
comment: 10 pages of main text and 8 pages appendices
♻ ☆ RedMotion: Motion Prediction via Redundancy Reduction
We introduce RedMotion, a transformer model for motion prediction in
self-driving vehicles that learns environment representations via redundancy
reduction. Our first type of redundancy reduction is induced by an internal
transformer decoder and reduces a variable-sized set of local road environment
tokens, representing road graphs and agent data, to a fixed-sized global
embedding. The second type of redundancy reduction is obtained by
self-supervised learning and applies the redundancy reduction principle to
embeddings generated from augmented views of road environments. Our experiments
reveal that our representation learning approach outperforms PreTraM, Traj-MAE,
and GraphDINO in a semi-supervised setting. Moreover, RedMotion achieves
competitive results compared to HPTR or MTR++ in the Waymo Motion Prediction
Challenge. Our open-source implementation is available at:
https://github.com/kit-mrt/future-motion
comment: TMLR published version
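The self-supervised objective can be illustrated with a Barlow-Twins-style
redundancy-reduction loss on embeddings of two augmented views of the same
road environment; the exact formulation used by RedMotion may differ.

```python
# Barlow-Twins-style redundancy reduction between embeddings of two augmented
# views -- a sketch of the self-supervised principle, not RedMotion's loss.
import torch

def redundancy_reduction_loss(z_a, z_b, lam: float = 5e-3):
    # z_a, z_b: (N, D) embeddings of two augmented views of the same scene
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.t() @ z_b) / z_a.shape[0]                     # (D, D) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()          # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy term
    return on_diag + lam * off_diag
```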
♻ ☆ Fine-Grained Behavior and Lane Constraints Guided Trajectory Prediction Method IEEE
Trajectory prediction, as a critical component of autonomous driving systems,
has attracted the attention of many researchers. Existing prediction algorithms
focus on extracting more detailed scene features or selecting more reasonable
trajectory destinations. However, in the face of dynamic and evolving future
movements of the target vehicle, these algorithms cannot provide a fine-grained
and continuous description of future behaviors and lane constraints, which
degrades the prediction accuracy. To address this challenge, we present BLNet,
a novel dual-stream architecture that synergistically integrates behavioral
intention recognition and lane constraint modeling through parallel attention
mechanisms. The framework generates fine-grained behavior state queries
(capturing spatial-temporal movement patterns) and lane queries (encoding lane
topology constraints), supervised by two auxiliary losses, respectively.
Subsequently, a two-stage decoder first produces trajectory proposals, then
performs point-level refinement by jointly incorporating both the continuity of
passed lanes and future motion features. Extensive experiments on two large
datasets, nuScenes and Argoverse, show that our network exhibits significant
performance gains over existing direct regression and goal-based algorithms.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ RePoseD: Efficient Relative Pose Estimation With Known Depth Information
Recent advances in monocular depth estimation (MDE) methods and their
improved accuracy open new possibilities for their applications. In this paper,
we investigate how monocular depth estimates can be used for relative pose
estimation. In particular, we are interested in answering the question of whether
using MDEs improves results over traditional point-based methods. We propose a
novel framework for estimating the relative pose of two cameras from point
correspondences with associated monocular depths. Since depth predictions are
typically defined up to an unknown scale or even both unknown scale and shift
parameters, our solvers jointly estimate the scale or both the scale and shift
parameters along with the relative pose. We derive efficient solvers
considering different types of depths for three camera configurations: (1) two
calibrated cameras, (2) two cameras with an unknown shared focal length, and
(3) two cameras with unknown different focal lengths. Our new solvers
outperform state-of-the-art depth-aware solvers in terms of speed and accuracy.
In extensive real experiments on multiple datasets and with various MDEs, we
discuss which depth-aware solvers are preferable in which situation. The code
will be made publicly available.
comment: 18 pages
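The scale/shift ambiguity mentioned above can be illustrated with a closed-form
least-squares alignment between predicted monocular depths and reference depths
at matched points; this shows only the ambiguity model, not one of the paper's
joint pose-and-scale solvers.

```python
# Least-squares fit of the unknown scale s and shift t that relate a monocular
# depth prediction to reference depths: min_{s,t} || s * d_pred + t - d_ref ||^2.
import numpy as np

def fit_scale_shift(d_pred: np.ndarray, d_ref: np.ndarray):
    # d_pred, d_ref: (N,) depths at matched points
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)   # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, d_ref, rcond=None)
    return float(s), float(t)
```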
♻ ☆ Enhanced Controllability of Diffusion Models via Feature Disentanglement and Realism-Enhanced Sampling Methods ECCV 2024
Wonwoong Cho, Hareesh Ravi, Midhun Harikumar, Vinh Khuc, Krishna Kumar Singh, Jingwan Lu, David I. Inouye, Ajinkya Kale
As Diffusion Models have shown promising performance, considerable effort has
been made to improve their controllability. However, how to
train Diffusion Models to have disentangled latent spaces and how to
naturally incorporate the disentangled conditions during the sampling process
have been underexplored. In this paper, we present a training framework for
feature disentanglement of Diffusion Models (FDiff). We further propose two
sampling methods that can boost the realism of our Diffusion Models and also
enhance the controllability. Concisely, we train Diffusion Models conditioned
on two latent features, a spatial content mask, and a flattened style
embedding. We rely on the inductive bias of the denoising process of Diffusion
Models to encode pose/layout information in the content feature and
semantic/style information in the style feature. Regarding the sampling
methods, we first generalize Composable Diffusion Models (GCDM) by breaking the
conditional independence assumption to allow for some dependence between
conditional inputs, which is shown to be effective in realistic generation in
our experiments. Second, we propose timestep-dependent weight scheduling for
content and style features to further improve the performance. We also observe
better controllability of our proposed methods compared to existing methods in
image manipulation and image translation.
comment: ECCV 2024; Code will be opened after a patent application is granted
♻ ☆ Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder
Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in
generating high quality images. However, enabling precise control of continuous
attributes, especially multiple attributes simultaneously, in a new domain
(e.g., numeric values like eye openness or car width) with text-only guidance
remains a significant challenge. To address this, we introduce the Attribute
(Att) Adapter, a novel plug-and-play module designed to enable fine-grained,
multi-attribute control in pretrained diffusion models. Our approach learns a
single control adapter from a set of sample images that can be unpaired and
contain multiple visual attributes. The Att-Adapter leverages the decoupled
cross attention module to naturally harmonize the multiple domain attributes
with text conditioning. We further introduce a Conditional Variational
Autoencoder (CVAE) into the Att-Adapter to mitigate overfitting, matching the
diverse nature of the visual world. Evaluations on two public datasets show
that Att-Adapter outperforms all LoRA-based baselines in controlling continuous
attributes. Additionally, our method enables a broader control range and also
improves disentanglement across multiple attributes, surpassing StyleGAN-based
techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic
data for training, and is easily scalable to multiple attributes within a
single model.
♻ ☆ MSCMNet: Multi-scale Semantic Correlation Mining for Visible-Infrared Person Re-Identification
The main challenge in the Visible-Infrared Person Re-Identification (VI-ReID)
task lies in how to extract discriminative features from different modalities
for matching purposes. While existing works primarily focus on minimizing
modality discrepancies, the modality information is not thoroughly leveraged.
To solve this problem, a Multi-scale Semantic Correlation Mining network
(MSCMNet) is proposed to comprehensively exploit semantic features at multiple
scales while keeping modality information loss as small as possible during
feature extraction. The proposed network contains three novel
components. Firstly, after taking into account the effective utilization of
modality information, the Multi-scale Information Correlation Mining Block
(MIMB) is designed to explore semantic correlations across multiple scales.
Secondly, in order to enrich the semantic information that MIMB can utilize, a
quadruple-stream feature extractor (QFE) with non-shared parameters is
specifically designed to extract information from different dimensions of the
dataset. Finally, the Quadruple Center Triplet Loss (QCT) is further proposed
to address the information discrepancy in the comprehensive features. Extensive
experiments on the SYSU-MM01, RegDB, and LLCM datasets demonstrate that the
proposed MSCMNet achieves the highest accuracy.
♻ ☆ Efficient Semantic Segmentation via Lightweight Multiple-Information Interaction Network
Recently, integrating the local modeling capabilities of Convolutional Neural
Networks (CNNs) with the global dependency strengths of Transformers has
created a sensation in the semantic segmentation community. However,
substantial computational workloads and high hardware memory demands remain
major obstacles to their further application in real-time scenarios. In this
work, we propose a Lightweight Multiple-Information Interaction Network
(LMIINet) for real-time semantic segmentation, which effectively combines CNNs
and Transformers while reducing redundant computations and memory footprints.
It features Lightweight Feature Interaction Bottleneck (LFIB) modules
comprising efficient convolutions that enhance context integration.
Additionally, improvements are made to the Flatten Transformer by enhancing
local and global feature interaction to capture detailed semantic information.
Incorporating a combination coefficient learning scheme in both LFIB and
Transformer blocks facilitates improved feature interaction. Extensive
experiments demonstrate that LMIINet excels in balancing accuracy and
efficiency. With only 0.72M parameters and 11.74G FLOPs (Floating Point
Operations), LMIINet achieves 72.0\% mIoU (mean Intersection over Union) at
100 FPS (Frames Per Second) on the Cityscapes test set and 69.94\% mIoU at 160
FPS on the CamVid test dataset using a single RTX2080Ti GPU.
comment: 10 pages, 6 figures, 9 tables
♻ ☆ A Comparative Study of Scanpath Models in Graph-Based Visualization
Angela Lopez-Cardona, Parvin Emami, Sebastian Idesis, Saravanakumar Duraisamy, Luis A. Leiva, Ioannis Arapakis
Information Visualization (InfoVis) systems utilize visual representations to
enhance data interpretation. Understanding how visual attention is allocated is
essential for optimizing interface design. However, collecting Eye-tracking
(ET) data presents challenges related to cost, privacy, and scalability.
Computational models provide alternatives for predicting gaze patterns, thereby
advancing InfoVis research. In our study, we conducted an ET experiment with 40
participants who analyzed graphs while responding to questions of varying
complexity within the context of digital forensics. We compared human scanpaths
with synthetic ones generated by models such as DeepGaze, UMSS, and Gazeformer.
Our research evaluates the accuracy of these models and examines how question
complexity and number of nodes influence performance. This work contributes to
the development of predictive modeling in visual analytics, offering insights
that can enhance the design and effectiveness of InfoVis systems.
♻ ☆ ConsistencyDet: A Few-step Denoising Framework for Object Detection Using the Consistency Model
Object detection, a quintessential task in the realm of perceptual computing,
can be tackled using a generative methodology. In the present study, we
introduce a novel framework designed to articulate object detection as a
denoising diffusion process, which operates on the perturbed bounding boxes of
annotated entities. This framework, termed \textbf{ConsistencyDet}, leverages
an innovative denoising concept known as the Consistency Model. The hallmark of
this model is its self-consistency feature, which empowers the model to map
distorted information from any time step back to its pristine state, thereby
realizing a \textbf{``few-step denoising''} mechanism. Such an attribute
markedly elevates the operational efficiency of the model, setting it apart
from the conventional Diffusion Model. Throughout the training phase,
ConsistencyDet initiates the diffusion sequence with noise-infused boxes
derived from the ground-truth annotations and conditions the model to perform
the denoising task. Subsequently, in the inference stage, the model employs a
denoising sampling strategy that commences with bounding boxes randomly sampled
from a normal distribution. Through iterative refinement, the model transforms
an assortment of arbitrarily generated boxes into definitive detections.
Comprehensive evaluations employing standard benchmarks, such as MS-COCO and
LVIS, corroborate that ConsistencyDet surpasses other leading-edge detectors in
performance metrics. Our code is available at
https://anonymous.4open.science/r/ConsistencyDet-37D5.
♻ ☆ SVInvNet: A Densely Connected Encoder-Decoder Architecture for Seismic Velocity Inversion
This study presents a deep learning-based approach to the seismic velocity
inversion problem, focusing on both noisy and noiseless training datasets of
varying sizes. Our Seismic Velocity Inversion Network (SVInvNet) introduces a
novel architecture that contains a multi-connection encoder-decoder structure
enhanced with dense blocks. This design is specifically tuned to effectively
process time series data, which is essential for addressing the challenges of
non-linear seismic velocity inversion. For training and testing, we created
diverse seismic velocity models, including multi-layered, faulty, and salt dome
categories. We also investigated how different kinds of ambient noise, both
coherent and stochastic, and the size of the training dataset affect learning
outcomes. SVInvNet is trained on datasets ranging from 750 to 6,000 samples and
is tested using a large benchmark dataset of 12,000 samples. Despite its fewer
parameters compared to the baseline model, SVInvNet achieves superior
performance with this dataset. The performance of SVInvNet was further
evaluated using the OpenFWI dataset and Marmousi-derived velocity models. The
comparative analysis clearly reveals the effectiveness of the proposed model.
comment: This is the preprint of the accepted manuscript to appear in IEEE
Transactions on Geoscience and Remote Sensing
♻ ☆ Self-Supervised Pretraining for Aerial Road Extraction IEEE
Deep neural networks for aerial image segmentation require large amounts of
labeled data, but high-quality aerial datasets with precise annotations are
scarce and costly to produce. To address this limitation, we propose a
self-supervised pretraining method that improves segmentation performance while
reducing reliance on labeled data. Our approach uses inpainting-based
pretraining, where the model learns to reconstruct missing regions in aerial
images, capturing their inherent structure before being fine-tuned for road
extraction. This method improves generalization, enhances robustness to domain
shifts, and is invariant to model architecture and dataset choice. Experiments
show that our pretraining significantly boosts segmentation accuracy,
especially in low-data regimes, making it a scalable solution for aerial image
analysis.
comment: Accepted at 36th IEEE Intelligent Vehicles Symposium (IV) 2025 Joint
Workshop on Safety, Metrics and Benchmarks for Autonomous Driving
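A minimal sketch of the inpainting-based pretraining step: a random rectangular
region of each aerial tile is masked and the model is trained to reconstruct
it. The mask size, loss, and backbone interface are placeholder assumptions,
not the paper's configuration.

```python
# Inpainting-based self-supervised pretraining step (illustrative sketch).
import torch
import torch.nn.functional as F

def inpainting_pretrain_step(model, images, mask_frac=0.25):
    # images: (B, C, H, W) aerial tiles; model: any image-to-image network
    B, C, H, W = images.shape
    mask = torch.ones(B, 1, H, W, device=images.device)
    h, w = int(H * mask_frac), int(W * mask_frac)
    for b in range(B):                          # one random hole per image
        y = torch.randint(0, H - h, (1,)).item()
        x = torch.randint(0, W - w, (1,)).item()
        mask[b, :, y:y + h, x:x + w] = 0.0
    recon = model(images * mask)                # model only sees the masked input
    return F.l1_loss(recon * (1 - mask), images * (1 - mask))  # loss on the hole
```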
♻ ☆ DG-TTA: Out-of-domain Medical Image Segmentation through Augmentation and Descriptor-driven Domain Generalization and Test-Time Adaptation
Purpose: Applying pre-trained medical deep learning segmentation models on
out-of-domain images often yields predictions of insufficient quality. In this
study, we propose to use a powerful generalizing descriptor along with
augmentation to enable domain-generalized pre-training and test-time
adaptation, achieving high-quality segmentation in unseen domains.
Materials and Methods: In this retrospective study five different publicly
available datasets (2012 to 2022) including 3D CT and MRI images are used to
evaluate segmentation performance in out-of-domain scenarios. The settings
include abdominal, spine, and cardiac imaging. The data is randomly split into
training and test samples. Domain-generalized pre-training on source data is
used to obtain the best initial performance in the target domain. We introduce
the combination of the generalizing SSC descriptor and GIN intensity
augmentation for optimal generalization. Segmentation results are subsequently
optimized at test time, where we propose to adapt the pre-trained models for
every unseen scan with a consistency scheme using the same
augmentation-descriptor combination. The segmentation is evaluated using Dice
similarity and Hausdorff distance and the significance of improvements is
tested with the Wilcoxon signed-rank test.
Results: The proposed generalized pre-training and subsequent test-time
adaptation improves model performance significantly in CT to MRI cross-domain
prediction for abdominal (+46.2% and +28.2% Dice), spine (+72.9%), and cardiac
(+14.2% and +55.7% Dice) scenarios (p<0.001).
Conclusion: Our method enables optimal, independent usage of medical image
source and target data and bridges domain gaps successfully with a compact and
efficient methodology. Open-source code available at:
https://github.com/multimodallearning/DG-TTA
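The test-time adaptation idea can be sketched as briefly fine-tuning the
pre-trained model on each unseen scan so that predictions under two random
augmentations agree; the augmentation and consistency loss below are simplified
stand-ins for the GIN/SSC combination used in the paper.

```python
# One test-time adaptation step on a single unseen scan (simplified sketch).
import torch
import torch.nn.functional as F

def tta_consistency_step(model, scan, optimizer, augment):
    # scan: (1, C, D, H, W) volume; augment: a random intensity transform (assumed)
    pred_a = model(augment(scan)).softmax(dim=1)
    pred_b = model(augment(scan)).softmax(dim=1)
    loss = F.mse_loss(pred_a, pred_b)       # enforce agreement between the two views
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```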
♻ ☆ Nonhuman Primate Brain Tissue Segmentation Using a Transfer Learning Approach
Zhen Lin, Hongyu Yuan, Richard Barcus, Qing Lyu, Sucheta Chakravarty, Megan E. Lipford, Carol A. Shively, Suzanne Craft, Mohammad Kawas, Jeongchul Kim, Christopher T. Whitlow
Non-human primates (NHPs) serve as critical models for understanding human
brain function and neurological disorders due to their close evolutionary
relationship with humans. Accurate brain tissue segmentation in NHPs is
critical for understanding neurological disorders, but challenging due to the
scarcity of annotated NHP brain MRI datasets, the small size of the NHP brain,
the limited resolution of available imaging data and the anatomical differences
between human and NHP brains. To address these challenges, we propose a novel
approach utilizing STU-Net with transfer learning to leverage knowledge
transferred from human brain MRI data to enhance segmentation accuracy in the
NHP brain MRI, particularly when training data is limited. The combination of
STU-Net and transfer learning effectively delineates complex tissue boundaries
and captures fine anatomical details specific to NHP brains. Notably, our
method demonstrated improvement in segmenting small subcortical structures such
as putamen and thalamus that are challenging to resolve with limited spatial
resolution and tissue contrast, and achieved a DSC of over 0.88, an IoU of over
0.8, and an HD95 under 7. This study introduces a robust method for multi-class brain
tissue segmentation in NHPs, potentially accelerating research in evolutionary
neuroscience and preclinical studies of neurological disorders relevant to
human health.
♻ ☆ Exploring Scene Affinity for Semi-Supervised LiDAR Semantic Segmentation CVPR2025
This paper explores scene affinity (AIScene), namely intra-scene consistency
and inter-scene correlation, for semi-supervised LiDAR semantic segmentation in
driving scenes. Adopting teacher-student training, AIScene employs a teacher
network to generate pseudo-labeled scenes from unlabeled data, which then
supervise the student network's learning. Unlike most methods that include all
points in pseudo-labeled scenes for forward propagation but only pseudo-labeled
points for backpropagation, AIScene removes points without pseudo-labels,
ensuring consistency in both forward and backward propagation within the scene.
This simple point erasure strategy effectively prevents unsupervised,
semantically ambiguous points (excluded in backpropagation) from affecting the
learning of pseudo-labeled points. Moreover, AIScene incorporates patch-based
data augmentation, mixing multiple scenes at both scene and instance levels.
Compared to existing augmentation techniques that typically perform scene-level
mixing between two scenes, our method enhances the semantic diversity of
labeled (or pseudo-labeled) scenes, thereby improving the semi-supervised
performance of segmentation models. Experiments show that AIScene outperforms
previous methods on two popular benchmarks across four settings, achieving
notable improvements of 1.9% and 2.1% under the most challenging 1% labeled-data setting.
comment: Accepted by CVPR2025
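The point-erasure strategy can be illustrated as follows: points whose teacher
prediction falls below a confidence threshold are removed entirely before the
student forward pass, so forward and backward propagation see the same points.
The threshold and tensor shapes are assumptions for illustration.

```python
# Erase points without confident pseudo-labels from a LiDAR scene (sketch).
import torch

def erase_unlabeled_points(points, teacher_logits, conf_thresh=0.9):
    # points: (N, 3+F) LiDAR points; teacher_logits: (N, num_classes)
    conf, pseudo_labels = teacher_logits.softmax(dim=-1).max(dim=-1)
    keep = conf >= conf_thresh                 # drop semantically ambiguous points
    return points[keep], pseudo_labels[keep]
```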
♻ ☆ Introducing the Short-Time Fourier Kolmogorov Arnold Network: A Dynamic Graph CNN Approach for Tree Species Classification in 3D Point Clouds
Said Ohamouddou, Mohamed Ohamouddou, Hanaa El Afia, Abdellatif El Afia, Rafik Lasri, Raddouane Chiheb
Accurate classification of tree species based on Terrestrial Laser Scanning
(TLS) and Airborne Laser Scanning (ALS) is essential for biodiversity
conservation. While advanced deep learning models for 3D point cloud
classification have demonstrated strong performance in this domain, their high
complexity often hinders the development of efficient, low-computation
architectures. In this paper, we introduce STFT-KAN, a novel Kolmogorov-Arnold
network that integrates the Short-Time Fourier Transform (STFT), which can
replace the standard linear layer with activation. We implemented STFT-KAN
within a lightweight version of DGCNN, called liteDGCNN, to classify tree
species using the TLS data. Our experiments show that STFT-KAN outperforms
existing KAN variants by effectively balancing model complexity and performance
with parameter count reduction, achieving competitive results compared to
MLP-based models. Additionally, we evaluated a hybrid architecture that
combines MLP in edge convolution with STFT-KAN in other layers, achieving
comparable performance to MLP models while reducing the parameter count by 50%
and 75% compared to other KAN-based variants. Furthermore, we compared our
model to leading 3D point cloud learning approaches, demonstrating that
STFT-KAN delivers competitive results compared to the state-of-the-art method
PointMLP lite with an 87% reduction in parameter count.
♻ ☆ Exploiting Mixture-of-Experts Redundancy Unlocks Multimodal Generative Abilities
Raman Dutt, Harleen Hanspal, Guoxuan Xia, Petru-Daniel Tudosiu, Alexander Black, Yongxin Yang, Steven McDonagh, Sarah Parisot
In this work, we undertake the challenge of augmenting the existing
generative capabilities of pre-trained text-only large language models (LLMs)
with multi-modal generation capability while satisfying two core constraints:
C1 preserving the original language generative capabilities
with negligible performance degradation, and C2 adhering to a small parameter
budget to learn the new modality, ensuring scalability and efficiency. In
contrast to current approaches that add dedicated modules, thereby
significantly increasing the parameter count, we propose a method that
leverages the underutilized capacity inherent in deep models. Specifically, we
exploit the parameter redundancy within Mixture-of-Experts (MoEs) as a source
of additional capacity for learning a new modality, enabling better parameter
efficiency (C2). Moreover, we preserve the original language generation
capabilities by applying low-rank adaptation exclusively to the tokens of the
new modality (C1). Furthermore, we introduce a novel parameter initialization
scheme based on the Gromov-Wasserstein distance to improve convergence and
training stability. Through an extensive analysis of the routing mechanism, we
uncover the emergence of modality-specific pathways and decreased redundancy
within the experts that can efficiently unlock multi-modal generative
capabilities. Overall, our method can be seamlessly applied to a wide range of
contemporary LLMs, providing a new pathway for transitioning from uni-modal to
multi-modal architectures.
♻ ☆ DoubleDiffusion: Combining Heat Diffusion with Denoising Diffusion for Texture Generation on 3D Meshes
Xuyang Wang, Ziang Cheng, Zhenyu Li, Jiayu Yang, Haorui Ji, Pan Ji, Mehrtash Harandi, Richard Hartley, Hongdong Li
This paper addresses the problem of generating textures for 3D mesh assets.
Existing approaches often rely on image diffusion models to generate multi-view
image observations, which are then transformed onto the mesh surface to produce
a single texture. However, due to the gap between multi-view images and 3D
space, such a process is susceptible to a range of issues such as geometric
inconsistencies, visibility occlusion, and baking artifacts. To overcome this
problem, we propose a novel approach that directly generates texture on 3D
meshes. Our approach leverages heat dissipation diffusion, which serves as an
efficient operator that propagates features on the geometric surface of a mesh,
while remaining insensitive to the specific layout of the wireframe. By
integrating this technique into a generative diffusion pipeline, we
significantly improve the efficiency of texture generation compared to existing
texture generation methods. We term our approach DoubleDiffusion, as it
combines heat dissipation diffusion with denoising diffusion to enable native
generative learning on 3D mesh surfaces.
comment: Codes: https://github.com/Wxyxixixi/DoubleDiffusion_3D_Mesh
♻ ☆ Attention-Guided Multi-scale Interaction Network for Face Super-Resolution
Recently, CNN and Transformer hybrid networks demonstrated excellent
performance in face super-resolution (FSR) tasks. Since hybrid networks contain
numerous features at different scales, how to fuse these multi-scale features and
promote their complementarity is crucial for enhancing FSR. However, existing
hybrid network-based FSR methods ignore this, only simply combining the
Transformer and CNN. To address this issue, we propose an attention-guided
Multi-scale interaction network (AMINet), which contains local and global
feature interactions and encoder-decoder phase feature interactions.
Specifically, we propose a Local and Global Feature Interaction Module (LGFI)
to promote fusions of global features and different receptive fields' local
features extracted by our Residual Depth Feature Extraction Module (RDFE).
Additionally, we propose a Selective Kernel Attention Fusion Module (SKAF) to
adaptively select fusions of different features within LGFI and encoder-decoder
phases. Our above design allows the free flow of multi-scale features from
within modules and between encoder and decoder, which can promote the
complementarity of different scale features to enhance FSR. Comprehensive
experiments confirm that our method consistently performs well with less
computational consumption and faster inference.
comment: 13 pages, 11 figures, 10 tables
♻ ☆ UniGS: Modeling Unitary 3D Gaussians for Novel View Synthesis from Sparse-view Images
In this work, we introduce UniGS, a novel 3D Gaussian reconstruction and
novel view synthesis model that predicts a high-fidelity representation of 3D
Gaussians from an arbitrary number of posed sparse-view images. Previous methods
often regress 3D Gaussians locally on a per-pixel basis for each view and then
transfer them to world space and merge them through point concatenation. In
contrast, our approach involves modeling unitary 3D Gaussians in world space
and updating them layer by layer. To leverage information from multi-view
inputs for updating the unitary 3D Gaussians, we develop a DETR (DEtection
TRansformer)-like framework, which treats 3D Gaussians as queries and updates
their parameters by performing multi-view cross-attention (MVDFA) across
multiple input images, which are treated as keys and values. This approach
effectively avoids the `ghosting' issue and allocates more 3D Gaussians to
complex regions. Moreover, since the number of 3D Gaussians used as decoder
queries is independent of the number of input views, our method allows an arbitrary number of
multi-view images as input without causing memory explosion or requiring
retraining. Extensive experiments validate the advantages of our approach,
showcasing superior performance over existing methods quantitatively (improving
PSNR by 4.2 dB when trained on Objaverse and tested on the GSO benchmark) and
qualitatively. The code will be released at https://github.com/jwubz123/UNIG.
♻ ☆ Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image
In many robotics and VR/AR applications, fast camera motions cause a high
level of motion blur, causing existing camera pose estimation methods to fail.
In this work, we propose a novel framework that leverages motion blur as a rich
cue for motion estimation rather than treating it as an unwanted artifact. Our
approach works by predicting a dense motion flow field and a monocular depth
map directly from a single motion-blurred image. We then recover the
instantaneous camera velocity by solving a linear least squares problem under
the small motion assumption. In essence, our method produces an IMU-like
measurement that robustly captures fast and aggressive camera movements. To
train our model, we construct a large-scale dataset with realistic synthetic
motion blur derived from ScanNet++v2 and further refine our model by training
end-to-end on real data using our fully differentiable pipeline. Extensive
evaluations on real-world benchmarks demonstrate that our method achieves
state-of-the-art angular and translational velocity estimates, outperforming
current methods like MASt3R and COLMAP.
comment: Project page: https://jerredchen.github.io/image-as-imu/
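The linear least-squares step can be sketched with the classical small-motion
image-motion model: each pixel's flow is linear in the translational and
rotational velocity once depth is known. Axis and sign conventions below are
assumptions and may differ from the paper's exact formulation.

```python
# Recover instantaneous camera velocity [tx, ty, tz, wx, wy, wz] from dense
# flow and depth via linear least squares, assuming a normalized pinhole
# camera and small motion (conventions are an assumption of this sketch).
import numpy as np

def solve_velocity(x, y, flow_u, flow_v, depth):
    # x, y: normalized image coordinates; flow_u, flow_v: flow; depth: Z. All (N,).
    inv_z = 1.0 / depth
    zeros = np.zeros_like(x)
    # One row per pixel for the u-component and the v-component of the flow.
    A_u = np.stack([-inv_z, zeros, x * inv_z, x * y, -(1 + x**2), y], axis=1)
    A_v = np.stack([zeros, -inv_z, y * inv_z, 1 + y**2, -x * y, -x], axis=1)
    A = np.concatenate([A_u, A_v], axis=0)           # (2N, 6)
    b = np.concatenate([flow_u, flow_v], axis=0)     # (2N,)
    vel, *_ = np.linalg.lstsq(A, b, rcond=None)      # [tx, ty, tz, wx, wy, wz]
    return vel
```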
♻ ☆ Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning
This paper investigates rule-based reinforcement learning (RL) fine-tuning
for visual classification using multi-modal large language models (MLLMs) and
the role of the thinking process. We begin by exploring \textit{CLS-RL}, a
method that leverages verifiable signals as rewards to encourage MLLMs to
'think' before classifying. Our experiments across \textbf{eleven} datasets
demonstrate that CLS-RL achieves significant improvements over supervised
fine-tuning (SFT) in both base-to-new generalization and few-shot learning
scenarios. Notably, we observe a 'free-lunch' phenomenon where fine-tuning on
one dataset unexpectedly enhances performance on others, suggesting that RL
effectively teaches fundamental classification skills. However, we question
whether the explicit thinking, a critical aspect of rule-based RL, is always
beneficial or indispensable. Challenging the conventional assumption that
complex reasoning enhances performance, we introduce \textit{No-Thinking-RL}, a
novel approach that minimizes the model's thinking during fine-tuning by
utilizing an equality accuracy reward. Our experiments reveal that
No-Thinking-RL achieves superior in-domain performance and generalization
capabilities compared to CLS-RL, while requiring significantly less fine-tuning
time. This underscores that, contrary to prevailing assumptions, reducing the
thinking process can lead to more efficient and effective MLLM fine-tuning for
some visual tasks. Furthermore, No-Thinking-RL demonstrates enhanced
performance on other visual benchmarks, such as a 6.4\% improvement on CVBench.
We hope our findings provide insights into the impact of thinking in RL-based
fine-tuning.
comment: Preprint, work in progress. Add results on CVBench
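A minimal sketch of an equality-based accuracy reward of the kind described
above: the model earns reward 1 only when its predicted label exactly matches
the ground truth. The string normalization details are assumptions.

```python
# Rule-based "equality accuracy" reward for RL fine-tuning (illustrative).
def accuracy_reward(model_answer: str, ground_truth: str) -> float:
    # Reward 1.0 for an exact (case-insensitive, whitespace-trimmed) match.
    return 1.0 if model_answer.strip().lower() == ground_truth.strip().lower() else 0.0
```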
♻ ☆ PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
Text-to-video (T2V) generation has been recently enabled by transformer-based
diffusion models, but current T2V models lack the capability to adhere to
real-world common knowledge and physical rules, due to their limited
understanding of physical realism and deficiency in temporal modeling. Existing
solutions are either data-driven or require extra model inputs, but cannot be
generalizable to out-of-distribution domains. In this paper, we present PhyT2V,
a new data-independent T2V technique that expands the current T2V model's
capability of video generation to out-of-distribution domains, by enabling
chain-of-thought and step-back reasoning in T2V prompting. Our experiments show
that PhyT2V improves existing T2V models' adherence to real-world physical
rules by 2.3x, and achieves 35% improvement compared to T2V prompt enhancers.
The source codes are available at: https://github.com/pittisl/PhyT2V.
comment: 28 pages
♻ ☆ FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation
Vision Foundation Models (VFMs) excel in generalization due to large-scale
pretraining, but fine-tuning them for Domain Generalized Semantic Segmentation
(DGSS) while maintaining this ability remains challenging. Existing approaches
either selectively fine-tune parameters or freeze the VFMs and update only the
adapters, both of which may underutilize the VFMs' full potential in DGSS
tasks. We observe that domain-sensitive parameters in VFMs, arising from task
and distribution differences, can hinder generalization. To address this, we
propose \textbf{FisherTune}, a robust fine-tuning method guided by the
Domain-Related Fisher Information Matrix (DR-FIM). DR-FIM measures parameter
sensitivity across tasks and domains, enabling selective updates that preserve
generalization and enhance DGSS adaptability. FisherTune incorporates
variational inference to stabilize DR-FIM estimation, treating parameters as
Gaussian-distributed variables and leveraging pre-trained priors. Extensive
experiments show that FisherTune achieves superior cross-domain segmentation
while maintaining generalization, outperforming selective-parameter and
adapter-based methods.
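As a rough illustration of Fisher-guided parameter selection, the sketch below
accumulates a diagonal Fisher estimate from squared gradients over a few
batches; parameters with the highest values would be treated as the most
domain-sensitive. This illustrates the general idea only, not FisherTune's
DR-FIM estimator with variational inference.

```python
# Diagonal Fisher information estimate per parameter (illustrative sketch).
import torch

def diagonal_fisher(model, loader, loss_fn, num_batches=8):
    # Accumulate squared gradients as a diagonal Fisher approximation.
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.requires_grad}
    for i, (x, y) in enumerate(loader):
        if i >= num_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if n in fisher and p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(num_batches, 1) for n, f in fisher.items()}
```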
♻ ☆ Lie Detector: Unified Backdoor Detection via Cross-Examination Framework
Xuan Wang, Siyuan Liang, Dongping Liao, Han Fang, Aishan Liu, Xiaochun Cao, Yu-liang Lu, Ee-Chien Chang, Xitong Gao
Institutions with limited data and computing resources often outsource model
training to third-party providers in a semi-honest setting, assuming adherence
to prescribed training protocols with a pre-defined learning paradigm (e.g.,
supervised or semi-supervised learning). However, this practice can introduce
severe security risks, as adversaries may poison the training data to embed
backdoors into the resulting model. Existing detection approaches predominantly
rely on statistical analyses, which often fail to maintain accurate detection
across different learning paradigms. To address this
challenge, we propose a unified backdoor detection framework in the semi-honest
setting that exploits cross-examination of model inconsistencies between two
independent service providers. Specifically, we integrate central kernel
alignment to enable robust feature similarity measurements across different
model architectures and learning paradigms, thereby facilitating precise
recovery and identification of backdoor triggers. We further introduce backdoor
fine-tuning sensitivity analysis to distinguish backdoor triggers from
adversarial perturbations, substantially reducing false positives. Extensive
experiments demonstrate that our method achieves superior detection
performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines
across supervised, semi-supervised, and autoregressive learning tasks,
respectively. Notably, it is the first to effectively detect backdoors in
multimodal large language models, further highlighting its broad applicability
and advancing secure deep learning.
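Feature similarity across different architectures is commonly measured with
linear Centered Kernel Alignment (CKA); the sketch below shows the standard
linear-CKA computation as the kind of measurement such a cross-examination
relies on. How it is used for trigger recovery is specific to the paper.

```python
# Linear Centered Kernel Alignment (CKA) between two feature matrices.
import torch

def linear_cka(X, Y):
    # X: (N, D1), Y: (N, D2) features of the same N inputs from two models.
    X = X - X.mean(0, keepdim=True)
    Y = Y - Y.mean(0, keepdim=True)
    hsic = (Y.t() @ X).norm(p='fro') ** 2
    return (hsic / ((X.t() @ X).norm(p='fro') * (Y.t() @ Y).norm(p='fro'))).item()
```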
♻ ☆ An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models
Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a
Noise-Conditional Framework (NCF) remain challenging to apply to 3D scene
understanding tasks, as the complex geometric details in scenes increase the difficulty of
fitting the gradients of the data distribution (the scores) from semantic
labels. This also results in longer training and inference time for DDPMs
compared to non-DDPMs. From a different perspective, we delve deeply into the
model paradigm dominated by the Conditional Network. In this paper, we propose
an end-to-end robust semantic Segmentation Network based on a Conditional-Noise
Framework (CNF) of DDPMs, named CDSegNet. Specifically, CDSegNet models the
Noise Network (NN) as a learnable noise-feature generator. This enables the
Conditional Network (CN) to understand 3D scene semantics under multi-level
feature perturbations, enhancing the generalization in unseen scenes.
Meanwhile, benefiting from the noise system of DDPMs, CDSegNet exhibits strong
noise and sparsity robustness in experiments. Moreover, thanks to CNF, CDSegNet
can generate the semantic labels in a single-step inference like non-DDPMs, due
to avoiding directly fitting the scores from semantic labels in the dominant
network of CDSegNet. On public indoor and outdoor benchmarks, CDSegNet
significantly outperforms existing methods, achieving state-of-the-art
performance.
♻ ☆ OncoReg: Medical Image Registration for Oncological Challenges
Wiebke Heyer, Yannic Elser, Lennart Berkel, Xinrui Song, Xuanang Xu, Pingkun Yan, Xi Jia, Jinming Duan, Zi Li, Tony C. W. Mok, BoWen LI, Christian Staackmann, Christoph Großbröhmer, Lasse Hansen, Alessa Hering, Malte M. Sieren, Mattias P. Heinrich
In modern cancer research, the vast volume of medical data generated is often
underutilised due to challenges related to patient privacy. The OncoReg
Challenge addresses this issue by enabling researchers to develop and validate
image registration methods through a two-phase framework that ensures patient
privacy while fostering the development of more generalisable AI models. Phase
one involves working with a publicly available dataset, while phase two focuses
on training models on a private dataset within secure hospital networks.
OncoReg builds upon the foundation established by the Learn2Reg Challenge by
incorporating the registration of interventional cone-beam computed tomography
(CBCT) with standard planning fan-beam CT (FBCT) images in radiotherapy.
Accurate image registration is crucial in oncology, particularly for dynamic
treatment adjustments in image-guided radiotherapy, where precise alignment is
necessary to minimise radiation exposure to healthy tissues while effectively
targeting tumours. This work details the methodology and data behind the
OncoReg Challenge and provides a comprehensive analysis of the competition
entries and results. Findings reveal that feature extraction plays a pivotal
role in this registration task. A new method emerging from this challenge
demonstrated its versatility, while established approaches continue to perform
comparably to newer techniques. Both deep learning and classical approaches
still play significant roles in image registration, with the combination of
methods - particularly in feature extraction - proving most effective.
comment: 26 pages, 6 figures
♻ ☆ MambaPEFT: Exploring Parameter-Efficient Fine-Tuning for Mamba ICLR2025
An ecosystem of Transformer-based models has been established by building
large models with extensive data. Parameter-efficient fine-tuning (PEFT) is a
crucial technology for deploying these models to downstream tasks with minimal
cost while achieving effective performance. Recently, Mamba, a State Space
Model (SSM)-based model, has attracted attention as a potential alternative to
Transformers. While many large-scale Mamba-based models have been proposed,
efficiently adapting pre-trained Mamba-based models to downstream tasks remains
unexplored. In this paper, we conduct an exploratory analysis of PEFT methods
for Mamba. We investigate the effectiveness of existing PEFT methods for
Transformers when applied to Mamba. We also modify these methods to better
align with the Mamba architecture. Additionally, we propose new Mamba-specific
PEFT methods that leverage the distinctive structure of Mamba. Our experiments
indicate that PEFT performs more effectively for Mamba than Transformers.
Lastly, we demonstrate how to effectively combine multiple PEFT methods and
provide a framework that outperforms previous works. To ensure reproducibility,
we will release the code after publication.
comment: Accepted to ICLR2025
♻ ☆ Stable-Makeup: When Real-World Makeup Transfer Meets Diffusion Model
Current makeup transfer methods are limited to simple makeup styles, making
them difficult to apply in real-world scenarios. In this paper, we introduce
Stable-Makeup, a novel diffusion-based makeup transfer method capable of
robustly transferring a wide range of real-world makeup onto user-provided
faces. Stable-Makeup is based on a pre-trained diffusion model and utilizes a
Detail-Preserving (D-P) makeup encoder to encode makeup details. It also
employs content and structural control modules to preserve the content and
structural information of the source image. With the aid of our newly added
makeup cross-attention layers in U-Net, we can accurately transfer the detailed
makeup to the corresponding position in the source image. After
content-structure decoupling training, Stable-Makeup can maintain content and
the facial structure of the source image. Moreover, our method has demonstrated
strong robustness and generalizability, making it applicable to various tasks
such as cross-domain makeup transfer, makeup-guided text-to-image generation
and so on. Extensive experiments have demonstrated that our approach delivers
state-of-the-art (SOTA) results among existing makeup transfer methods and
shows highly promising potential for broad applications in various related
fields. Code released: https://github.com/Xiaojiu-z/Stable-Makeup
♻ ☆ Local Information Matters: Inference Acceleration For Grounded Conversation Generation Models Through Adaptive Local-Aware Token Pruning
Grounded Conversation Generation (GCG) is an emerging vision-language task
that requires models to generate natural language responses seamlessly
intertwined with corresponding object segmentation masks. Recent models, such
as GLaMM and OMG-LLaVA, achieve pixel-level grounding but incur significant
computational costs due to processing a large number of visual tokens. Existing
token pruning methods, like FastV and PyramidDrop, fail to preserve the local
visual features critical for accurate grounding, leading to substantial
performance drops in GCG tasks. To address this, we propose Adaptive
Local-Aware Token Pruning (ALTP), a simple yet effective framework that
accelerates GCG models by prioritizing local object information. ALTP
introduces two key components: (1) Detail Density Capture (DDC), which uses
superpixel segmentation to retain tokens in object-centric regions, preserving
fine-grained details, and (2) Dynamic Density Formation (DDF), which
dynamically allocates tokens based on information density, ensuring higher
retention in semantically rich areas. Extensive experiments on the GranDf
dataset demonstrate that ALTP significantly outperforms existing token pruning
methods, such as FastV and PyramidDrop, on both GLaMM and OMG-LLaVA models.
Notably, when applied to GLaMM, ALTP achieves a 90% reduction in visual tokens
with a 4.9% improvement in AP50 and a 5.0% improvement in Recall compared to
PyramidDrop. Similarly, on OMG-LLaVA, ALTP improves AP by 2.1% and mIOU by 3.0%
at a 90% token reduction compared with PyramidDrop.
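For intuition, here is a rough sketch of density-guided token retention in the spirit of DDC/DDF: each region (e.g. a superpixel) receives a share of the token budget proportional to its total "information density". The feature-norm density proxy, the random region labels, and the per-region quota rule below are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def density_guided_prune(tokens: torch.Tensor, region_ids: torch.Tensor,
                         keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep roughly `keep_ratio` of the tokens, allocating the budget across
    regions in proportion to their total 'information density'.

    tokens:     (N, D) visual token features
    region_ids: (N,)   integer region label per token (e.g. from superpixels)
    Returns indices of the retained tokens.
    """
    density = tokens.norm(dim=-1)                 # proxy score; an assumption
    budget = max(1, int(keep_ratio * tokens.shape[0]))
    regions = region_ids.unique()
    region_mass = torch.stack([density[region_ids == r].sum() for r in regions])
    quota = (region_mass / region_mass.sum() * budget).round().clamp(min=1).long()
    kept = []
    for r, q in zip(regions, quota):
        idx = (region_ids == r).nonzero(as_tuple=True)[0]
        k = min(int(q), idx.numel())
        top = density[idx].topk(k).indices        # densest tokens within the region
        kept.append(idx[top])
    return torch.cat(kept)

tokens = torch.randn(576, 1024)                   # e.g. 24x24 ViT patch tokens
region_ids = torch.randint(0, 32, (576,))         # placeholder superpixel labels
keep = density_guided_prune(tokens, region_ids, keep_ratio=0.1)
print(keep.shape)
```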
♻ ☆ Mr. DETR: Instructive Multi-Route Training for Detection Transformers CVPR 2025
Existing methods enhance the training of detection transformers by
incorporating an auxiliary one-to-many assignment. In this work, we treat the
model as a multi-task framework, simultaneously performing one-to-one and
one-to-many predictions. We investigate the roles of each component in the
transformer decoder across these two training targets, including
self-attention, cross-attention, and feed-forward network. Our empirical
results demonstrate that any independent component in the decoder can
effectively learn both targets simultaneously, even when other components are
shared. This finding leads us to propose a multi-route training mechanism,
featuring a primary route for one-to-one prediction and two auxiliary training
routes for one-to-many prediction. We enhance the training mechanism with a
novel instructive self-attention that dynamically and flexibly guides object
queries for one-to-many prediction. The auxiliary routes are removed during
inference, ensuring no impact on model architecture or inference cost. We
conduct extensive experiments on various baselines, achieving consistent
improvements as shown in Figure 1. Project page:
https://visual-ai.github.io/mrdetr
comment: Accepted by CVPR 2025, Project page:
https://visual-ai.github.io/mrdetr
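A minimal sketch of the multi-route idea: a primary one-to-one head always runs, while auxiliary one-to-many routes are active only in training mode, so inference cost is untouched. The plain linear heads, shared queries, and route count are illustrative assumptions; the paper's instructive self-attention and shared decoder components are not modeled here.

```python
import torch
import torch.nn as nn

class MultiRouteHead(nn.Module):
    """Sketch of a detection head with one primary (one-to-one) route and
    auxiliary (one-to-many) routes that are used only during training."""
    def __init__(self, dim: int = 256, num_classes: int = 80, num_aux_routes: int = 2):
        super().__init__()
        self.primary = nn.Linear(dim, num_classes + 4)   # class logits + box
        self.aux = nn.ModuleList([nn.Linear(dim, num_classes + 4)
                                  for _ in range(num_aux_routes)])

    def forward(self, queries: torch.Tensor):
        out = {"primary": self.primary(queries)}         # one-to-one route
        if self.training:                                # auxiliary routes only in training
            out["aux"] = [head(queries) for head in self.aux]
        return out

head = MultiRouteHead()
queries = torch.randn(2, 300, 256)                       # (batch, queries, dim)
head.train()
print(len(head(queries)["aux"]))                         # 2 auxiliary routes
head.eval()
print("aux" in head(queries))                            # False: no extra inference cost
```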
♻ ☆ ControlSR: Taming Diffusion Models for Consistent Real-World Image Super Resolution
We present ControlSR, a new method that can tame Diffusion Models for
consistent real-world image super-resolution (Real-ISR). Previous Real-ISR
models mostly focus on how to activate more generative priors of text-to-image
diffusion models to make the output high-resolution (HR) images look better.
However, since these methods rely too much on the generative priors, the
content of the output images is often inconsistent with the input LR ones. To
mitigate the above issue, in this work, we tame Diffusion Models by effectively
utilizing LR information to impose stronger constraints on the control signals
from ControlNet in the latent space. We show that our method can produce
higher-quality control signals, which enables the super-resolution results to
be more consistent with the LR image and leads to clearer visual results. In
addition, we also propose an inference strategy that imposes constraints in the
latent space using LR information, allowing for the simultaneous improvement of
fidelity and generative ability. Experiments demonstrate that our model can
achieve better performance across multiple metrics on several test sets and
generate more consistent SR results with LR images than existing methods. Our
code is available at https://github.com/HVision-NKU/ControlSR.
♻ ☆ StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
Shangjin Zhai, Zhichao Ye, Jialin Liu, Weijian Xie, Jiaqi Hu, Zhen Peng, Hua Xue, Danpeng Chen, Xiaomeng Wang, Lei Yang, Nan Wang, Haomin Liu, Guofeng Zhang
Recent advances in large reconstruction and generative models have
significantly improved scene reconstruction and novel view generation. However,
due to compute limitations, each inference with these large models is confined
to a small area, making long-range consistent scene generation challenging. To
address this, we propose StarGen, a novel framework that employs a pre-trained
video diffusion model in an autoregressive manner for long-range scene
generation. The generation of each video clip is conditioned on the 3D warping
of spatially adjacent images and the temporally overlapping image from
previously generated clips, improving spatiotemporal consistency in long-range
scene generation with precise pose control. The spatiotemporal condition is
compatible with various input conditions, facilitating diverse tasks, including
sparse view interpolation, perpetual view generation, and layout-conditioned
city generation. Quantitative and qualitative evaluations demonstrate StarGen's
superior scalability, fidelity, and pose accuracy compared to state-of-the-art
methods. Project page: https://zju3dv.github.io/StarGen.
♻ ☆ AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors ICLR 2025
Visuo-tactile sensors aim to emulate human tactile perception, enabling
robots to precisely understand and manipulate objects. Over time, numerous
meticulously designed visuo-tactile sensors have been integrated into robotic
systems, aiding in completing various tasks. However, the distinct data
characteristics of these low-standardized visuo-tactile sensors hinder the
establishment of a powerful tactile perception system. We consider that the key
to addressing this issue lies in learning unified multi-sensor representations,
thereby integrating the sensors and promoting tactile knowledge transfer
between them. To achieve unified representation of this nature, we introduce
TacQuad, an aligned multi-modal multi-sensor tactile dataset from four
different visuo-tactile sensors, which enables the explicit integration of
various sensors. Recognizing that humans perceive the physical environment by
acquiring diverse tactile information such as texture and pressure changes, we
further propose to learn unified multi-sensor representations from both static
and dynamic perspectives. By integrating tactile images and videos, we present
AnyTouch, a unified static-dynamic multi-sensor representation learning
framework with a multi-level structure, aimed at both enhancing comprehensive
perceptual abilities and enabling effective cross-sensor transfer. This
multi-level architecture captures pixel-level details from tactile data via
masked modeling and enhances perception and transferability by learning
semantic-level sensor-agnostic features through multi-modal alignment and
cross-sensor matching. We provide a comprehensive analysis of multi-sensor
transferability, and validate our method on various datasets and in the
real-world pouring task. Experimental results show that our method outperforms
existing methods and exhibits outstanding static and dynamic perception
capabilities across various sensors.
comment: Accepted by ICLR 2025
♻ ☆ RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting CVPR 2025
We consider the problem of adding dynamic rain effects to in-the-wild scenes
in a physically-correct manner. Recent advances in scene modeling have made
significant progress, with NeRF and 3DGS techniques emerging as powerful tools
for reconstructing complex scenes. However, while effective for novel view
synthesis, these methods typically struggle with challenging scene editing
tasks, such as physics-based rain simulation. In contrast, traditional
physics-based simulations can generate realistic rain effects, such as
raindrops and splashes, but they often rely on skilled artists to carefully set
up high-fidelity scenes. This process lacks flexibility and scalability,
limiting its applicability to broader, open-world environments. In this work,
we introduce RainyGS, a novel approach that leverages the strengths of both
physics-based modeling and 3DGS to generate photorealistic, dynamic rain
effects in open-world scenes with physical accuracy. At the core of our method
is the integration of physically-based raindrop and shallow water simulation
techniques within the fast 3DGS rendering framework, enabling realistic and
efficient simulations of raindrop behavior, splashes, and reflections. Our
method supports synthesizing rain effects at over 30 fps, offering users
flexible control over rain intensity -- from light drizzles to heavy downpours.
We demonstrate that RainyGS performs effectively for both real-world outdoor
scenes and large-scale driving scenarios, delivering more photorealistic and
physically-accurate rain effects compared to state-of-the-art methods. Project
page can be found at https://pku-vcl-geometry.github.io/RainyGS/
comment: CVPR 2025
♻ ☆ VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer
Crafting magic and illusions is one of the most thrilling aspects of
filmmaking, with visual effects (VFX) serving as the powerhouse behind
unforgettable cinematic experiences. While recent advances in generative
artificial intelligence have driven progress in generic image and video
synthesis, the domain of controllable VFX generation remains relatively
underexplored. In this work, we propose a novel paradigm for animated VFX
generation as image animation, where dynamic effects are generated from
user-friendly textual descriptions and static reference images. Our work makes
two primary contributions: (i) Open-VFX, the first high-quality VFX video
dataset spanning 15 diverse effect categories, annotated with textual
descriptions, instance segmentation masks for spatial conditioning, and
start-end timestamps for temporal control. (ii) VFX Creator, a simple yet
effective controllable VFX generation framework based on a Video Diffusion
Transformer. The model incorporates a spatial and temporal controllable LoRA
adapter, requiring minimal training videos. Specifically, a plug-and-play mask
control module enables instance-level spatial manipulation, while tokenized
start-end motion timestamps embedded in the diffusion process, alongside the
text encoder, allow precise temporal control over effect timing and pace.
Extensive experiments on the Open-VFX test set demonstrate the superiority of
the proposed system in generating realistic and dynamic effects, achieving
state-of-the-art performance and generalization ability in both spatial and
temporal controllability. Furthermore, we introduce a specialized metric to
evaluate the precision of temporal control. By bridging traditional VFX
techniques with generative approaches, VFX Creator unlocks new possibilities
for efficient and high-quality video effect generation, making advanced VFX
accessible to a broader audience.
♻ ☆ GaussianRoom: Improving 3D Gaussian Splatting with SDF Guidance and Monocular Cues for Indoor Scene Reconstruction
Haodong Xiang, Xinghui Li, Kai Cheng, Xiansong Lai, Wanting Zhang, Zhichao Liao, Long Zeng, Xueping Liu
Embodied intelligence requires precise reconstruction and rendering to
simulate large-scale real-world data. Although 3D Gaussian Splatting (3DGS) has
recently demonstrated high-quality results with real-time performance, it still
faces challenges in indoor scenes with large, textureless regions, resulting in
incomplete and noisy reconstructions due to poor point cloud initialization and
underconstrained optimization. Inspired by the continuity of signed distance
field (SDF), which naturally has advantages in modeling surfaces, we propose a
unified optimization framework that integrates neural signed distance fields
(SDFs) with 3DGS for accurate geometry reconstruction and real-time rendering.
This framework incorporates a neural SDF field to guide the densification and
pruning of Gaussians, enabling Gaussians to model scenes accurately even with
poorly initialized point clouds. Simultaneously, the geometry represented by
Gaussians improves the efficiency of the SDF field by piloting its point
sampling. Additionally, we introduce two regularization terms based on normal
and edge priors to resolve geometric ambiguities in textureless areas and
enhance detail accuracy. Extensive experiments in ScanNet and ScanNet++ show
that our method achieves state-of-the-art performance in both surface
reconstruction and novel view synthesis.
♻ ☆ Content-decoupled Contrastive Learning-based Implicit Degradation Modeling for Blind Image Super-Resolution
Implicit degradation modeling-based blind super-resolution (SR) has attracted
increasing attention in the community due to its excellent generalization
to complex degradation scenarios and wide application range. How to extract
more discriminative degradation representations and fully adapt them to
specific image features is the key to this task. In this paper, we propose a
new Content-decoupled Contrastive Learning-based blind image super-resolution
(CdCL) framework following the typical blind SR pipeline. This framework
introduces a negative-free contrastive learning technique for the first time to
model the implicit degradation representation, in which a new cyclic shift
sampling strategy is designed to ensure decoupling between content features and
degradation features from the data perspective, thereby improving the purity
and discriminability of the learned implicit degradation space. In addition, we
propose a detail-aware implicit degradation adapting module that can better
adapt degradation representations to specific LR features by enhancing the
basic adaptation unit's perception of image details, significantly reducing the
overall SR model complexity. Extensive experiments on synthetic and real data
show that our method achieves highly competitive quantitative and qualitative
results in various degradation settings while obviously reducing parameters and
computational costs, validating the feasibility of designing practical and
lightweight blind SR tools.
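One reading of the cyclic shift sampling strategy is that cyclically shifting a degraded image changes the content layout while leaving the degradation statistics intact, yielding a positive pair for negative-free contrastive learning of degradation representations. The sketch below, with an arbitrary shift size, illustrates only this data-side construction and is not the paper's exact procedure.

```python
import torch

def cyclic_shift_views(lr_img: torch.Tensor, shift: int = 8):
    """Build two views that share the same degradation but differ in content
    layout by cyclically shifting the pixels; shifted views can then serve as a
    positive pair for degradation-representation learning without negatives."""
    shifted = torch.roll(lr_img, shifts=(shift, shift), dims=(-2, -1))
    return lr_img, shifted

lr = torch.rand(1, 3, 64, 64)            # a low-resolution input
view_a, view_b = cyclic_shift_views(lr)
print(view_a.shape, view_b.shape)
```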
♻ ☆ Video-T1: Test-Time Scaling for Video Generation
With the scale capability of increasing training data, model size, and
computational cost, video generation has achieved impressive results in digital
creation, enabling users to express creativity across various domains.
Recently, researchers in Large Language Models (LLMs) have expanded the scaling
to test-time, which can significantly improve LLM performance by using more
inference-time computation. Instead of scaling up video foundation models
through expensive training costs, we explore the power of Test-Time Scaling
(TTS) in video generation, aiming to answer the question: if a video generation
model is allowed to use a non-trivial amount of inference-time compute, how much
can it improve generation quality given a challenging text prompt. In this
work, we reinterpret the test-time scaling of video generation as a searching
problem to sample better trajectories from Gaussian noise space to the target
video distribution. Specifically, we build the search space with test-time
verifiers to provide feedback and heuristic algorithms to guide the search
process. Given a text prompt, we first explore an intuitive linear search
strategy by increasing noise candidates at inference time. As full-step
denoising all frames simultaneously requires heavy test-time computation costs,
we further design a more efficient TTS method for video generation called
Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an
autoregressive manner. Extensive experiments on text-conditioned video
generation benchmarks demonstrate that increasing test-time compute
consistently leads to significant improvements in the quality of videos.
Project page: https://liuff19.github.io/Video-T1
comment: Project page: https://liuff19.github.io/Video-T1
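The linear search strategy described above can be pictured as a best-of-N loop over initial noises scored by a test-time verifier. The `generator` and `verifier` interfaces, the noise shape, and the toy stand-ins below are assumptions for illustration; the Tree-of-Frames variant is not sketched.

```python
import torch

def best_of_n_generation(generator, verifier, prompt: str, n_candidates: int = 8):
    """Linear test-time search: sample several initial noises, denoise each into
    a video, and keep the candidate the verifier scores highest.
    `generator(prompt, noise) -> video` and `verifier(prompt, video) -> float`
    are assumed interfaces, not a specific model API."""
    best_video, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn(1, 16, 3, 64, 64)   # (batch, frames, C, H, W); placeholder shape
        video = generator(prompt, noise)
        score = verifier(prompt, video)
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score

# Toy stand-ins so the sketch runs end to end.
toy_generator = lambda prompt, noise: noise.clamp(-1, 1)
toy_verifier = lambda prompt, video: float(video.mean())
video, score = best_of_n_generation(toy_generator, toy_verifier, "a cat surfing")
print(score)
```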
♻ ☆ Vision-Language Models for Acute Tuberculosis Diagnosis: A Multimodal Approach Combining Imaging and Clinical Data
Ananya Ganapthy, Praveen Shastry, Naveen Kumarasami, Anandakumar D, Keerthana R, Mounigasri M, Varshinipriya M, Kishore Prasath Venkatesh, Bargava Subramanian, Kalyan Sivasailam
Background: This study introduces a Vision-Language Model (VLM) leveraging
SIGLIP and Gemma-3b architectures for automated acute tuberculosis (TB)
screening. By integrating chest X-ray images and clinical notes, the model aims
to enhance diagnostic accuracy and efficiency, particularly in resource-limited
settings.
Methods: The VLM combines visual data from chest X-rays with clinical context
to generate detailed, context-aware diagnostic reports. The architecture
employs SIGLIP for visual encoding and Gemma-3b for decoding, ensuring
effective representation of acute TB-specific pathologies and clinical
insights.
Results: Key acute TB pathologies, including consolidation, cavities, and
nodules, were detected with high precision (97%) and recall (96%).
The model demonstrated strong spatial localization capabilities and robustness
in distinguishing TB-positive cases, making it a reliable tool for acute TB
diagnosis.
Conclusion: The multimodal capability of the VLM reduces reliance on
radiologists, providing a scalable solution for acute TB screening. Future work
will focus on improving the detection of subtle pathologies and addressing
dataset biases to enhance its generalizability and application in diverse
global healthcare settings.
comment: 11 pages, 3 figures
♻ ☆ Generalizable Prompt Learning of CLIP: A Brief Overview
Existing vision-language models (VLMs) such as CLIP have showcased an
impressive capability to generalize well across various downstream tasks. These
models leverage the synergy between visual and textual information, enabling
them to understand and reason about the content present in images and text in a
unified manner. This article provides a brief overview of few-shot prompt
learning for CLIP, including experimental data and the technical
characteristics of representative methods. The purpose of this review is to
provide a reference for researchers beginning work on generalizable prompting
of CLIP via few-shot training for classification across 15 datasets, and to
help researchers in other downstream tasks build on and integrate this line of
work.
♻ ☆ ALLVB: All-in-One Long Video Understanding Benchmark AAAI 2025
From image to video understanding, the capabilities of Multi-modal LLMs
(MLLMs) are increasingly powerful. However, most existing video understanding
benchmarks are relatively short, which makes them inadequate for effectively
evaluating the long-sequence modeling capabilities of MLLMs. This highlights
the urgent need for a comprehensive and integrated long video understanding
benchmark to assess the ability of MLLMs thoroughly. To this end, we propose
ALLVB (ALL-in-One Long Video Understanding Benchmark). ALLVB's main
contributions include: 1) It integrates 9 major video understanding tasks.
These tasks are converted into video QA formats, allowing a single benchmark to
evaluate 9 different video understanding capabilities of MLLMs, highlighting
the versatility, comprehensiveness, and challenging nature of ALLVB. 2) A fully
automated annotation pipeline using GPT-4o is designed, requiring only human
quality control, which facilitates the maintenance and expansion of the
benchmark. 3) It contains 1,376 videos across 16 categories, averaging nearly 2
hours each, with a total of 252k QAs. To the best of our knowledge, it is the
largest long video understanding benchmark in terms of the number of videos,
average duration, and number of QAs. We have tested various mainstream MLLMs on
ALLVB, and the results indicate that even the most advanced commercial models
have significant room for improvement. This reflects the benchmark's
challenging nature and demonstrates the substantial potential for development
in long video understanding.
comment: AAAI 2025
♻ ☆ Zero-Shot Visual Concept Blending Without Text Guidance
We propose a novel, zero-shot image generation technique called "Visual
Concept Blending" that provides fine-grained control over which features from
multiple reference images are transferred to a source image. If only a single
reference image is available, it is difficult to isolate which specific
elements should be transferred. However, using multiple reference images, the
proposed approach distinguishes between common and unique features by
selectively incorporating them into a generated output. By operating within a
partially disentangled Contrastive Language-Image Pre-training (CLIP) embedding
space (from IP-Adapter), our method enables the flexible transfer of texture,
shape, motion, style, and more abstract conceptual transformations without
requiring additional training or text prompts. We demonstrate its effectiveness
across a diverse range of tasks, including style transfer, form metamorphosis,
and conceptual transformations, showing how subtle or abstract attributes
(e.g., brushstroke style, aerodynamic lines, and dynamism) can be seamlessly
combined into a new image. In a user study, participants accurately recognized
which features were intended to be transferred. Its simplicity, flexibility,
and high-level control make Visual Concept Blending valuable for creative
fields such as art, design, and content creation, where combining specific
visual qualities from multiple inspirations is crucial.
♻ ☆ Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos
Hanxue Liang, Jiawei Ren, Ashkan Mirzaei, Antonio Torralba, Ziwei Liu, Igor Gilitschenski, Sanja Fidler, Cengiz Oztireli, Huan Ling, Zan Gojcic, Jiahui Huang
Recent advancements in static feed-forward scene reconstruction have
demonstrated significant progress in high-quality novel view synthesis.
However, these models often struggle with generalizability across diverse
environments and fail to effectively handle dynamic content. We present BTimer
(short for BulletTimer), the first motion-aware feed-forward model for
real-time reconstruction and novel view synthesis of dynamic scenes. Our
approach reconstructs the full scene in a 3D Gaussian Splatting representation
at a given target ('bullet') timestamp by aggregating information from all the
context frames. Such a formulation allows BTimer to gain scalability and
generalization by leveraging both static and dynamic scene datasets. Given a
casual monocular dynamic video, BTimer reconstructs a bullet-time scene within
150ms while reaching state-of-the-art performance on both static and dynamic
scene datasets, even compared with optimization-based approaches.
comment: Project website:
https://research.nvidia.com/labs/toronto-ai/bullet-timer/
♻ ☆ Diffusion Models in 3D Vision: A Survey
In recent years, 3D vision has become a crucial field within computer vision,
powering a wide range of applications such as autonomous driving, robotics,
augmented reality, and medical imaging. This field relies on accurate
perception, understanding, and reconstruction of 3D scenes from 2D images or
text data sources. Diffusion models, originally designed for 2D generative
tasks, offer the potential for more flexible, probabilistic methods that can
better capture the variability and uncertainty present in real-world 3D data.
In this paper, we review the state-of-the-art methods that use diffusion models
for 3D visual tasks, including but not limited to 3D object generation, shape
completion, point-cloud reconstruction, and scene construction. We provide an
in-depth discussion of the underlying mathematical principles of diffusion
models, outlining their forward and reverse processes, as well as the various
architectural advancements that enable these models to work with 3D datasets.
We also discuss the key challenges in applying diffusion models to 3D vision,
such as handling occlusions and varying point densities, and the computational
demands of high-dimensional data. Finally, we discuss potential solutions,
including improving computational efficiency, enhancing multimodal fusion, and
exploring the use of large-scale pretraining for better generalization across
3D tasks. This paper serves as a foundation for future exploration and
development in this rapidly evolving field.
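Since the survey reviews the forward and reverse processes, the standard DDPM forward (noising) step is reproduced below on a point cloud as a worked example; the linear beta schedule and tensor shapes are generic choices, not tied to any particular surveyed method.

```python
import torch

def forward_diffuse(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor):
    """Standard DDPM forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    abar_t = alphas_cumprod[t]
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps, eps

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # common linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
points = torch.randn(1, 2048, 3)                   # e.g. a 3D point cloud as x_0
noisy, eps = forward_diffuse(points, t=500, alphas_cumprod=alphas_cumprod)
print(noisy.shape)
```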
♻ ☆ Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis CVPR 2025
Jiangyong Huang, Baoxiong Jia, Yan Wang, Ziyu Zhu, Xiongkun Linghu, Qing Li, Song-Chun Zhu, Siyuan Huang
Existing 3D vision-language (3D-VL) benchmarks fall short in evaluating 3D-VL
models, creating a "mist" that obscures rigorous insights into model
capabilities and 3D-VL tasks. This mist persists due to three key limitations.
First, flawed test data, like ambiguous referential text in the grounding task,
can yield incorrect and unreliable test results. Second, oversimplified metrics
such as simply averaging accuracy per question answering (QA) pair, cannot
reveal true model capability due to their vulnerability to language variations.
Third, existing benchmarks isolate the grounding and QA tasks, disregarding the
underlying coherence that QA should be based on solid grounding capabilities.
To unveil the "mist", we propose Beacon3D, a benchmark for 3D-VL grounding and
QA tasks, delivering a perspective shift in the evaluation of 3D-VL
understanding. Beacon3D features (i) high-quality test data with precise and
natural language, (ii) object-centric evaluation with multiple tests per object
to ensure robustness, and (iii) a novel chain-of-analysis paradigm to address
language robustness and model performance coherence across grounding and QA.
Our evaluation of state-of-the-art 3D-VL models on Beacon3D reveals that (i)
object-centric evaluation elicits true model performance and particularly weak
generalization in QA; (ii) grounding-QA coherence remains fragile in current
3D-VL models; and (iii) incorporating large language models (LLMs) into 3D-VL
models, though as a prevalent practice, hinders grounding capabilities and has
yet to elevate QA capabilities. We hope Beacon3D and our comprehensive analysis
could benefit the 3D-VL community towards faithful developments.
comment: CVPR 2025. Project page: https://beacon-3d.github.io
♻ ☆ MagicPose4D: Crafting Articulated Models with Appearance and Motion Control
With the success of 2D and 3D visual generative models, there is growing
interest in generating 4D content. Existing methods primarily rely on text
prompts to produce 4D content, but they often fall short of accurately defining
complex or rare motions. To address this limitation, we propose MagicPose4D, a
novel framework for refined control over both appearance and motion in 4D
generation. Unlike current 4D generation methods, MagicPose4D accepts monocular
videos or mesh sequences as motion prompts, enabling precise and customizable
motion control. MagicPose4D comprises two key modules: (i) Dual-Phase 4D
Reconstruction Module, which operates in two phases. The first phase focuses on
capturing the model's shape using accurate 2D supervision and less accurate but
geometrically informative 3D pseudo-supervision without imposing skeleton
constraints. The second phase extracts the 3D motion (skeleton poses) using
more accurate pseudo-3D supervision obtained in the first phase, and introduces
kinematic chain-based skeleton constraints to ensure physical plausibility.
Additionally, we propose a Global-local Chamfer loss that aligns the overall
distribution of predicted mesh vertices with the supervision while maintaining
part-level alignment without extra annotations. (ii) Cross-category Motion
Transfer Module, which leverages the extracted motion from the 4D
reconstruction module and uses a kinematic-chain-based skeleton to achieve
cross-category motion transfer. It ensures smooth transitions between frames
through dynamic rigidity, facilitating robust generalization without additional
training. Through extensive experiments, we demonstrate that MagicPose4D
significantly improves the accuracy and consistency of 4D content generation,
outperforming existing methods in various benchmarks.
comment: Project Page: https://magicpose4d.github.io/
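For reference, the global term of a Chamfer-style loss between predicted and supervision point sets looks like the sketch below; the paper's Global-local variant additionally enforces part-level alignment, which is not reproduced here.

```python
import torch

def chamfer_distance(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between two point sets of shape (N, 3) and (M, 3).
    This is only the global term; a global-local variant would add part-level
    alignment on top of it."""
    d = torch.cdist(pred, target)                  # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

pred_vertices = torch.rand(1024, 3)
target_vertices = torch.rand(2048, 3)
print(chamfer_distance(pred_vertices, target_vertices))
```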
♻ ☆ Phase-shifted remote photoplethysmography for estimating heart rate and blood pressure from facial video
Human health can be critically affected by cardiovascular diseases, such as
hypertension, arrhythmias, and stroke. Heart rate and blood pressure are
important biometric information for the monitoring of cardiovascular system and
early diagnosis of cardiovascular diseases. Existing methods for estimating the
heart rate are based on electrocardiography and photoplethysmography, which
require contacting the sensor to the skin surface. Moreover, catheter and
cuff-based methods for measuring blood pressure cause inconvenience and have
limited applicability. Therefore, this thesis proposes a vision-based method
for estimating heart rate and blood pressure, built as a two-stage deep
learning framework consisting of a dual remote
photoplethysmography network (DRP-Net) and bounded blood pressure network
(BBP-Net). In the first stage, DRP-Net infers remote photoplethysmography
(rPPG) signals for the acral and facial regions, and these phase-shifted rPPG
signals are utilized to estimate the heart rate. In the second stage, BBP-Net
integrates temporal features and analyzes phase discrepancy between the acral
and facial rPPG signals to estimate systolic blood pressure (SBP) and diastolic
blood pressure (DBP) values. To improve the accuracy
of estimating the heart rate, we employed a data augmentation method based on a
frame interpolation model. Moreover, we designed BBP-Net to infer blood
pressure within a predefined range by incorporating a scaled sigmoid function.
Our method resulted in estimating the heart rate with the mean absolute error
(MAE) of 1.78 BPM, reducing the MAE by 34.31% compared to a recent method, on
the MMSE-HR dataset. The MAEs for SBP and DBP were 10.19 mmHg and 7.09 mmHg. On the
V4V dataset, the MAE for the heart rate, SBP, and DBP were 3.83 BPM, 13.64
mmHg, and 9.4 mmHg, respectively.
comment: 13 pages, 10 figures
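The "scaled sigmoid" used by BBP-Net to keep blood-pressure estimates within a predefined range can be sketched as follows; the input dimension and the [80, 200] mmHg bounds are illustrative assumptions, not the thesis' exact settings.

```python
import torch
import torch.nn as nn

class BoundedRegressionHead(nn.Module):
    """Map an unbounded feature to a value inside [low, high] with a scaled
    sigmoid, a simple way to keep blood-pressure estimates in a physiological
    range."""
    def __init__(self, in_dim: int, low: float, high: float):
        super().__init__()
        self.fc = nn.Linear(in_dim, 1)
        self.low, self.high = low, high

    def forward(self, x):
        return self.low + (self.high - self.low) * torch.sigmoid(self.fc(x))

sbp_head = BoundedRegressionHead(in_dim=128, low=80.0, high=200.0)   # mmHg bounds
print(sbp_head(torch.randn(4, 128)).squeeze(-1))
```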
♻ ☆ VRM: Knowledge Distillation via Virtual Relation Matching
Knowledge distillation (KD) aims to transfer the knowledge of a more capable
yet cumbersome teacher model to a lightweight student model. In recent years,
relation-based KD methods have fallen behind, as their instance-matching
counterparts dominate in performance. In this paper, we revive relational KD by
identifying and tackling several key issues in relation-based methods,
including their susceptibility to overfitting and spurious responses.
Specifically, we transfer newly constructed affinity graphs that compactly
encapsulate a wealth of beneficial inter-sample, inter-class, and inter-view
correlations by exploiting virtual views and relations as a new kind of
knowledge. As a result, the student has access to richer guidance signals and
stronger regularisation throughout the distillation process. To further
mitigate the adverse impact of spurious responses, we prune the affinity graphs
by dynamically detaching redundant and unreliable edges. Extensive experiments
on CIFAR-100 and ImageNet datasets demonstrate the superior performance of the
proposed virtual relation matching (VRM) method over a range of models,
architectures, and set-ups. For instance, VRM for the first time hits 74.0%
accuracy for ResNet50-to-MobileNetV2 distillation on ImageNet, and improves
DeiT-T by 14.44% on CIFAR-100 with a ResNet56 teacher. Thorough analyses are
also conducted to gauge the soundness, properties, and complexity of our
designs. Code and models will be released.
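As a baseline picture of relation-based KD, the sketch below matches the student's inter-sample affinity (cosine-similarity) matrix to the teacher's; VRM's virtual views, inter-class and inter-view relations, and dynamic edge pruning go beyond this sketch.

```python
import torch
import torch.nn.functional as F

def affinity(features: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity affinity matrix over a batch of feature vectors (B, D)."""
    z = F.normalize(features, dim=-1)
    return z @ z.t()

def relation_matching_loss(student_feat: torch.Tensor,
                           teacher_feat: torch.Tensor) -> torch.Tensor:
    """Match the student's inter-sample affinity graph to the teacher's.
    A basic relational-KD term; graph construction with virtual views and
    pruning of unreliable edges is omitted here."""
    return F.mse_loss(affinity(student_feat), affinity(teacher_feat))

student = torch.randn(32, 128)   # student penultimate features
teacher = torch.randn(32, 512)   # teacher penultimate features
print(relation_matching_loss(student, teacher))
```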
♻ ☆ Controllable Human Image Generation with Personalized Multi-Garments CVPR 2025
We present BootComp, a novel framework based on text-to-image diffusion
models for controllable human image generation with multiple reference
garments. Here, the main bottleneck is data acquisition for training:
collecting a large-scale dataset of high-quality reference garment images per
human subject is quite challenging, i.e., ideally, one needs to manually gather
every single garment photograph worn by each human. To address this, we propose
a data generation pipeline to construct a large synthetic dataset, consisting
of human and multiple-garment pairs, by introducing a model to extract any
reference garment images from each human image. To ensure data quality, we also
propose a filtering strategy to remove undesirable generated data based on
measuring perceptual similarities between the garment presented in the human
image and the extracted garment. Finally, by utilizing the constructed synthetic dataset,
we train a diffusion model having two parallel denoising paths that use
multiple garment images as conditions to generate human images while preserving
their fine-grained details. We further show the wide-applicability of our
framework by adapting it to different types of reference-based generation in
the fashion domain, including virtual try-on, and controllable human image
generation with other conditions, e.g., pose, face, etc.
comment: CVPR 2025. Project page: https://omnious.github.io/BootComp
♻ ☆ VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Visual understanding is inherently intention-driven - humans selectively
focus on different regions of a scene based on their goals. Recent advances in
large multimodal models (LMMs) enable flexible expression of such intentions
through natural language, allowing queries to guide visual reasoning processes.
Frameworks like Visual Chain-of-Thought have demonstrated the benefit of
incorporating explicit reasoning steps, where the model predicts a focus region
before answering a query. However, existing approaches rely heavily on
supervised training with annotated intermediate bounding boxes, which severely
limits scalability due to the combinatorial explosion of intention-region
pairs. To overcome this limitation, we propose VisRL, the first framework that
applies reinforcement learning (RL) to the problem of intention-driven visual
perception. VisRL optimizes the entire visual reasoning process using only
reward signals. By treating intermediate focus selection as an internal
decision optimized through trial-and-error, our method eliminates the need for
costly region annotations while aligning more closely with how humans learn to
perceive the world. Extensive experiments across multiple benchmarks show that
VisRL consistently outperforms strong baselines, demonstrating both its
effectiveness and its strong generalization across different LMMs. Our code is
available at https://github.com/zhangquanchen/VisRL.
comment: 18 pages, 11 figures
♻ ☆ Diffusion State-Guided Projected Gradient for Inverse Problems ICLR 2025
Recent advancements in diffusion models have been effective in learning data
priors for solving inverse problems. They leverage diffusion sampling steps for
inducing a data prior while using a measurement guidance gradient at each step
to impose data consistency. For general inverse problems, approximations are
needed when an unconditionally trained diffusion model is used since the
measurement likelihood is intractable, leading to inaccurate posterior
sampling. In other words, due to their approximations, these methods fail to
preserve the generation process on the data manifold defined by the diffusion
prior, leading to artifacts in applications such as image restoration. To
enhance the performance and robustness of diffusion models in solving inverse
problems, we propose Diffusion State-Guided Projected Gradient (DiffStateGrad),
which projects the measurement gradient onto a subspace that is a low-rank
approximation of an intermediate state of the diffusion process. DiffStateGrad,
as a module, can be added to a wide range of diffusion-based inverse solvers to
improve the preservation of the diffusion process on the prior manifold and
filter out artifact-inducing components. We highlight that DiffStateGrad
improves the robustness of diffusion models in terms of the choice of
measurement guidance step size and noise while improving the worst-case
performance. Finally, we demonstrate that DiffStateGrad improves upon the
state-of-the-art on linear and nonlinear image restoration inverse problems.
Our code is available at https://github.com/Anima-Lab/DiffStateGrad.
comment: Published as a conference paper at ICLR 2025. RZ and BT have equal
contributions
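A rough reading of the projection step: take a low-rank SVD of an intermediate diffusion state and project the measurement-guidance gradient onto the corresponding row and column subspaces. The single-channel matrix treatment and the rank below are assumptions for illustration; consult the linked repository for the exact algorithm.

```python
import torch

def project_gradient(grad: torch.Tensor, state: torch.Tensor, rank: int = 16) -> torch.Tensor:
    """Project a measurement-guidance gradient onto a low-rank subspace derived
    from an intermediate diffusion state. Both tensors are treated as (H, W)
    matrices here; this is only one plausible reading of the idea."""
    u, s, vh = torch.linalg.svd(state, full_matrices=False)
    u_r, vh_r = u[:, :rank], vh[:rank, :]
    # Project onto the column space (left) and row space (right) of the state.
    return u_r @ (u_r.t() @ grad @ vh_r.t()) @ vh_r

state = torch.randn(64, 64)     # intermediate diffusion state (single channel)
grad = torch.randn(64, 64)      # measurement-guidance gradient
print(project_gradient(grad, state).shape)
```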
♻ ☆ Retrieval-augmented Few-shot Medical Image Segmentation with Foundation Models
Medical image segmentation is crucial for clinical decision-making, but the
scarcity of annotated data presents significant challenges. Few-shot
segmentation (FSS) methods show promise but often require training on the
target domain and struggle to generalize across different modalities.
Similarly, adapting foundation models like the Segment Anything Model (SAM) for
medical imaging has limitations, including the need for finetuning and
domain-specific adaptation. To address these issues, we propose a novel method
that adapts DINOv2 and Segment Anything Model 2 (SAM 2) for retrieval-augmented
few-shot medical image segmentation. Our approach uses DINOv2's features as
queries to retrieve similar samples from limited annotated data, which are then
encoded as memories and stored in a memory bank. With the memory attention
mechanism of SAM 2, the model leverages these memories as conditions to
generate accurate segmentation of the target image. We evaluated our framework
on three medical image segmentation tasks, demonstrating superior performance
and generalizability across various modalities without the need for any
retraining or finetuning. Overall, this method offers a practical and effective
solution for few-shot medical image segmentation and holds significant
potential as a valuable annotation tool in clinical applications.
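The retrieval step described above reduces, in its simplest form, to a cosine-similarity top-k lookup over global image features; the feature dimension and support-set size below are placeholders, and feeding the retrieved supports into SAM 2's memory bank is not shown.

```python
import torch
import torch.nn.functional as F

def retrieve_supports(query_feat: torch.Tensor, support_feats: torch.Tensor, k: int = 3):
    """Retrieve the k annotated support samples most similar to the query image,
    using cosine similarity over global image features (e.g. DINOv2 CLS tokens)."""
    q = F.normalize(query_feat, dim=-1)            # (D,)
    s = F.normalize(support_feats, dim=-1)         # (N, D)
    scores = s @ q                                 # cosine similarity per support
    return scores.topk(k).indices

query = torch.randn(768)                           # query image feature
supports = torch.randn(50, 768)                    # features of 50 annotated samples
print(retrieve_supports(query, supports, k=3))
```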
♻ ☆ VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Videos, with their unique temporal dimension, demand precise grounded
understanding, where answers are directly linked to visual, interpretable
evidence. Despite significant breakthroughs in reasoning capabilities within
Large Language Models, multi-modal reasoning - especially for videos - remains
unexplored. In this work, we introduce VideoMind, a novel video-language agent
designed for temporal-grounded video understanding. VideoMind incorporates two
key innovations: (i) We identify essential capabilities for video temporal
reasoning and develop a role-based agentic workflow, including a planner for
coordinating different roles, a grounder for temporal localization, a verifier
to assess temporal interval accuracy, and an answerer for question-answering.
(ii) To efficiently integrate these diverse roles, we propose a novel
Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA
adaptors while avoiding the overhead of multiple models, thus balancing
efficiency and flexibility. Extensive experiments on 14 public benchmarks,
including 3 on grounded video question-answering (Grounded VideoQA), 6 on video
temporal grounding (VTG), and 5 on general video question-answering (VideoQA),
verify that our agent achieves state-of-the-art performance on diverse video
understanding tasks, underscoring its effectiveness in advancing video agent
and long-form temporal reasoning.
comment: Project Page: https://videomind.github.io/
♻ ☆ Without Paired Labeled Data: An End-to-End Self-Supervised Paradigm for UAV-View Geo-Localization
UAV-View Geo-Localization (UVGL) aims to achieve accurate localization of
unmanned aerial vehicles (UAVs) by retrieving the most relevant GPS-tagged
satellite images. However, existing methods heavily rely on pre-paired
UAV-satellite images for supervised learning. Such dependency not only incurs
high annotation costs but also severely limits scalability and practical
deployment in open-world UVGL scenarios. To address these limitations, we
propose an end-to-end self-supervised UVGL method. Our method leverages a
shallow backbone network to extract initial features, employs clustering to
generate pseudo labels, and adopts a dual-path contrastive learning
architecture to learn discriminative intra-view representations. Furthermore,
our method incorporates two core modules, the dynamic hierarchical memory
learning module and the information consistency evolution learning module. The
dynamic hierarchical memory learning module combines short-term and long-term
memory to enhance intra-view feature consistency and discriminability.
Meanwhile, the information consistency evolution learning module leverages a
neighborhood-driven dynamic constraint mechanism to systematically capture
implicit cross-view semantic correlations, thereby improving cross-view feature
alignment. To further stabilize and strengthen the self-supervised training
process, a pseudo-label enhancement strategy is introduced, which refines the
quality of pseudo supervision. Our method ultimately constructs a unified
cross-view feature representation space under self-supervised settings.
Extensive experiments on three public benchmark datasets demonstrate that the
proposed method consistently outperforms existing self-supervised methods and
even surpasses several state-of-the-art supervised methods. Our code is
available at https://github.com/ISChenawei/DMNIL.
♻ ☆ HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
Boyuan Wang, Xiaofeng Wang, Chaojun Ni, Guosheng Zhao, Zhiqin Yang, Zheng Zhu, Muyang Zhang, Yukun Zhou, Xinze Chen, Guan Huang, Lihong Liu, Xingang Wang
Human-motion video generation has been a challenging task, primarily due to
the difficulty inherent in learning human body movements. While some approaches
have attempted to drive human-centric video generation explicitly through pose
control, these methods typically rely on poses derived from existing videos,
thereby lacking flexibility. To address this, we propose HumanDreamer, a
decoupled human video generation framework that first generates diverse poses
from text prompts and then leverages these poses to generate human-motion
videos. Specifically, we propose MotionVid, the largest dataset for
human-motion pose generation. Based on the dataset, we present MotionDiT, which
is trained to generate structured human-motion poses from text prompts.
In addition, a novel LAMA loss is introduced; together, these components
improve FID by 62.4% and R-precision for top-1, top-2, and top-3 by 41.8%,
26.3%, and 18.3%, respectively, advancing both Text-to-Pose control accuracy
and FID metrics. Our
experiments across various Pose-to-Video baselines demonstrate that the poses
generated by our method can produce diverse and high-quality human-motion
videos. Furthermore, our model can facilitate other downstream tasks, such as
pose sequence prediction and 2D-3D motion lifting.
comment: Project Page: https://humandreamer.github.io
♻ ☆ Data-Free Group-Wise Fully Quantized Winograd Convolution via Learnable Scales CVPR 2025
Despite the revolutionary breakthroughs of large-scale text-to-image
diffusion models for complex vision and downstream tasks, their extremely high
computational and storage costs limit their usability. Quantization of
diffusion models has been explored in recent works to reduce compute costs and
memory bandwidth usage. To further improve inference time, fast convolution
algorithms such as Winograd can be used for convolution layers, which account
for a significant portion of computations in diffusion models. However, the
significant quality loss of fully quantized Winograd using existing
coarser-grained post-training quantization methods, combined with the
complexity and cost of finetuning the Winograd transformation matrices for such
large models to recover quality, makes them unsuitable for large-scale
foundation models. Motivated by the presence of a large range of values in
them, we investigate the impact of finer-grained group-wise quantization in
quantizing diffusion models. While group-wise quantization can largely handle
the fully quantized Winograd convolution, it struggles to deal with the large
distribution imbalance in a sizable portion of the Winograd domain computation.
To reduce range differences in the Winograd domain, we propose finetuning only
the scale parameters of the Winograd transform matrices without using any
domain-specific training data. Because our method does not depend on any
training data, the generalization performance of quantized diffusion models is
safely guaranteed. For text-to-image generation task, the 8-bit fully-quantized
diffusion model with Winograd provides near-lossless quality (FID and CLIP
scores) in comparison to the full-precision model. For image classification,
our method outperforms the state-of-the-art Winograd PTQ method by 1.62% and
2.56% in top-1 ImageNet accuracy on ResNet18 and ResNet-34, respectively, with
Winograd F(6, 3).
comment: Accepted by CVPR 2025
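To illustrate why finer-grained quantization helps with a large dynamic range, here is a generic group-wise symmetric fake-quantization routine in which each group gets its own scale; learning the scales of the Winograd transform matrices themselves, the paper's data-free contribution, is not attempted here.

```python
import torch

def groupwise_fake_quant(w: torch.Tensor, group_size: int = 64, n_bits: int = 8) -> torch.Tensor:
    """Group-wise symmetric fake quantization: each group of values gets its own
    scale, so a single scale never has to cover the tensor's full dynamic range."""
    qmax = 2 ** (n_bits - 1) - 1
    flat = w.flatten()
    pad = (-flat.numel()) % group_size
    flat = torch.cat([flat, flat.new_zeros(pad)])          # pad to a whole number of groups
    groups = flat.view(-1, group_size)
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = (groups / scale).round().clamp(-qmax - 1, qmax)    # quantize
    deq = (q * scale).flatten()[: w.numel()].view_as(w)    # dequantize back
    return deq

w = torch.randn(128, 128)
print((groupwise_fake_quant(w) - w).abs().max())           # quantization error
```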
♻ ☆ Visual Acoustic Fields
Yuelei Li, Hyunjin Kim, Fangneng Zhan, Ri-Zhao Qiu, Mazeyu Ji, Xiaojun Shan, Xueyan Zou, Paul Liang, Hanspeter Pfister, Xiaolong Wang
Objects produce different sounds when hit, and humans can intuitively infer
how an object might sound based on its appearance and material properties.
Inspired by this intuition, we propose Visual Acoustic Fields, a framework that
bridges hitting sounds and visual signals within a 3D space using 3D Gaussian
Splatting (3DGS). Our approach features two key modules: sound generation and
sound localization. The sound generation module leverages a conditional
diffusion model, which takes multiscale features rendered from a
feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the
sound localization module enables querying the 3D scene, represented by the
feature-augmented 3DGS, to localize hitting positions based on the sound
sources. To support this framework, we introduce a novel pipeline for
collecting scene-level visual-sound sample pairs, achieving alignment between
captured images, impact locations, and corresponding sounds. To the best of our
knowledge, this is the first dataset to connect visual and acoustic signals in
a 3D context. Extensive experiments on our dataset demonstrate the
effectiveness of Visual Acoustic Fields in generating plausible impact sounds
and accurately localizing impact sources. Our project page is at
https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/.
♻ ☆ Astrea: A MOE-based Visual Understanding Model with Progressive Alignment
Xiaoda Yang, JunYu Lu, Hongshun Qiu, Sijing Li, Hao Li, Shengpeng Ji, Xudong Tang, Jiayang Xu, Jiaqi Duan, Ziyue Jiang, Cong Lin, Sihang Cai, Zejian Xie, Zhuoyang Song, Songxin Zhang
Vision-Language Models (VLMs) based on Mixture-of-Experts (MoE) architectures
have emerged as a pivotal paradigm in multimodal understanding, offering a
powerful framework for integrating visual and linguistic information. However,
the increasing complexity and diversity of tasks present significant challenges
in coordinating load balancing across heterogeneous visual experts, where
optimizing one specialist's performance often compromises others' capabilities.
To address task heterogeneity and expert load imbalance, we propose Astrea, a
novel multi-expert collaborative VLM architecture based on progressive
pre-alignment. Astrea introduces three key innovations: 1) A heterogeneous
expert coordination mechanism that integrates four specialized models
(detection, segmentation, classification, captioning) into a comprehensive
expert matrix covering essential visual comprehension elements; 2) A dynamic
knowledge fusion strategy featuring progressive pre-alignment to harmonize
experts within the VLM latent space through contrastive learning, complemented
by probabilistically activated stochastic residual connections to preserve
knowledge continuity; 3) An enhanced optimization framework utilizing momentum
contrastive learning for long-range dependency modeling and adaptive weight
allocators for real-time expert contribution calibration. Extensive evaluations
across 12 benchmark tasks spanning VQA, image captioning, and cross-modal
retrieval demonstrate Astrea's superiority over state-of-the-art models,
achieving an average performance gain of +4.7%. This study provides the first
empirical demonstration that progressive pre-alignment strategies enable VLMs
to overcome task heterogeneity limitations, establishing new methodological
foundations for developing general-purpose multimodal agents.
♻ ☆ 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models CVPR 2025
Wanhua Li, Renping Zhou, Jiawei Zhou, Yingwei Song, Johannes Herter, Minghan Qin, Gao Huang, Hanspeter Pfister
Learning 4D language fields to enable time-sensitive, open-ended language
queries in dynamic scenes is essential for many real-world applications. While
LangSplat successfully grounds CLIP features into 3D Gaussian representations,
achieving precision and efficiency in 3D static scenes, it lacks the ability to
handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot
capture temporal dynamics in videos. Real-world environments are inherently
dynamic, with object semantics evolving over time. Building a precise 4D
language field necessitates obtaining pixel-aligned, object-wise video
features, which current vision models struggle to achieve. To address these
challenges, we propose 4D LangSplat, which learns 4D language fields to handle
time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes
efficiently. 4D LangSplat bypasses learning the language field from vision
features and instead learns directly from text generated from object-wise video
captions via Multimodal Large Language Models (MLLMs). Specifically, we propose
a multimodal object-wise video prompting method, consisting of visual and text
prompts that guide MLLMs to generate detailed, temporally consistent,
high-quality captions for objects throughout a video. These captions are
encoded using a Large Language Model into high-quality sentence embeddings,
which then serve as pixel-aligned, object-specific feature supervision,
facilitating open-vocabulary text queries through shared embedding spaces.
Recognizing that objects in 4D scenes exhibit smooth transitions across states,
we further propose a status deformable network to model these continuous
changes over time effectively. Our results across multiple benchmarks
demonstrate that 4D LangSplat attains precise and efficient results for both
time-sensitive and time-agnostic open-vocabulary queries.
comment: CVPR 2025. Project Page: https://4d-langsplat.github.io
♻ ☆ Learned Image Compression and Restoration for Digital Pathology
SeonYeong Lee, EonSeung Seong, DongEon Lee, SiYeoul Lee, Yubin Cho, Chunsu Park, Seonho Kim, MinKyung Seo, YoungSin Ko, MinWoo Kim
Digital pathology images play a crucial role in medical diagnostics, but
their ultra-high resolution and large file sizes pose significant challenges
for storage, transmission, and real-time visualization. To address these
issues, we propose CLERIC, a novel deep learning-based image compression
framework designed specifically for whole slide images (WSIs). CLERIC
integrates a learnable lifting scheme and advanced convolutional techniques to
enhance compression efficiency while preserving critical pathological details.
Our framework employs a lifting-scheme transform in the analysis stage to
decompose images into low- and high-frequency components, enabling more
structured latent representations. These components are processed through
parallel encoders incorporating Deformable Residual Blocks (DRB) and Recurrent
Residual Blocks (R2B) to improve feature extraction and spatial adaptability.
The synthesis stage applies an inverse lifting transform for effective image
reconstruction, ensuring high-fidelity restoration of fine-grained tissue
structures. We evaluate CLERIC on a digital pathology image dataset and compare
its performance against state-of-the-art learned image compression (LIC)
models. Experimental results demonstrate that CLERIC achieves superior
rate-distortion (RD) performance, significantly reducing storage requirements
while maintaining high diagnostic image quality. Our study highlights the
potential of deep learning-based compression in digital pathology, facilitating
efficient data management and long-term storage while ensuring seamless
integration into clinical workflows and AI-assisted diagnostic systems. Code
and models are available at: https://github.com/pnu-amilab/CLERIC.
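For readers unfamiliar with lifting schemes, a fixed 1D Haar lifting step (split, predict, update) is sketched below together with its exact inverse; CLERIC uses a learnable 2D lifting transform, so this is only a structural illustration.

```python
import torch

def haar_lifting_1d(x: torch.Tensor):
    """One level of the Haar wavelet via the lifting scheme along the last axis:
    split into even/odd samples, predict the odd from the even (detail = high
    frequency), then update the even with the detail (approximation = low
    frequency)."""
    even, odd = x[..., 0::2], x[..., 1::2]
    detail = odd - even                 # predict step
    approx = even + 0.5 * detail        # update step
    return approx, detail

def haar_inverse_1d(approx: torch.Tensor, detail: torch.Tensor) -> torch.Tensor:
    even = approx - 0.5 * detail
    odd = detail + even
    out = torch.stack([even, odd], dim=-1)   # re-interleave even/odd samples
    return out.flatten(start_dim=-2)

signal = torch.randn(1, 3, 256)              # e.g. one image row per channel
low, high = haar_lifting_1d(signal)
recon = haar_inverse_1d(low, high)
print(torch.allclose(recon, signal, atol=1e-6))   # exact reconstruction
```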
♻ ☆ TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes
This paper explores the task of Complex Visual Text Generation (CVTG), which
centers on generating intricate textual content distributed across diverse
regions within visual images. In CVTG, image generation models often render
distorted or blurred visual text, or omit parts of it entirely. To tackle these
challenges, we propose TextCrafter, a novel multi-visual text rendering method.
TextCrafter employs a progressive strategy to decompose complex visual text
into distinct components while ensuring robust alignment between textual
content and its visual carrier. Additionally, it incorporates a token focus
enhancement mechanism to amplify the prominence of visual text during the
generation process. TextCrafter effectively addresses key challenges in CVTG
tasks, such as text confusion, omissions, and blurriness. Moreover, we present
a new benchmark dataset, CVTG-2K, tailored to rigorously evaluate the
performance of generative models on CVTG tasks. Extensive experiments
demonstrate that our method surpasses state-of-the-art approaches.
♻ ☆ EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction
Leveraging its robust linear global modeling capability, Mamba has notably
excelled in computer vision. Despite its success, existing Mamba-based vision
models have overlooked the nuances of event-driven tasks, especially in video
reconstruction. Event-based video reconstruction (EBVR) demands spatial
translation invariance and close attention to local event relationships in the
spatio-temporal domain. Unfortunately, conventional Mamba algorithms apply
static window partitions and standard reshape scanning methods, leading to
significant losses in local connectivity. To overcome these limitations, we
introduce EventMamba--a specialized model designed for EBVR tasks. EventMamba
innovates by incorporating random window offset (RWO) in the spatial domain,
moving away from the restrictive fixed partitioning. Additionally, it features
a new consistent traversal serialization approach in the spatio-temporal
domain, which maintains the proximity of adjacent events both spatially and
temporally. These enhancements enable EventMamba to retain Mamba's robust
modeling capabilities while significantly preserving the spatio-temporal
locality of event data. Comprehensive testing on multiple datasets shows that
EventMamba markedly enhances video reconstruction, drastically improving
computation speed while delivering superior visual quality compared to
Transformer-based methods.
♻ ☆ Where am I? Cross-View Geo-localization with Natural Language Descriptions
Cross-view geo-localization identifies the locations of street-view images by
matching them with geo-tagged satellite images or OSM. However, most existing
studies focus on image-to-image retrieval, with fewer addressing text-guided
retrieval, a task vital for applications like pedestrian navigation and
emergency response. In this work, we introduce a novel task for cross-view
geo-localization with natural language descriptions, which aims to retrieve
the corresponding satellite images or OSM data based on scene text
descriptions. To support this task, we construct the CVG-Text dataset by
collecting cross-view data from multiple cities and employing a scene text
generation approach that leverages the annotation capabilities of Large
Multimodal Models to produce high-quality scene text descriptions with
localization details. Additionally, we propose a novel text-based retrieval
localization method, CrossText2Loc, which improves recall by 10% and
demonstrates excellent long-text retrieval capabilities. In terms of
explainability, it not only provides similarity scores but also offers
retrieval reasons. More information can be found at
https://yejy53.github.io/CVG-Text/ .
comment: 11 pages, 6 figures
♻ ☆ Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization NeurIPS 2024
Although Large Visual Language Models (LVLMs) have demonstrated exceptional
abilities in understanding multimodal data, they invariably suffer from
hallucinations, leading to a disconnect between the generated text and the
corresponding images. Almost all current visual contrastive decoding methods
attempt to mitigate these hallucinations by introducing visual uncertainty
information that appropriately widens the contrastive logits gap between
hallucinatory and targeted ones. However, due to the uncontrollable nature of
global visual uncertainty, they struggle to precisely induce the hallucinatory
tokens, which severely limits their effectiveness in mitigating hallucinations
and may even lead to the generation of undesired hallucinations. To tackle this
issue, we conduct a theoretical analysis to improve the effectiveness of
contrastive decoding. Building on this insight, we introduce a novel optimization
strategy named Hallucination-Induced Optimization (HIO). This strategy seeks to
amplify the contrast between hallucinatory and targeted tokens by relying on a
fine-tuned theoretical preference model (i.e., Contrary Bradley-Terry Model),
thereby facilitating efficient contrast decoding to alleviate hallucinations in
LVLMs. Extensive experimental research demonstrates that our HIO strategy can
effectively reduce hallucinations in LVLMs, outperforming state-of-the-art
methods across various benchmarks.
comment: Accepted by NeurIPS 2024
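The contrastive-decoding step that HIO builds on can be pictured with the sketch below, which amplifies the gap between standard next-token logits and logits obtained under a hallucination-prone condition. The alpha value and the way the contrast logits are produced are assumptions; HIO itself derives the contrast from a fine-tuned preference model.

# A minimal sketch of generic visual contrastive decoding; not HIO itself.
import torch
import torch.nn.functional as F

def contrastive_next_token(logits_std: torch.Tensor,
                           logits_contrast: torch.Tensor,
                           alpha: float = 1.0) -> torch.Tensor:
    """Both inputs: (vocab,) logits for the next token; returns adjusted log-probs."""
    adjusted = (1.0 + alpha) * logits_std - alpha * logits_contrast
    return F.log_softmax(adjusted, dim=-1)

vocab = 10
logits_std = torch.zeros(vocab); logits_std[3] = 1.0             # token 3 slightly preferred
logits_contrast = logits_std.clone(); logits_contrast[3] = 4.0   # strongly boosted under the contrast condition
print(torch.argmax(logits_std).item())                                            # 3
print(torch.argmax(contrastive_next_token(logits_std, logits_contrast)).item())   # no longer 3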
♻ ☆ On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
We present On-device Sora, the first model training-free solution for
diffusion-based on-device text-to-video generation that operates efficiently on
smartphone-grade devices. To address the challenges of diffusion-based
text-to-video generation on computation- and memory-limited mobile devices, the
proposed On-device Sora applies three novel techniques to pre-trained video
generative models. First, Linear Proportional Leap (LPL) reduces the excessive
denoising steps required in video diffusion through an efficient leap-based
approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive
token-processing computation in attention layers by merging consecutive tokens
along the temporal dimension. Third, Concurrent Inference with Dynamic Loading
(CI-DL) dynamically partitions large models into smaller blocks and loads them
into memory for concurrent model inference, effectively addressing the
challenges of limited device memory. We implement On-device Sora on the iPhone
15 Pro, and the experimental evaluations show that it is capable of generating
high-quality videos on the device, comparable to those produced by high-end
GPUs. These results show that On-device Sora enables efficient and high-quality
video generation on resource-constrained mobile devices. We envision the
proposed On-device Sora as a significant first step toward democratizing
state-of-the-art generative technologies, enabling video generation on
commodity mobile and embedded devices without resource-intensive re-training
for model optimization (compression). The code implementation is available at a
GitHub repository(https://github.com/eai-lab/On-device-Sora).
comment: Replicated Submission. arXiv:2502.04363 submitted as second version
of the paper
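A minimal sketch of temporal token merging in the spirit of TDTM is shown below: consecutive tokens along the temporal axis are averaged in pairs before the attention layers, halving the temporal token count. The exact placement and un-merging used by On-device Sora are not reproduced.

# A minimal sketch of pairwise temporal token merging; assumptions noted above.
import torch

def merge_temporal_tokens(x: torch.Tensor) -> torch.Tensor:
    """x: (B, T, N, D) video tokens -> (B, T//2, N, D) by averaging adjacent frames."""
    B, T, N, D = x.shape
    assert T % 2 == 0, "pad T to an even length"
    return x.view(B, T // 2, 2, N, D).mean(dim=2)

x = torch.randn(1, 16, 256, 64)
print(merge_temporal_tokens(x).shape)  # torch.Size([1, 8, 256, 64])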
♻ ☆ A novel algorithm for optimizing bundle adjustment in image sequence alignment
The Bundle Adjustment (BA) model is commonly optimized using a nonlinear
least squares method, with the Levenberg-Marquardt (L-M) algorithm being a
typical choice. However, despite the L-M algorithm's effectiveness, its
sensitivity to initial conditions often results in slower convergence when
applied to poorly conditioned datasets, motivating the exploration of
alternative optimization strategies. This paper introduces a novel algorithm
for optimizing the BA model in the context of image sequence alignment for
cryo-electron tomography, utilizing optimal control theory to directly optimize
general nonlinear functions. The proposed Optimal Control Algorithm (OCA)
exhibits superior convergence rates and effectively mitigates the oscillatory
behavior frequently observed in the L-M algorithm. Extensive experiments on both
synthetic and real-world datasets were conducted to evaluate the algorithm's
performance. The results demonstrate that the OCA achieves faster convergence
compared to the L-M algorithm. Moreover, the incorporation of a bisection-based
update procedure significantly enhances the OCA's performance, particularly in
poorly initialized datasets. These findings indicate that the OCA can
substantially improve the efficiency of 3D reconstructions in cryo-electron
tomography.
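For context on the baseline discussed above, the sketch below implements the standard Levenberg-Marquardt update on a toy nonlinear least-squares problem (an exponential curve fit). It is not the proposed Optimal Control Algorithm; the damping schedule is a common textbook choice.

# A minimal sketch of the Levenberg-Marquardt baseline on a toy problem.
import numpy as np

def residuals(p, t, y):                      # model: y = p0 * exp(p1 * t)
    return p[0] * np.exp(p[1] * t) - y

def jacobian(p, t):
    return np.stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)], axis=1)

def levenberg_marquardt(p, t, y, lam=1e-2, iters=50):
    for _ in range(iters):
        r, J = residuals(p, t, y), jacobian(p, t)
        # Damped normal equations: (J^T J + lam * I) dp = -J^T r
        dp = np.linalg.solve(J.T @ J + lam * np.eye(len(p)), -J.T @ r)
        if np.sum(residuals(p + dp, t, y) ** 2) < np.sum(r ** 2):
            p, lam = p + dp, lam * 0.5       # accept step, trust the model more
        else:
            lam *= 2.0                       # reject step, increase damping
    return p

t = np.linspace(0, 1, 20)
y = 2.0 * np.exp(1.5 * t)
print(levenberg_marquardt(np.array([1.0, 1.0]), t, y))  # ~[2.0, 1.5]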
♻ ☆ WaveFormer: A 3D Transformer with Wavelet-Driven Feature Representation for Efficient Medical Image Segmentation
Md Mahfuz Al Hasan, Mahdi Zaman, Abdul Jawad, Alberto Santamaria-Pang, Ho Hin Lee, Ivan Tarapov, Kyle See, Md Shah Imran, Antika Roy, Yaser Pourmohammadi Fallah, Navid Asadizanjani, Reza Forghani
Transformer-based architectures have advanced medical image analysis by
effectively modeling long-range dependencies, yet they often struggle in 3D
settings due to substantial memory overhead and insufficient capture of
fine-grained local features. We address these limitations with WaveFormer, a
novel 3D-transformer that: i) leverages the fundamental frequency-domain
properties of features for contextual representation, and ii) is inspired by
the top-down mechanism of the human visual recognition system, making it a
biologically motivated architecture. By employing discrete wavelet
transformations (DWT) at multiple scales, WaveFormer preserves both global
context and high-frequency details while replacing heavy upsampling layers with
efficient wavelet-based summarization and reconstruction. This significantly
reduces the number of parameters, which is critical for real-world deployment
where computational resources and training times are constrained. Furthermore,
the model is generic and easily adaptable to diverse applications. Evaluations
on BraTS2023, FLARE2021, and KiTS2023 demonstrate performance on par with
state-of-the-art methods while offering substantially lower computational
complexity.
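The wavelet-based summarization and reconstruction idea can be illustrated with PyWavelets on a dummy 3D feature volume, as in the sketch below; WaveFormer's learned layers are not reproduced here.

# A minimal sketch of DWT-based summarisation/reconstruction of a 3D volume.
import numpy as np
import pywt

vol = np.random.rand(64, 64, 64).astype(np.float32)   # stand-in 3D feature volume

# One-level 3D DWT: 'aaa' is the low-frequency summary at half resolution,
# the remaining seven sub-bands hold the high-frequency detail.
coeffs = pywt.dwtn(vol, wavelet="haar", axes=(0, 1, 2))
summary = coeffs["aaa"]
print(summary.shape)                                   # (32, 32, 32)

# Exact reconstruction from all eight sub-bands (lossless for an even-sized input).
recon = pywt.idwtn(coeffs, wavelet="haar", axes=(0, 1, 2))
print(np.allclose(recon, vol, atol=1e-5))              # True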
♻ ☆ A Comparative Tutorial of the Histogram-based Image Segmentation Methods
The histogram of an image is an accurate graphical representation of its
grayscale distribution and also serves as an estimate of the probability
distribution of pixel intensities. Therefore, histograms have been widely adopted to
calculate the clustering means and partitioning thresholds for image
segmentation. Many classical histogram-based image segmentation methods have
been proposed and have played important roles in both academia and industry. In
this tutorial, the histories and recent advances of the histogram-based image
segmentation techniques are first reviewed and then they are divided into four
categories: (1) the means-based method, (2) the Gaussian-mixture-model-based
method, (3) the entropy-based method and (4) the feature-points-based method.
The purpose of this tutorial is threefold: 1) to teach interested readers the
principles of the classical histogram-based image segmentation methods; 2) to
objectively evaluate the advantages and disadvantages of these methods; and 3)
to objectively compare their performance with that of state-of-the-art
deep-learning-based methods.
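As one concrete example of the classical methods surveyed, the sketch below implements Otsu's between-class-variance thresholding purely from a 256-bin histogram; it is offered for illustration only.

# A minimal sketch of Otsu's histogram-based thresholding.
import numpy as np

def otsu_threshold(image: np.ndarray) -> int:
    """image: 2D uint8 array; returns the grayscale threshold maximising
    between-class variance computed from the 256-bin histogram."""
    hist = np.bincount(image.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                      # class-0 probability up to each level
    mu = np.cumsum(prob * np.arange(256))        # cumulative first moment
    mu_total = mu[-1]
    # Between-class variance: sigma_b^2(t) = (mu_T*omega - mu)^2 / (omega*(1-omega))
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b2 = np.nan_to_num(sigma_b2)
    return int(np.argmax(sigma_b2))

img = np.concatenate([np.random.randint(0, 80, 5000),
                      np.random.randint(170, 255, 5000)]).astype(np.uint8).reshape(100, 100)
print(otsu_threshold(img))                       # roughly between the two modes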
♻ ☆ LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Existing Multimodal Large Language Models (MLLMs) encounter significant
challenges in modeling the temporal context within long videos. Currently,
mainstream Agent-based methods use external tools (e.g., search engines, memory
banks, OCR, retrieval models) to assist a single MLLM in answering long video
questions. Despite such tool-based support, a solitary MLLM still offers only a
partial understanding of long videos, resulting in limited performance. In
order to better address long video tasks, we introduce LVAgent, the first
framework enabling multi-round dynamic collaboration of MLLM agents in long
video understanding. Our methodology consists of four key steps: 1. Selection:
We pre-select appropriate agents from the model library to form optimal agent
teams based on different tasks. 2. Perception: We design an effective retrieval
scheme for long videos, improving the coverage of critical temporal segments
while maintaining computational efficiency. 3. Action: Agents answer long
video-related questions and exchange reasons. 4. Reflection: We evaluate the
performance of each agent in each round of discussion and optimize the agent
team for dynamic collaboration. The agents iteratively refine their answers
through this multi-round dynamic collaboration. LVAgent is the first agent
system method that outperforms all closed-source models (including GPT-4o) and
open-source models (including InternVL-2.5 and Qwen2-VL) in the long video
understanding tasks. Our LVAgent achieves an accuracy of 80% on four mainstream
long video understanding tasks. Notably, on the LongVideoBench dataset, LVAgent
improves accuracy by up to 13.3% compared with SOTA.
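The selection-perception-action-reflection pipeline can be pictured with the toy collaboration loop below. The Agent interface, the majority-vote reflection, and the stub answers are illustrative assumptions, not the paper's implementation.

# A minimal sketch of a multi-round agent collaboration loop; interfaces assumed.
from collections import Counter
from typing import Callable, List

class Agent:
    def __init__(self, name: str, answer_fn: Callable[[str, str, List[str]], str]):
        self.name, self.answer_fn = name, answer_fn

    def answer(self, question: str, context: str, peer_answers: List[str]) -> str:
        return self.answer_fn(question, context, peer_answers)

def collaborate(agents: List[Agent], question: str, context: str, rounds: int = 3) -> str:
    answers = {a.name: "" for a in agents}
    for _ in range(rounds):
        # Action: every agent answers, seeing the other agents' previous answers.
        new = {a.name: a.answer(question, context,
                                [v for k, v in answers.items() if k != a.name and v])
               for a in agents}
        # Reflection: keep only agents that agree with the current majority answer.
        majority, _ = Counter(new.values()).most_common(1)[0]
        agents = [a for a in agents if new[a.name] == majority] or agents
        answers = new
    return Counter(answers.values()).most_common(1)[0][0]

# Toy usage with stub agents (real MLLM calls would replace the lambdas).
agents = [Agent("a", lambda q, c, p: "B"), Agent("b", lambda q, c, p: "B"),
          Agent("c", lambda q, c, p: p[0] if p else "A")]
print(collaborate(agents, "Which event happens last?", "<video segments>"))  # "B"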
♻ ☆ RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception
Cooperative perception offers an optimal solution to overcome the perception
limitations of single-agent systems by leveraging Vehicle-to-Everything (V2X)
communication for data sharing and fusion across multiple agents. However, most
existing approaches focus on single-modality data exchange, limiting the
potential of both homogeneous and heterogeneous fusion across agents. This
overlooks the opportunity to utilize multi-modality data per agent, restricting
the system's performance. In the automotive industry, manufacturers adopt
diverse sensor configurations, resulting in heterogeneous combinations of
sensor modalities across agents. To harness the potential of every possible
data source for optimal performance, we design a robust LiDAR and camera
cross-modality fusion module, Radian-Glue-Attention (RG-Attn), applicable to
both intra-agent cross-modality fusion and inter-agent cross-modality fusion
scenarios, owing to the convenient coordinate conversion by transformation
matrix and the unified sampling/inversion mechanism. We also propose two
different architectures, named Paint-To-Puzzle (PTP) and
Co-Sketching-Co-Coloring (CoS-CoCo), for conducting cooperative perception. PTP
aims for maximum precision performance and achieves smaller data packet size by
limiting cross-agent fusion to a single instance, but requires all
participants to be equipped with LiDAR. In contrast, CoS-CoCo supports agents
with any sensor configuration (LiDAR-only, camera-only, or both), offering
greater generalization ability. Our approach achieves state-of-the-art
(SOTA) performance on both real and simulated cooperative perception datasets.
The code is now available at GitHub.
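As a generic illustration of LiDAR-camera fusion via cross-attention (the radian-based sampling and inversion that define RG-Attn are not reproduced), the sketch below assumes both modalities have already been projected into a shared token space.

# A minimal sketch of cross-modality fusion with standard cross-attention.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_tokens: torch.Tensor, cam_tokens: torch.Tensor) -> torch.Tensor:
        """lidar_tokens: (B, N, D) queries; cam_tokens: (B, M, D) keys/values."""
        fused, _ = self.attn(lidar_tokens, cam_tokens, cam_tokens)
        return self.norm(lidar_tokens + fused)     # residual fusion in the LiDAR frame

fusion = CrossModalFusion()
out = fusion(torch.randn(2, 400, 128), torch.randn(2, 900, 128))
print(out.shape)                                   # torch.Size([2, 400, 128])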
♻ ☆ PSF-4D: A Progressive Sampling Framework for View Consistent 4D Editing
Instruction-guided generative models, especially those using text-to-image
(T2I) and text-to-video (T2V) diffusion frameworks, have advanced the field of
content editing in recent years. To extend these capabilities to 4D scene, we
introduce a progressive sampling framework for 4D editing (PSF-4D) that ensures
temporal and multi-view consistency by intuitively controlling the noise
initialization during forward diffusion. For temporal coherence, we design a
correlated Gaussian noise structure that links frames over time, allowing each
frame to depend meaningfully on prior frames. Additionally, to ensure spatial
consistency across views, we implement a cross-view noise model, which uses
shared and independent noise components to balance commonalities and distinct
details among different views. To further enhance spatial coherence, PSF-4D
incorporates view-consistent iterative refinement, embedding view-aware
information into the denoising process to ensure aligned edits across frames
and views. Our approach enables high-quality 4D editing without relying on
external models, addressing key challenges in previous methods. Through
extensive evaluation on multiple benchmarks and multiple editing aspects (e.g.,
style transfer, multi-attribute editing, object removal, local editing, etc.),
we show the effectiveness of our proposed method. Experimental results
demonstrate that our proposed method outperforms state-of-the-art 4D editing
methods in diverse benchmarks.
comment: 9 pages, 7 figures
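The structured noise initialization described above can be pictured with the sketch below, which combines an AR(1)-style temporal link between frames with a shared/independent split across views. The correlation coefficients are assumptions; the paper's exact noise model may differ.

# A minimal sketch of temporally and cross-view correlated Gaussian noise.
import torch

def structured_noise(views: int, frames: int, shape=(4, 32, 32),
                     rho_t: float = 0.9, rho_v: float = 0.5) -> torch.Tensor:
    """Returns noise of shape (views, frames, *shape) with unit marginal variance."""
    shared = torch.randn(frames, *shape)                 # component common to all views
    noise = torch.empty(views, frames, *shape)
    for v in range(views):
        indep = torch.randn(frames, *shape)              # view-specific component
        eps = rho_v * shared + (1 - rho_v ** 2) ** 0.5 * indep
        noise[v, 0] = eps[0]
        for t in range(1, frames):                       # AR(1) link between frames
            noise[v, t] = rho_t * noise[v, t - 1] + (1 - rho_t ** 2) ** 0.5 * eps[t]
    return noise

n = structured_noise(views=3, frames=8)
print(n.shape, n.std().item())                           # (3, 8, 4, 32, 32), ~1.0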
♻ ☆ Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing Systems
This thesis employs a hybrid CNN-Transformer architecture, in conjunction
with a detailed anthropological framework, to investigate potential historical
connections between the visual morphology of the Indus Valley script and
pictographic systems of the Tibetan-Yi Corridor. Through an ensemble
methodology spanning three target scripts and 15 independently trained models, we
demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold
higher visual similarity to the Indus script (61.7%-63.5%) than to the Bronze
Age Proto-Cuneiform (10.2%-10.9%) or Proto-Elamite (7.6%-8.7%) systems.
Additionally, and contrary to our current understanding of the networks of the
Indus Valley Civilization, the Indus script unexpectedly maps closer to
Tibetan-Yi Corridor scripts, with a mean cosine similarity of 0.629, than to
the aforementioned contemporaneous West Asian signaries, which recorded
mean cosine similarities of 0.104 and 0.080, respectively, despite their close geographic
proximity and evident trade relations. Across various dimensionality reduction
practices and clustering methodologies, the Indus script consistently clusters
closest to Tibetan-Yi Corridor scripts. Our computational results align with
qualitative observations of specific pictorial parallels in numeral systems,
gender markers, and key iconographic elements; this is further supported by
archaeological evidence of sustained contact networks along the ancient
Shu-Shendu road in tandem with the Indus Valley Civilization's decline,
providing a plausible transmission pathway. While alternative explanations
cannot be ruled out, the specificity and consistency of observed similarities
challenge conventional narratives of isolated script development and suggest
more complex ancient cultural transmission networks between South and East Asia
than previously recognized.
comment: 106 pages (42 main text, 6 references, 58 appendices). 21 figures, 4
tables in main text; 106 figures, 8 tables total. Code:
https://github.com/oohalakkadi/ivc2tyc. Undergraduate thesis at Duke Kunshan
University. Accepted for presentation at the 52nd International Conference
for Computer Applications & Quantitative Methods in Archaeology (CAA 2025),
Athens, Greece
♻ ☆ Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
Hallucinations in Large Vision-Language Models (LVLMs) significantly
undermine their reliability, motivating researchers to explore the causes of
hallucination. However, most studies primarily focus on the language aspect
rather than the visual. In this paper, we address how LVLMs process visual
information and whether this process causes hallucination. Firstly, we use the
attention lens to identify the stages at which LVLMs handle visual data,
discovering that the middle layers are crucial. Moreover, we find that these
layers can be further divided into two stages: "visual information
enrichment" and "semantic refinement," which respectively propagate visual
data to object tokens and interpret it through text. By analyzing attention
patterns during the visual information enrichment stage, we find that real
tokens consistently receive higher attention weights than hallucinated ones,
serving as a strong indicator of hallucination. Further examination of
multi-head attention maps reveals that hallucination tokens often result from
heads interacting with inconsistent objects. Based on these insights, we
propose a simple inference-time method that adjusts visual attention by
integrating information across various heads. Extensive experiments demonstrate
that this approach effectively mitigates hallucinations in mainstream LVLMs
without additional training costs. Code is available at
https://github.com/ZhangqiJiang07/middle_layers_indicating_hallucinations.
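The attention-lens indicator described above can be pictured with the sketch below, which measures how much attention an object token places on the image tokens, averaged over heads in the middle layers. The layer range, token indices, and any decision threshold are assumptions.

# A minimal sketch of an attention-mass indicator for object tokens.
import torch

def image_attention_mass(attn: torch.Tensor, obj_pos: int, image_slice: slice,
                         mid_layers: slice = slice(10, 20)) -> float:
    """attn: (layers, heads, seq, seq) post-softmax attention of one generation step."""
    a = attn[mid_layers, :, obj_pos, image_slice]   # (L_mid, heads, n_image_tokens)
    return float(a.sum(dim=-1).mean())              # average over layers and heads

layers, heads, seq = 32, 8, 600
attn = torch.softmax(torch.randn(layers, heads, seq, seq), dim=-1)
mass = image_attention_mass(attn, obj_pos=590, image_slice=slice(1, 577))
print(mass)  # a real object token would typically score higher than a hallucinated one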
♻ ☆ View-Invariant Pixelwise Anomaly Detection in Multi-object Scenes with Adaptive View Synthesis
Visual anomaly detection in the built environment is a valuable tool for
applications such as infrastructure assessment, construction monitoring,
security surveillance, and urban planning. Anomaly detection approaches are
typically unsupervised and work by detecting deviations from an expected state,
where no assumptions are made about the exact type of deviation. Unsupervised pixel-level
anomaly detection methods have been developed to successfully recognize and
segment anomalies; however, existing techniques are designed for industrial
settings with a fixed camera position. In the built environment, images are
periodically captured by a camera operated manually or mounted on aerial or
ground vehicles. The camera pose between successive collections may vary widely,
voiding a fundamental assumption in existing anomaly detection approaches. To
address this gap, we introduce the problem of Scene Anomaly Detection (Scene
AD), where the goal is to detect anomalies from two sets of images: one set
without anomalies and one set that may or may not contain anomalies. No labeled
semantic segmentation data are provided for training. We propose a novel
network, OmniAD, to tackle Scene AD by refining the reverse distillation
anomaly detection method, leading to a 40% improvement in pixel-level anomaly
detection. Additionally, we introduce two new data augmentation strategies that
leverage novel view synthesis and camera localization to enhance
generalization. We evaluate our approach both qualitatively and quantitatively
on a new dataset, ToyCity, the first Scene AD dataset featuring multiple objects,
as well as on the established single-object-centric dataset, MAD. Our method
demonstrates marked improvement over baseline approaches, paving the way for
robust anomaly detection in scenes with real-world camera pose variations
commonly observed in the built environment. https://drags99.github.io/OmniAD/
♻ ☆ Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation
While convolutional neural networks (CNNs) have come to match and exceed
human performance in many settings, the tasks these models optimize for are
largely constrained to the level of individual objects, such as classification
and captioning. Humans remain vastly superior to CNNs in visual tasks involving
relations, including the ability to identify two objects as 'same' or
'different'. A number of studies have shown that while CNNs can be coaxed into
learning the same-different relation in some settings, they tend to generalize
poorly to other instances of this relation. In this work we show that the same
CNN architectures that fail to generalize the same-different relation with
conventional training are able to succeed when trained via meta-learning, which
explicitly encourages abstraction and generalization across tasks.
♻ ☆ Enhancing Domain Adaptation through Prompt Gradient Alignment NeurIPS 2024
Prior Unsupervised Domain Adaptation (UDA) methods often aim to train a
domain-invariant feature extractor, which may hinder the model from learning
sufficiently discriminative features. To tackle this, a line of works based on
prompt learning leverages the power of large-scale pre-trained vision-language
models to learn both domain-invariant and specific features through a set of
domain-agnostic and domain-specific learnable prompts. Those studies typically
enforce invariant constraints on representation, output, or prompt space to
learn such prompts. In contrast, we cast UDA as a multiple-objective
optimization problem in which each objective is represented by a domain loss.
Under this new framework, we propose to align per-objective gradients to foster
consensus between them. Additionally, to prevent potential overfitting when
fine-tuning this deep learning architecture, we penalize the norm of these
gradients. To achieve these goals, we devise a practical gradient update
procedure that can work under both single-source and multi-source UDA.
Empirically, our method consistently outperforms other vision-language model
adaptation methods. The implementation is available at
https://github.com/VietHoang1512/PGA.
comment: Accepted to NeurIPS 2024
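A minimal sketch of aligning per-domain gradients on shared prompt parameters is given below: a cosine-alignment bonus and a gradient-norm penalty are added to the sum of domain losses. The coefficients and the exact formulation are assumptions; see the repository above for the authors' procedure.

# A minimal sketch of gradient alignment plus a norm penalty; not the authors' exact update.
import torch

def pga_step(prompts: torch.Tensor, loss_src: torch.Tensor, loss_tgt: torch.Tensor,
             opt: torch.optim.Optimizer, beta: float = 0.1, gamma: float = 0.01) -> None:
    g_s = torch.autograd.grad(loss_src, prompts, create_graph=True)[0].flatten()
    g_t = torch.autograd.grad(loss_tgt, prompts, create_graph=True)[0].flatten()
    align = torch.nn.functional.cosine_similarity(g_s, g_t, dim=0)
    total = loss_src + loss_tgt - beta * align + gamma * (g_s.norm() + g_t.norm())
    opt.zero_grad()
    total.backward()
    opt.step()

# Toy usage: a learnable prompt and two quadratic "domain losses".
prompts = torch.nn.Parameter(torch.randn(8))
opt = torch.optim.SGD([prompts], lr=0.1)
for _ in range(5):
    pga_step(prompts, (prompts - 1.0).pow(2).sum(), (prompts - 2.0).pow(2).sum(), opt)
print(prompts.detach())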
♻ ☆ Disentangling Safe and Unsafe Corruptions via Anisotropy and Locality
State-of-the-art machine learning systems are vulnerable to small
perturbations to their input, where "small" is defined according to a threat
model that assigns a positive threat to each perturbation. Most prior works
define a task-agnostic, isotropic, and global threat, like the $\ell_p$ norm,
where the magnitude of the perturbation fully determines the degree of the
threat and neither the direction of the attack nor its position in space
matter. However, common corruptions in computer vision, such as blur,
compression, or occlusions, are not well captured by such threat models. This
paper proposes a novel threat model called Projected Displacement (PD)
to study robustness beyond existing isotropic and global threat models. The
proposed threat model measures the threat of a perturbation via its alignment
with unsafe directions, defined as directions in the input space along
which a perturbation of sufficient magnitude changes the ground truth class
label. Unsafe directions are identified locally for each input based on
observed training data. In this way, the PD threat model exhibits anisotropy
and locality. Experiments on ImageNet-1k data indicate that, for any input, the
set of perturbations with small PD threat includes safe perturbations
of large $\ell_p$ norm that preserve the true label, such as noise, blur and
compression, while simultaneously excluding unsafe perturbations that
alter the true label. Unlike perceptual threat models based on embeddings of
large-vision models, the PD threat model can be readily computed for arbitrary
classification tasks without pre-training or fine-tuning. Furthermore, additional
task annotations, such as sensitivity to image regions or concept hierarchies, can
be easily integrated into the threat assessment, and thus the PD threat model
presents practitioners with a flexible, task-driven threat specification.
comment: Published at IEEE/CVF Conference on Computer Vision and Pattern
Recognition 2025. Updated Acknowledgements
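A minimal sketch of a projected-displacement-style threat score is given below. As one simple choice, unsafe directions for an input are taken to be unit vectors pointing toward nearby training samples of other classes, and the threat of a perturbation is its largest projection onto any of them; the paper's exact construction may differ.

# A minimal sketch of an anisotropic, local threat score; the choice of unsafe
# directions below is illustrative, not the paper's construction.
import numpy as np

def unsafe_directions(x: np.ndarray, X_train: np.ndarray, y_train: np.ndarray,
                      y_x: int, k: int = 10) -> np.ndarray:
    diffs = X_train[y_train != y_x] - x                 # vectors towards other classes
    nearest = diffs[np.argsort(np.linalg.norm(diffs, axis=1))[:k]]
    return nearest / np.linalg.norm(nearest, axis=1, keepdims=True)

def pd_threat(delta: np.ndarray, U: np.ndarray) -> float:
    return float(np.maximum(U @ delta, 0.0).max())      # largest projection onto any unsafe direction

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)); y = (X[:, 0] > 0).astype(int)
x = np.array([1.0, 0.0]); U = unsafe_directions(x, X, y, y_x=1)
print(pd_threat(np.array([-0.5, 0.0]), U))   # points towards the other class: high threat
print(pd_threat(np.array([0.0, 0.5]), U))    # roughly boundary-parallel: lower threat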