Computer Vision and Pattern Recognition 154
☆ Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder
Camera traps are revolutionising wildlife monitoring by capturing vast
amounts of visual data; however, the manual identification of individual
animals remains a significant bottleneck. This study introduces a fully
self-supervised approach to learning robust chimpanzee face embeddings from
unlabeled camera-trap footage. Leveraging the DINOv2 framework, we train Vision
Transformers on automatically mined face crops, eliminating the need for
identity labels. Our method demonstrates strong open-set re-identification
performance, surpassing supervised baselines on challenging benchmarks such as
Bossou, despite utilising no labelled data during training. This work
underscores the potential of self-supervised learning in biodiversity
monitoring and paves the way for scalable, non-invasive population studies.
comment: Accepted for publication. Project page, code and weights:
https://www.robots.ox.ac.uk/~vgg/research/ChimpUFE/
☆ EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi
Recent advanced vision-language models (VLMs) have demonstrated strong
performance on passive, offline image and video understanding tasks. However,
their effectiveness in embodied settings, which require online interaction and
active scene understanding, remains limited. In such scenarios, an agent
perceives the environment from a first-person perspective, with each action
dynamically shaping subsequent observations. Even state-of-the-art models such
as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment
interactions, exhibiting clear limitations in spatial reasoning and
long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset
of over 3,000 language-guided tasks situated in diverse, photorealistic
environments constructed using Unreal Engine and the UnrealCV-Zoo framework.
The tasks encompass a wide range of embodied challenges, including navigation,
object manipulation, and multi-stage goal execution. Each task unfolds as a
multi-step trajectory, pairing first-person visual observations with high-level
instructions, grounded actions, and natural language rationales that express
the agent's intent at every step. Using EmbRACE-3K, we establish a benchmark to
evaluate the embodied reasoning capabilities of VLMs across three key
dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage
Goal Execution. In zero-shot settings, all models achieve success rates below
20%, underscoring the challenge posed by our benchmark and the current
limitations of VLMs in interactive environments. To demonstrate the utility of
EmbRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning
followed by reinforcement learning. This approach yields substantial
improvements across all three challenge categories, highlighting the dataset's
effectiveness in enabling the development of embodied reasoning capabilities.
comment: Project page: https://mxllc.github.io/EmbRACE-3K/
☆ Quantize-then-Rectify: Efficient VQ-VAE Training
Visual tokenizers are pivotal in multimodal large models, acting as bridges
between continuous inputs and discrete tokens. Nevertheless, training
high-compression-rate VQ-VAEs remains computationally demanding, often
necessitating thousands of GPU hours. This work demonstrates that a pre-trained
VAE can be efficiently transformed into a VQ-VAE by controlling quantization
noise within the VAE's tolerance threshold. We present
Quantize-then-Rectify (ReVQ), a framework leveraging pre-trained VAEs
to enable rapid VQ-VAE training with minimal computational overhead. By
integrating channel multi-group quantization to enlarge codebook
capacity and a post rectifier to mitigate quantization errors, ReVQ
compresses ImageNet images into at most 512 tokens while sustaining competitive
reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training
costs by over two orders of magnitude relative to state-of-the-art approaches:
ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours,
whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental
results show that ReVQ achieves superior efficiency-reconstruction trade-offs.
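The channel multi-group quantization plus post rectifier described above can be pictured with the minimal PyTorch sketch below. It assumes a pre-trained VAE latent of shape (B, C, H, W); all sizes, module names, and the rectifier architecture are illustrative stand-ins, not the authors' implementation.
```python
import torch
import torch.nn as nn

class ChannelMultiGroupQuantizer(nn.Module):
    """Illustrative sketch: split a pre-trained VAE latent's channels into groups,
    quantize each group against its own codebook, then rectify the result."""

    def __init__(self, channels=16, groups=4, codebook_size=256):
        super().__init__()
        assert channels % groups == 0
        self.groups, self.dim = groups, channels // groups
        # One codebook per channel group; joint capacity grows combinatorially.
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, self.dim)) for _ in range(groups)]
        )
        # Small post rectifier that pushes the quantized latent back toward the
        # continuous VAE latent, compensating quantization error.
        self.rectifier = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, z):                                   # z: (B, C, H, W) VAE latent
        quantized = []
        for zc, book in zip(z.chunk(self.groups, dim=1), self.codebooks):
            b, _, h, w = zc.shape
            flat = zc.permute(0, 2, 3, 1).reshape(-1, self.dim)      # (B*H*W, dim)
            idx = torch.cdist(flat, book).argmin(dim=1)              # nearest code
            q = book[idx].view(b, h, w, self.dim).permute(0, 3, 1, 2)
            quantized.append(zc + (q - zc).detach())        # straight-through gradient
        zq = torch.cat(quantized, dim=1)
        return zq + self.rectifier(zq)                      # rectified latent for the decoder

# zq = ChannelMultiGroupQuantizer()(torch.randn(2, 16, 32, 32))
```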
☆ ScaffoldAvatar: High-Fidelity Gaussian Avatars with Patch Expressions SIGGRAPH 2025
Shivangi Aneja, Sebastian Weiss, Irene Baeza, Prashanth Chandran, Gaspard Zoss, Matthias Nießner, Derek Bradley
Generating high-fidelity real-time animated sequences of photorealistic 3D
head avatars is important for many graphics applications, including immersive
telepresence and movies. This is a challenging problem particularly when
rendering digital avatar close-ups that show a character's facial microfeatures
and expressions. To capture the expressive, detailed nature of human heads,
including skin furrowing and finer-scale facial movements, we propose to couple
locally-defined facial expressions with 3D Gaussian splatting to enable
creating ultra-high fidelity, expressive and photorealistic 3D head avatars. In
contrast to previous works that operate on a global expression space, we
condition our avatar's dynamics on patch-based local expression features and
synthesize 3D Gaussians at a patch level. In particular, we leverage a
patch-based geometric 3D face model to extract patch expressions and learn how
to translate these into local dynamic skin appearance and motion by coupling
the patches with anchor points of Scaffold-GS, a recent hierarchical scene
representation. These anchors are then used to synthesize 3D Gaussians
on-the-fly, conditioned by patch-expressions and viewing direction. We employ
color-based densification and progressive training to obtain high-quality
results and faster convergence for high resolution 3K training images. By
leveraging patch-level expressions, ScaffoldAvatar consistently achieves
state-of-the-art performance with visually natural motion, while encompassing
diverse facial expressions and styles in real time.
comment: (SIGGRAPH 2025) Paper Video: https://youtu.be/VyWkgsGdbkk Project
Page: https://shivangi-aneja.github.io/projects/scaffoldavatar/
☆ Scene-Aware Conversational ADAS with Generative AI for Real-Time Driver Assistance
While autonomous driving technologies continue to advance, current Advanced
Driver Assistance Systems (ADAS) remain limited in their ability to interpret
scene context or engage with drivers through natural language. These systems
typically rely on predefined logic and lack support for dialogue-based
interaction, making them inflexible in dynamic environments or when adapting to
driver intent. This paper presents Scene-Aware Conversational ADAS (SC-ADAS), a
modular framework that integrates Generative AI components including large
language models, vision-to-text interpretation, and structured function calling
to enable real-time, interpretable, and adaptive driver assistance. SC-ADAS
supports multi-turn dialogue grounded in visual and sensor context, allowing
natural language recommendations and driver-confirmed ADAS control. Implemented
in the CARLA simulator with cloud-based Generative AI, the system executes
confirmed user intents as structured ADAS commands without requiring model
fine-tuning. We evaluate SC-ADAS across scene-aware, conversational, and
revisited multi-turn interactions, highlighting trade-offs such as increased
latency from vision-based context retrieval and token growth from accumulated
dialogue history. These results demonstrate the feasibility of combining
conversational reasoning, scene perception, and modular ADAS control to support
the next generation of intelligent driver assistance.
☆ National level satellite-based crop field inventories in smallholder landscapes
Philippe Rufin, Pauline Lucie Hammer, Leon-Friedrich Thomas, Sá Nogueira Lisboa, Natasha Ribeiro, Almeida Sitoe, Patrick Hostert, Patrick Meyfroidt
The design of science-based policies to improve the sustainability of
smallholder agriculture is challenged by a limited understanding of fundamental
system properties, such as the spatial distribution of active cropland and
field size. We integrate very high spatial resolution (1.5 m) Earth observation
data and deep transfer learning to derive crop field delineations in complex
agricultural systems at the national scale, while maintaining minimum reference
data requirements and enhancing transferability. We provide the first
national-level dataset of 21 million individual fields for Mozambique (covering
~800,000 km²) for 2023. Our maps separate active cropland from non-agricultural
land use with an overall accuracy of 93% and balanced omission and commission
errors. Field-level spatial agreement reached median intersection over union
(IoU) scores of 0.81, advancing the state-of-the-art in large-area field
delineation in complex smallholder systems. The active cropland maps capture
fragmented rural regions with low cropland shares not yet identified in global
land cover or cropland maps. These regions are mostly located in agricultural
frontier areas, which host 7-9% of the Mozambican population. Field size in
Mozambique is very low overall, with half of the fields being smaller than 0.16
ha, and 83% smaller than 0.5 ha. Mean field size at aggregate spatial
resolution (0.05°) is 0.32 ha, but it varies strongly across gradients of
accessibility, population density, and net forest cover change. This variation
reflects a diverse set of actors, ranging from semi-subsistence smallholder
farms to medium-scale commercial farming, and large-scale farming operations.
Our results highlight that field size is a key indicator relating to
socio-economic and environmental outcomes of agriculture (e.g., food
production, livelihoods, deforestation, biodiversity), as well as their
trade-offs.
☆ Cameras as Relative Positional Encoding
Transformers are increasingly prevalent for multi-view computer vision tasks,
where geometric relationships between viewpoints are critical for 3D
perception. To leverage these relationships, multi-view transformers must use
camera geometry to ground visual tokens in 3D space. In this work, we compare
techniques for conditioning transformers on cameras: token-level raymap
encodings, attention-level relative pose encodings, and a new relative encoding
we propose -- Projective Positional Encoding (PRoPE) -- that captures complete
camera frustums, both intrinsics and extrinsics, as a relative positional
encoding. Our experiments begin by showing how relative camera conditioning
improves performance in feedforward novel view synthesis, with further gains
from PRoPE. This holds across settings: scenes with both shared and varying
intrinsics, when combining token- and attention-level conditioning, and for
generalization to inputs with out-of-distribution sequence lengths and camera
intrinsics. We then verify that these benefits persist for different tasks,
stereo depth estimation and discriminative spatial cognition, as well as larger
model sizes.
comment: Project Page: https://www.liruilong.cn/prope/
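To make "relative camera conditioning" concrete: for each ordered pair of views one can form the relative transform between their full projection matrices (intrinsics and extrinsics together), which is independent of the world coordinate frame. The sketch below only constructs these pairwise quantities; how PRoPE turns them into a positional encoding inside attention is described in the paper, and the exact composition used here is an assumption.
```python
import numpy as np

def intrinsics_4x4(K):
    """Embed a 3x3 pinhole intrinsics matrix into a 4x4 homogeneous matrix."""
    K4 = np.eye(4)
    K4[:3, :3] = K
    return K4

def pairwise_relative_projections(Ks, Es):
    """Relative projective transforms between every ordered pair of cameras.

    Ks: (N, 3, 3) intrinsics; Es: (N, 4, 4) world-to-camera extrinsics.
    Returns rel with rel[i, j] = P_j @ inv(P_i), where P = K4 @ E. This quantity
    depends only on the relationship between frustums i and j, not on the choice
    of world frame, which is the property a relative encoding needs.
    """
    P = np.stack([intrinsics_4x4(K) @ E for K, E in zip(Ks, Es)])
    P_inv = np.linalg.inv(P)
    N = len(P)
    rel = np.empty((N, N, 4, 4))
    for i in range(N):
        for j in range(N):
            rel[i, j] = P[j] @ P_inv[i]
    return rel

# Example with two views sharing intrinsics:
# K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
# rel = pairwise_relative_projections([K, K], [np.eye(4), np.eye(4)])
```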
☆ BenchReAD: A systematic benchmark for retinal anomaly detection MICCAI 2025
Retinal anomaly detection plays a pivotal role in screening ocular and
systemic diseases. Despite its significance, progress in the field has been
hindered by the absence of a comprehensive and publicly available benchmark,
which is essential for the fair evaluation and advancement of methodologies.
Due to this limitation, previous anomaly detection work related to retinal
images has been constrained by (1) a limited and overly simplistic set of
anomaly types, (2) test sets that are nearly saturated, and (3) a lack of
generalization evaluation, resulting in less convincing experimental setups.
Furthermore, existing benchmarks in medical anomaly detection predominantly
focus on one-class supervised approaches (training only with negative samples),
overlooking the vast amounts of labeled abnormal data and unlabeled data that
are commonly available in clinical practice. To bridge these gaps, we introduce
a benchmark for retinal anomaly detection that is comprehensive and
systematic in terms of both data and algorithms. Through categorizing and
benchmarking previous methods, we find that a fully supervised approach
leveraging disentangled representations of abnormalities (DRA) achieves the
best performance but suffers from significant drops in performance when
encountering certain unseen anomalies. Inspired by the memory bank mechanisms
in one-class supervised learning, we propose NFM-DRA, which integrates DRA with
a Normal Feature Memory to mitigate the performance degradation, establishing a
new SOTA. The benchmark is publicly available at
https://github.com/DopamineLcy/BenchReAD.
comment: MICCAI 2025
☆ The Power of Certainty: How Confident Models Lead to Better Segmentation
Deep learning models have been proposed for automatic polyp detection and
precise segmentation of polyps during colonoscopy procedures. Although these
state-of-the-art models achieve high performance, they often require a large
number of parameters. Their complexity can make them prone to overfitting,
particularly when trained on biased datasets, and can result in poor
generalization across diverse datasets. Knowledge distillation and
self-distillation are proposed as promising strategies to mitigate the
limitations of large, over-parameterized models. These approaches, however, are
resource-intensive, often requiring multiple models and significant memory
during training. We propose a confidence-based self-distillation approach that
outperforms state-of-the-art models while storing only the previous iteration's
data during training and requiring no extra computation or memory at test time.
Our approach calculates a loss between the predictions of the previous and
current iterations within a batch, scaled by a dynamic confidence coefficient. To
evaluate the effectiveness of our approach, we conduct comprehensive
experiments on the task of polyp segmentation. Our approach outperforms
state-of-the-art models and generalizes well across datasets collected from
multiple clinical centers. The code will be released to the public once the
paper is accepted.
comment: 9 pages, 3 figures
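The core mechanism above (a loss between the previous and current iterations' predictions, scaled by a dynamic confidence coefficient) could look roughly like the sketch below. The form of the coefficient and of the distillation term are assumptions of this illustration, not the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def confidence_self_distillation_loss(curr_logits, prev_logits, targets):
    """Illustrative confidence-weighted self-distillation for segmentation.

    curr_logits, prev_logits: (B, C, H, W) predictions from the current and the
    stored previous training iteration; targets: (B, H, W) ground-truth labels.
    The distillation term is scaled by a confidence value derived from the
    previous prediction's certainty (an assumed, simple choice of coefficient).
    """
    supervised = F.cross_entropy(curr_logits, targets)
    prev_prob = prev_logits.softmax(dim=1)
    # Dynamic confidence: mean max-probability of the previous iteration's output.
    confidence = prev_prob.max(dim=1).values.mean().detach()
    distill = F.kl_div(curr_logits.log_softmax(dim=1), prev_prob.detach(),
                       reduction='batchmean')
    return supervised + confidence * distill
```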
☆ Privacy-Preserving Multi-Stage Fall Detection Framework with Semi-supervised Federated Learning and Robotic Vision Confirmation
Seyed Alireza Rahimi Azghadi, Truong-Thanh-Hung Nguyen, Helene Fournier, Monica Wachowicz, Rene Richard, Francis Palma, Hung Cao
The aging population is growing rapidly, and so is the risk of falls among
older adults. Falls are a major cause of injury, and timely detection can
greatly reduce medical expenses and recovery time. However, to provide timely
intervention and avoid unnecessary alarms, detection systems must be effective
and reliable while addressing users' privacy concerns. In this
work, we propose a framework for detecting falls using several complementary
systems: a semi-supervised federated learning-based fall detection system
(SF2D), an indoor localization and navigation system, and a vision-based human
fall recognition system. A wearable device and an edge device identify a fall
scenario in the first system. On top of that, the second system uses an indoor
localization technique first to localize the fall location and then navigate a
robot to inspect the scenario. A vision-based detection system running on an
edge device with a mounted camera on a robot is used to recognize fallen
people. Each system in the proposed framework achieves a different accuracy
rate: SF2D has a 0.81% failure rate, equivalent to 99.19% accuracy, while the
vision-based fallen-person detection achieves 96.3% accuracy. However, when the
accuracy of these two systems is combined with that of the navigation system
(95% success rate), the proposed framework delivers highly reliable fall
detection, with an overall accuracy of 99.99%. Not only is the proposed
framework safe for older adults,
but it is also a privacy-preserving solution for detecting falls.
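As a rough plausibility check (an assumption about how the numbers combine, not the authors' stated derivation): if the three subsystems fail independently and a fall is missed only when every stage fails, the quoted rates combine as follows.
```python
# Assumed independent failure rates taken from the abstract (illustrative only).
sf2d_fail = 0.0081        # 100% - 99.19% accuracy
vision_fail = 0.037       # 100% - 96.3% accuracy
navigation_fail = 0.05    # 100% - 95% success rate

combined_fail = sf2d_fail * vision_fail * navigation_fail
print(f"combined failure rate: {combined_fail:.6%}")      # ~0.0015%
print(f"combined accuracy:     {1 - combined_fail:.4%}")  # ~99.9985%, in line with the reported 99.99%
```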
☆ GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space ICCV2025
Timestamp prediction aims to determine when an image was captured using only
visual information, supporting applications such as metadata correction,
retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely
on cues like brightness, hue, and shadow positioning, while seasonal changes
and weather inform date estimation. However, these visual cues significantly
depend on geographic context, closely linking timestamp prediction to
geo-localization. To address this interdependence, we introduce GT-Loc, a novel
retrieval-based method that jointly predicts the capture time (hour and month)
and geo-location (GPS coordinates) of an image. Our approach employs separate
encoders for images, time, and location, aligning their embeddings within a
shared high-dimensional feature space. Recognizing the cyclical nature of time,
instead of conventional contrastive learning with hard positives and negatives,
we propose a temporal metric-learning objective providing soft targets by
modeling pairwise time differences over a cyclical toroidal surface. We present
new benchmarks demonstrating that our joint optimization surpasses previous
time prediction methods, even those using the ground-truth geo-location as an
input during inference. Additionally, our approach achieves competitive results
on standard geo-localization tasks, and the unified embedding space facilitates
compositional and text-based image retrieval.
comment: Accepted in ICCV2025
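The cyclical, toroidal treatment of time mentioned above can be illustrated with a small helper that measures hour and month differences on their respective circles and converts them into soft similarity targets; the normalization and temperature below are assumptions, and GT-Loc's actual objective may be parameterized differently.
```python
import numpy as np

def circular_diff(a, b, period):
    """Shortest distance between two values on a circle with the given period."""
    d = np.abs(a - b) % period
    return np.minimum(d, period - d)

def toroidal_time_distance(hour_a, month_a, hour_b, month_b):
    """Normalized distance between two capture times on an (hour, month) torus."""
    dh = circular_diff(hour_a, hour_b, 24) / 12.0    # in [0, 1]
    dm = circular_diff(month_a, month_b, 12) / 6.0   # in [0, 1]
    return np.sqrt(dh ** 2 + dm ** 2) / np.sqrt(2.0)

def soft_target(hour_a, month_a, hour_b, month_b, temperature=0.25):
    """Soft similarity in (0, 1]: nearby times on the torus get targets near 1."""
    return np.exp(-toroidal_time_distance(hour_a, month_a, hour_b, month_b) / temperature)

# Example: 23:00 in December and 01:00 in January are close on the torus.
print(soft_target(23, 12, 1, 1))
```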
☆ RefSTAR: Blind Facial Image Restoration with Reference Selection, Transfer, and Reconstruction
Zhicun Yin, Junjie Chen, Ming Liu, Zhixin Wang, Fan Li, Renjing Pei, Xiaoming Li, Rynson W. H. Lau, Wangmeng Zuo
Blind facial image restoration is highly challenging due to unknown complex
degradations and the sensitivity of humans to faces. Although existing methods
introduce auxiliary information from generative priors or high-quality
reference images, they still struggle with identity preservation problems,
mainly due to improper feature introduction on detailed textures. In this
paper, we focus on effectively incorporating appropriate features from
high-quality reference images, presenting a novel blind facial image
restoration method that considers reference selection, transfer, and
reconstruction (RefSTAR). In terms of selection, we construct a reference
selection (RefSel) module. For training the RefSel module, we construct a
RefSel-HQ dataset through a mask generation pipeline, which contains annotated
masks for 10,000 ground-truth-reference pairs. As for the transfer, due to the
trivial solution in vanilla cross-attention operations, a feature fusion
paradigm is designed to force the features from the reference to be integrated.
Finally, we propose a reference image reconstruction mechanism that further
ensures the presence of reference image features in the output image. The cycle
consistency loss is also redesigned in conjunction with the mask. Extensive
experiments on various backbone models demonstrate superior performance,
showing better identity preservation ability and reference feature transfer
quality. Source code, dataset, and pre-trained models are available at
https://github.com/yinzhicun/RefSTAR.
☆ RAPNet: A Receptive-Field Adaptive Convolutional Neural Network for Pansharpening
Pansharpening refers to the process of integrating a high-resolution
panchromatic (PAN) image with a lower-resolution multispectral (MS) image to
generate a fused product, which is pivotal in remote sensing. Despite the
effectiveness of CNNs in addressing this challenge, they are inherently
constrained by the uniform application of convolutional kernels across all
spatial positions, overlooking local content variations. To overcome this
issue, we introduce RAPNet, a new architecture that leverages content-adaptive
convolution. At its core, RAPNet employs the Receptive-field Adaptive
Pansharpening Convolution (RAPConv), designed to produce spatially adaptive
kernels responsive to local feature context, thereby enhancing the precision of
spatial detail extraction. Additionally, the network integrates the
Pansharpening Dynamic Feature Fusion (PAN-DFF) module, which incorporates an
attention mechanism to achieve an optimal balance between spatial detail
enhancement and spectral fidelity. Comprehensive evaluations on publicly
available datasets confirm that RAPNet delivers superior performance compared
to existing approaches, as demonstrated by both quantitative metrics and
qualitative assessments. Ablation analyses further substantiate the
effectiveness of the proposed adaptive components.
comment: To appear in the proceedings of the 6th International Conference on
Artificial Intelligence and Electromechanical Automation (AIEA 2025). 5
pages, 6 figures
☆ CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding
Coral reefs are vital yet vulnerable ecosystems that require continuous
monitoring to support conservation. While coral reef images provide essential
information in coral monitoring, interpreting such images remains challenging
due to the need for domain expertise. Visual Question Answering (VQA), powered
by Large Vision-Language Models (LVLMs), has great potential in user-friendly
interaction with coral reef images. However, applying VQA to coral imagery
demands a dedicated dataset that addresses two key challenges: domain-specific
annotations and multidimensional questions. In this work, we introduce
CoralVQA, the first large-scale VQA dataset for coral reef analysis. It
contains 12,805 real-world coral images from 67 coral genera collected from 3
oceans, along with 277,653 question-answer pairs that comprehensively assess
ecological and health-related conditions. To construct this dataset, we develop
a semi-automatic data construction pipeline in collaboration with marine
biologists to ensure both scalability and professional-grade data quality.
CoralVQA presents novel challenges and provides a comprehensive benchmark for
studying vision-language reasoning in the context of coral reef images. By
evaluating several state-of-the-art LVLMs, we reveal key limitations and
opportunities. These insights form a foundation for future LVLM development,
with a particular emphasis on supporting coral conservation efforts.
☆ 4D-Animal: Freely Reconstructing Animatable 3D Animals from Videos
Shanshan Zhong, Jiawei Peng, Zehan Zheng, Zhongzhan Huang, Wufei Ma, Guofeng Zhang, Qihao Liu, Alan Yuille, Jieneng Chen
Existing methods for reconstructing animatable 3D animals from videos
typically rely on sparse semantic keypoints to fit parametric models. However,
obtaining such keypoints is labor-intensive, and keypoint detectors trained on
limited animal data are often unreliable. To address this, we propose
4D-Animal, a novel framework that reconstructs animatable 3D animals from
videos without requiring sparse keypoint annotations. Our approach introduces a
dense feature network that maps 2D representations to SMAL parameters,
enhancing both the efficiency and stability of the fitting process.
Furthermore, we develop a hierarchical alignment strategy that integrates
silhouette, part-level, pixel-level, and temporal cues from pre-trained 2D
visual models to produce accurate and temporally coherent reconstructions
across frames. Extensive experiments demonstrate that 4D-Animal outperforms
both model-based and model-free baselines. Moreover, the high-quality 3D assets
generated by our method can benefit other 3D tasks, underscoring its potential
for large-scale applications. The code is released at
https://github.com/zhongshsh/4D-Animal.
☆ CLA: Latent Alignment for Online Continual Self-Supervised Learning
Self-supervised learning (SSL) is able to build latent representations that
generalize well to unseen data. However, only a few SSL techniques exist for
the online continual learning (CL) setting, where data arrives in small
minibatches, the model must comply with a fixed computational budget, and task
boundaries are absent. We
introduce Continual Latent Alignment (CLA), a novel SSL strategy for Online CL
that aligns the representations learned by the current model with past
representations to mitigate forgetting. We found that our CLA is able to speed
up the convergence of the training process in the online scenario,
outperforming state-of-the-art approaches under the same computational budget.
Surprisingly, we also discovered that using CLA as a pretraining protocol in
the early stages of training leads to better final performance compared to
full i.i.d. pretraining.
comment: Accepted at CoLLAs 2025 conference
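A minimal sketch of the latent-alignment idea above: features produced by the current encoder on stored or replayed inputs are pulled toward the representations a past model produced for the same inputs. The cosine-based distance and the way the term enters the overall objective are assumptions of this sketch, not CLA's exact design.
```python
import torch
import torch.nn.functional as F

def latent_alignment_loss(current_encoder, inputs, past_features):
    """Align current representations with stored past representations.

    inputs: minibatch of (possibly replayed) samples, shape (B, ...).
    past_features: representations a frozen past model produced for the same
    inputs, shape (B, D). Uses a cosine-based alignment term (an assumption).
    """
    z = current_encoder(inputs)                              # (B, D)
    return 1.0 - F.cosine_similarity(z, past_features.detach(), dim=-1).mean()

# In an online CL loop this term would be added to the usual SSL objective, e.g.
# loss = ssl_loss + lambda_align * latent_alignment_loss(encoder, x, z_past)
```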
☆ Text-Visual Semantic Constrained AI-Generated Image Quality Assessment
With the rapid advancements in Artificial Intelligence Generated Image (AGI)
technology, the accurate assessment of their quality has become an increasingly
vital requirement. Prevailing methods typically rely on cross-modal models like
CLIP or BLIP to evaluate text-image alignment and visual quality. However, when
applied to AGIs, these methods encounter two primary challenges: semantic
misalignment and insufficient perception of fine details. To address these limitations, we
propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment
(SC-AGIQA), a unified framework that leverages text-visual semantic constraints
to significantly enhance the comprehensive evaluation of both text-image
consistency and perceptual distortion in AI-generated images. Our approach
integrates key capabilities from multiple models and tackles the aforementioned
challenges by introducing two core modules: the Text-assisted Semantic
Alignment Module (TSAM), which leverages Multimodal Large Language Models
(MLLMs) to bridge the semantic gap by generating an image description and
comparing it against the original prompt for a refined consistency check, and
the Frequency-domain Fine-Grained Degradation Perception Module (FFDPM), which
draws inspiration from Human Visual System (HVS) properties by employing
frequency domain analysis combined with perceptual sensitivity weighting to
better quantify subtle visual distortions and enhance the capture of
fine-grained visual quality details in images. Extensive experiments conducted
on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing
state-of-the-art methods. The code is publicly available at
https://github.com/mozhu1/SC-AGIQA.
comment: 9 pages, 5 figures, Accepted at ACMMM 2025
☆ Numerically Computing Galois Groups of Minimal Problems
I discuss a seemingly unlikely confluence of topics in algebra, numerical
computation, and computer vision. The motivating problem is that of solving
multiple instances of a parametric family of systems of algebraic (polynomial
or rational function) equations. No doubt already of interest to ISSAC
attendees, this problem arises in the context of robust model-fitting paradigms
currently utilized by the computer vision community (namely "Random Sampling
and Consensus", aka "RanSaC".) This talk will give an overview of work in the
last 5+ years that aspires to measure the intrinsic difficulty of solving such
parametric systems, and makes strides towards practical solutions.
comment: abstract accompanying invited tutorial at ISSAC 2025; 10 pages w/
references
☆ Text-to-Remote-Sensing-Image Retrieval beyond RGB Sources
Retrieving relevant imagery from vast satellite archives is crucial for
applications like disaster response and long-term climate monitoring. However,
most text-to-image retrieval systems are limited to RGB data, failing to
exploit the unique physical information captured by other sensors, such as the
all-weather structural sensitivity of Synthetic Aperture Radar (SAR) or the
spectral signatures in optical multispectral data. To bridge this gap, we
introduce CrisisLandMark, a new large-scale corpus of over 647,000 Sentinel-1
SAR and Sentinel-2 multispectral images paired with structured textual
annotations for land cover, land use, and crisis events harmonized from
authoritative land cover systems (CORINE and Dynamic World) and crisis-specific
sources. We then present CLOSP (Contrastive Language Optical SAR Pretraining),
a novel framework that uses text as a bridge to align unpaired optical and SAR
images into a unified embedding space. Our experiments show that CLOSP achieves
a new state-of-the-art, improving retrieval nDCG by 54% over existing models.
Additionally, we find that the unified training strategy overcomes the inherent
difficulty of interpreting SAR imagery by transferring rich semantic knowledge
from the optical domain through indirect, text-mediated interaction. Furthermore, GeoCLOSP, which
integrates geographic coordinates into our framework, creates a powerful
trade-off between generality and specificity: while CLOSP excels at general
semantic tasks, GeoCLOSP becomes a specialized expert for retrieving
location-dependent crisis events and rare geographic features. This work
highlights that the integration of diverse sensor data and geographic context
is essential for unlocking the full potential of remote sensing archives.
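The "text as a bridge" alignment can be sketched as two standard CLIP-style contrastive terms sharing one text encoder: text-optical pairs and text-SAR pairs each contribute an InfoNCE loss, so the two image modalities land in a common embedding space without ever being paired directly. Encoder details and loss weighting below are assumptions, not CLOSP's exact recipe.
```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized image and text embeddings (B, D)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def text_bridge_loss(optical_emb, optical_txt_emb, sar_emb, sar_txt_emb):
    """Unpaired optical and SAR batches are each aligned only to their own text,
    with the shared text encoder acting as the bridge between modalities."""
    return clip_style_loss(optical_emb, optical_txt_emb) + clip_style_loss(sar_emb, sar_txt_emb)
```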
☆ Devanagari Handwritten Character Recognition using Convolutional Neural Network
Handwritten character recognition is getting popular among researchers
because of its possible applications in facilitating technological search
engines, social media, recommender systems, etc. The Devanagari script is one
of the oldest language scripts in India that does not have proper digitization
tools. With the advancement of computing and technology, the task of this
research is to extract handwritten Hindi characters from images of Devanagari
script using an automated approach that saves time and manual effort. In this
paper, we present a technique to recognize handwritten Devanagari characters
using a deep convolutional neural network with two convolutional layers. This work employs a
methodology that is useful to enhance the recognition rate and configures a
convolutional neural network for effective Devanagari handwritten text
recognition (DHTR). This approach uses the Devanagari handwritten character
dataset (DHCD), an open dataset with 36 classes of Devanagari characters. Each
of these classes has 1700 images for training and testing purposes. This
approach obtains promising results, achieving 96.36% accuracy on the test set
and 99.55% on the training set.
comment: 9 pages, 6 figures
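A small PyTorch model in the spirit of the two-convolutional-layer design described above; the layer widths and the 32x32 grayscale input resolution are assumptions of this sketch rather than the paper's exact configuration.
```python
import torch
import torch.nn as nn

class DevanagariCNN(nn.Module):
    """Two convolutional layers followed by a classifier over 36 character classes."""

    def __init__(self, num_classes=36):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(128, num_classes),
        )

    def forward(self, x):          # x: (B, 1, 32, 32) grayscale character crops
        return self.classifier(self.features(x))

# logits = DevanagariCNN()(torch.randn(4, 1, 32, 32))  # -> (4, 36)
```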
☆ Improving Remote Sensing Classification using Topological Data Analysis and Convolutional Neural Networks
Topological data analysis (TDA) is a relatively new field that is gaining
rapid adoption due to its robustness and ability to effectively describe
complex datasets by quantifying geometric information. In imaging contexts, TDA
typically models data as filtered cubical complexes from which we can extract
discriminative features using persistent homology. Meanwhile, convolutional
neural networks (CNNs) have been shown to be biased towards texture-based local
features. To address this limitation, we propose a TDA feature engineering
pipeline and a simple method to integrate topological features with deep
learning models for remote sensing classification. Our method improves the
performance of a ResNet18 model on the EuroSAT dataset by 1.44% achieving
99.33% accuracy, which surpasses all previously reported single-model
accuracies, including those with larger architectures, such as ResNet50 (2x
larger) and XL Vision Transformers (197x larger). We additionally show that our
method's accuracy is 1.82% higher than our ResNet18 baseline on the RESISC45
dataset. To our knowledge, this is the first application of TDA features in
satellite scene classification with deep learning. This demonstrates that TDA
features can be integrated with deep learning models, even on datasets without
explicit topological structures, thereby increasing the applicability of TDA. A
clean implementation of our method will be made publicly available upon
publication.
comment: 9 pages, 8 figures
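One common way to turn an image band into topological features, in the spirit of the pipeline above, is to build a filtered cubical complex and summarize its persistence diagrams. The sketch below uses the gudhi library; the summary statistics and the simple concatenation with CNN features are assumptions of this illustration, not necessarily the paper's exact pipeline.
```python
import numpy as np
import gudhi

def persistence_features(img, max_dim=1):
    """Summary statistics of the persistence diagram of a 2D image (one band).

    Builds a filtered cubical complex from pixel intensities and returns, per
    homology dimension, the count, mean, and max of finite persistence values.
    """
    cc = gudhi.CubicalComplex(dimensions=list(img.shape),
                              top_dimensional_cells=img.flatten().astype(float))
    cc.persistence()
    feats = []
    for dim in range(max_dim + 1):
        pairs = cc.persistence_intervals_in_dimension(dim)
        pers = np.array([d - b for b, d in pairs if np.isfinite(d)])
        feats += [len(pers), pers.mean() if len(pers) else 0.0,
                  pers.max() if len(pers) else 0.0]
    return np.array(feats, dtype=np.float32)

# The topological features would then be concatenated with the CNN's pooled
# features before the final classification layer, e.g.
# fused = np.concatenate([cnn_features, persistence_features(band)])
```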
☆ Test-Time Canonicalization by Foundation Models for Robust Perception ICML 2025
Real-world visual perception requires invariance to diverse transformations,
yet current methods rely heavily on specialized architectures or training on
predefined augmentations, limiting generalization. We propose FOCAL, a
test-time, data-driven framework that achieves robust perception by leveraging
internet-scale visual priors from foundation models. By generating and
optimizing candidate transformations toward visually typical, "canonical"
views, FOCAL enhances robustness without re-training or architectural changes.
Our experiments demonstrate improved robustness of CLIP and SAM across
challenging transformations, including 2D/3D rotations, illumination shifts
(contrast and color), and day-night variations. We also highlight potential
applications in active vision. Our approach challenges the assumption that
transform-specific training is necessary, instead offering a scalable path to
invariance. Our code is available at: https://github.com/sutkarsh/focal.
comment: Published at ICML 2025
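The test-time recipe described above (generate candidate transformations, score how "canonical" each transformed view looks under a foundation model, keep the best) can be illustrated as follows. The scoring function and the rotation-only candidate set are placeholders of this sketch, not FOCAL's actual objective or search procedure.
```python
import torch
import torchvision.transforms.functional as TF

def canonicalize(image, score_fn, angles=range(0, 360, 15)):
    """Pick the candidate transformation whose output looks most 'typical'.

    image: (C, H, W) tensor; score_fn: a callable returning a scalar typicality
    score for a batch of images (e.g., some foundation-model-based proxy) -- a
    placeholder assumption in this sketch. Candidates here are 2D rotations only.
    """
    best_img, best_score = image, float('-inf')
    for angle in angles:
        candidate = TF.rotate(image, angle)
        score = float(score_fn(candidate.unsqueeze(0)))
        if score > best_score:
            best_img, best_score = candidate, score
    return best_img  # the downstream model (e.g., CLIP or SAM) runs on this view
```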
☆ Fine-Grained Zero-Shot Object Detection ACM MM'25
Zero-shot object detection (ZSD) aims to leverage semantic descriptions to
localize and recognize objects of both seen and unseen classes. Existing ZSD
works are mainly coarse-grained object detection, where the classes are
visually quite different, thus are relatively easy to distinguish. However, in
real life we often face fine-grained object detection scenarios, where the
classes are too similar to be easily distinguished, for example, different
kinds of birds, fish, and flowers.
In this paper, we propose and solve a new problem called Fine-Grained
Zero-Shot Object Detection (FG-ZSD for short), which aims to detect objects of
different classes with minute differences in details under the ZSD paradigm. We
develop an effective method called MSHC for the FG-ZSD task, which is based on
an improved two-stage detector and employs a multi-level semantics-aware
embedding alignment loss, ensuring tight coupling between the visual and
semantic spaces. Considering that existing ZSD datasets are not suitable for
the new FG-ZSD task, we build the first FG-ZSD benchmark dataset FGZSD-Birds,
which contains 148,820 images falling into 36 orders, 140 families, 579 genera
and 1432 species. Extensive experiments on FGZSD-Birds show that our method
outperforms existing ZSD models.
comment: Accepted by ACM MM'25
☆ Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter
Textual adapter-based tuning methods have shown significant potential in
transferring knowledge from pre-trained Vision-Language Models (VLMs) to
downstream tasks. Existing works generally employ a deterministic textual
feature adapter to refine each category's textual representation. However, due to
inherent factors such as different attributes and contexts, there exists
significant diversity in textual descriptions for each category. Such
description diversity offers rich discriminative semantic knowledge that can
benefit downstream visual learning tasks. Obviously, a traditional deterministic
adapter model cannot adequately capture this varied semantic information. It is
also desirable to exploit inter-class relationships in the VLM adapter. To
address these issues, we introduce a random graph model into the VLM adapter
and develop a novel Vertex Random Graph Adapter (VRGAdapter). VRGAdapter first
models the inherent diverse descriptions of each category and inter-class
relationships of different categories simultaneously by leveraging a Vertex
Random Knowledge Graph (VRKG) model. Then, it employs probabilistic message
propagation on VRKG to learn context-aware distribution representation for each
class node. Finally, it adopts a reparameterized sampling function to achieve
textual adapter learning. Note that VRGAdapter provides a more general adapter
solution that encompasses the traditional graph-based adapter as a special case. In
addition, to enable more robust performance for downstream tasks, we also
introduce a new Uncertainty-guided Multi-branch Fusion (UMF) scheme that
dynamically integrates multiple pre-trained models for ensemble prediction.
Extensive experiments on multiple benchmark datasets demonstrate the
effectiveness of our approach.
☆ FGSSNet: Feature-Guided Semantic Segmentation of Real World Floorplans
We introduce FGSSNet, a novel multi-headed feature-guided semantic
segmentation (FGSS) architecture designed to improve the generalization ability
of wall segmentation on floorplans. FGSSNet features a U-Net segmentation
backbone with a multi-headed dedicated feature extractor used to extract
domain-specific feature maps which are injected into the latent space of U-Net
to guide the segmentation process. This dedicated feature extractor is trained
as an encoder-decoder with selected wall patches, representative of the walls
present in the input floorplan, to produce a compressed latent representation
of wall patches, while being jointly trained to predict the wall width. In doing so,
we expect that the feature extractor encodes texture and width features of wall
patches that are useful to guide the wall segmentation process. Our experiments
show increased performance by the use of such injected features in comparison
to the vanilla U-Net, highlighting the validity of the proposed approach.
comment: Accepted at International Workshop on Artificial Intelligence and
Pattern Recognition, IWAIPR 2025
☆ Text Embedding Knows How to Quantize Text-Guided Diffusion Models ICCV 2025
Despite the success of diffusion models in image generation tasks such as
text-to-image, the enormous computational complexity of diffusion models limits
their use in resource-constrained environments. To address this, network
quantization has emerged as a promising solution for designing efficient
diffusion models. However, existing diffusion model quantization methods do not
consider input conditions, such as text prompts, as an essential source of
information for quantization. In this paper, we propose a novel quantization
method dubbed Quantization of Language-to-Image diffusion models using text
Prompts (QLIP). QLIP leverages text prompts to guide the selection of bit
precision for every layer at each time step. In addition, QLIP can be
seamlessly integrated into existing quantization methods to enhance
quantization efficiency. Our extensive experiments demonstrate the
effectiveness of QLIP in reducing computational complexity and improving the
quality of the generated images across various datasets.
comment: ICCV 2025
☆ Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching ICCV 2025
Leveraging the vision foundation models has emerged as a mainstream paradigm
that improves the performance of image feature matching. However, previous
works have ignored the misalignment when introducing the foundation models into
feature matching. The misalignment arises from the discrepancy between the
foundation models focusing on single-image understanding and the cross-image
understanding requirement of feature matching. Specifically, 1) the embeddings
derived from commonly used foundation models exhibit discrepancies with the
optimal embeddings required for feature matching; 2) an effective mechanism for
transferring the single-image understanding ability to cross-image
understanding is lacking. A significant consequence of this misalignment is
that these models struggle when addressing multi-instance feature matching
problems. To address this, we
introduce a simple but effective framework, called IMD (Image feature Matching
with a pre-trained Diffusion model) with two parts: 1) Unlike the dominant
solutions employing contrastive-learning-based foundation models that emphasize
global semantics, we integrate generative diffusion models to effectively
capture instance-level details. 2) We leverage the prompt mechanism in the
generative model as a natural tunnel and propose a novel cross-image
interaction prompting module to facilitate bidirectional information
interaction between image pairs. To more accurately measure the misalignment,
we propose a new benchmark called IMIM, which focuses on multi-instance
scenarios. Our proposed IMD establishes a new state-of-the-art on commonly
evaluated benchmarks, and a 12% improvement on IMIM indicates that our method
effectively mitigates the misalignment.
comment: Accepted by ICCV 2025
☆ Contrastive Pretraining with Dual Visual Encoders for Gloss-Free Sign Language Translation
Sign Language Translation (SLT) aims to convert sign language videos into
spoken or written text. While early systems relied on gloss annotations as an
intermediate supervision, such annotations are costly to obtain and often fail
to capture the full complexity of continuous signing. In this work, we propose
a two-phase, dual visual encoder framework for gloss-free SLT, leveraging
contrastive visual-language pretraining. During pretraining, our approach
employs two complementary visual backbones whose outputs are jointly aligned
with each other and with sentence-level text embeddings via a contrastive
objective. During the downstream SLT task, we fuse the visual features and
input them into an encoder-decoder model. On the Phoenix-2014T benchmark, our
dual-encoder architecture consistently outperforms its single-stream variants
and achieves the highest BLEU-4 score among existing gloss-free SLT approaches.
comment: Accepted at 9th Workshop on Sign Language Translation and Avatar
Technologies (SLTAT), will be held in conjunction with IVA'25
☆ DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs ICCV 2025
In video Multimodal Large Language Models (video MLLMs), the visual
encapsulation process plays a pivotal role in converting video contents into
representative tokens for LLM input. While linear projectors are widely
employed for encapsulation, they introduce semantic indistinctness and temporal
incoherence when applied to videos. Conversely, the structure of resamplers
shows promise in tackling these challenges, but an effective solution remains
unexplored. Drawing inspiration from resampler structures, we introduce DisCo,
a novel visual encapsulation method designed to yield semantically distinct and
temporally coherent visual tokens for video MLLMs. DisCo integrates two key
components: (1) A Visual Concept Discriminator (VCD) module, assigning unique
semantics to visual tokens by pairing them with discriminative
concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, ensuring
consistent temporal focus of visual tokens to video elements across every video
frame. Through extensive experiments on multiple video MLLM frameworks, we
demonstrate that DisCo remarkably outperforms previous state-of-the-art methods
across a variety of video understanding benchmarks, while also achieving higher
token efficiency thanks to the reduction of semantic indistinctness. The code:
https://github.com/ZJHTerry18/DisCo.
comment: ICCV 2025
☆ FaceLLM: A Multimodal Large Language Model for Face Understanding ICCV 2025
Multimodal large language models (MLLMs) have shown remarkable performance in
vision-language tasks. However, existing MLLMs are primarily trained on generic
datasets, limiting their ability to reason on domain-specific visual cues such
as those in facial images. In particular, tasks that require detailed
understanding of facial structure, expression, emotion, and demographic
features remain underexplored by MLLMs due to the lack of large-scale annotated
face image-text datasets. In this work, we introduce FaceLLM, a multimodal
large language model trained specifically for facial image understanding. To
construct the training data, we propose a novel weakly supervised pipeline that
uses ChatGPT with attribute-aware prompts to generate high-quality
question-answer pairs based on images from the FairFace dataset. The resulting
corpus, called FairFaceGPT, covers a diverse set of attributes including
expression, pose, skin texture, and forensic information. Our experiments
demonstrate that FaceLLM improves the performance of MLLMs on various
face-centric tasks and achieves state-of-the-art performance. This work
highlights the potential of synthetic supervision via language models for
building domain-specialized MLLMs, and sets a precedent for trustworthy,
human-centric multimodal AI systems. FairFaceGPT dataset and pretrained FaceLLM
models are publicly available in the project page.
comment: Accepted in ICCV 2025 workshops
☆ Show and Polish: Reference-Guided Identity Preservation in Face Video Restoration
Face Video Restoration (FVR) aims to recover high-quality face videos from
degraded versions. Traditional methods struggle to preserve fine-grained,
identity-specific features when degradation is severe, often producing
average-looking faces that lack individual characteristics. To address these
challenges, we introduce IP-FVR, a novel method that leverages a high-quality
reference face image as a visual prompt to provide identity conditioning during
the denoising process. IP-FVR incorporates semantically rich identity
information from the reference image using decoupled cross-attention
mechanisms, ensuring detailed and identity-consistent results. For intra-clip
identity drift (within 24 frames), we introduce an identity-preserving feedback
learning method that combines cosine similarity-based reward signals with
suffix-weighted temporal aggregation. This approach effectively minimizes drift
within sequences of frames. For inter-clip identity drift, we develop an
exponential blending strategy that aligns identities across clips by
iteratively blending frames from previous clips during the denoising process.
This method ensures consistent identity representation across different clips.
Additionally, we enhance the restoration process with a multi-stream negative
prompt, guiding the model's attention to relevant facial attributes and
minimizing the generation of low-quality or incorrect features. Extensive
experiments on both synthetic and real-world datasets demonstrate that IP-FVR
outperforms existing methods in both quality and identity preservation,
showcasing its substantial potential for practical applications in face video
restoration.
comment: Accepted by MM 2025
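The inter-clip exponential blending strategy can be pictured as mixing the first frames of the current clip with the last frames of the previous clip, with a weight that decays exponentially with frame index. The decay rate and where exactly in the denoising loop this is applied are assumptions of this sketch.
```python
import torch

def exponential_blend(prev_clip_tail, curr_clip_head, decay=0.5):
    """Blend overlapping frames across clip boundaries to stabilize identity.

    prev_clip_tail, curr_clip_head: tensors of shape (T, C, H, W) holding the
    last/first T frames of consecutive clips. Frame k of the current clip is
    mixed with the previous clip using weight decay**(k+1), so the influence of
    the previous clip fades exponentially over the new clip.
    """
    T = curr_clip_head.shape[0]
    weights = decay ** torch.arange(1, T + 1, dtype=curr_clip_head.dtype)  # (T,)
    w = weights.view(T, 1, 1, 1)
    return w * prev_clip_tail + (1.0 - w) * curr_clip_head
```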
☆ FTCFormer: Fuzzy Token Clustering Transformer for Image Classification
Transformer-based deep neural networks have achieved remarkable success
across various computer vision tasks, largely attributed to their long-range
self-attention mechanism and scalability. However, most transformer
architectures embed images into uniform, grid-based vision tokens, neglecting
the underlying semantic meanings of image regions, resulting in suboptimal
feature representations. To address this issue, we propose Fuzzy Token
Clustering Transformer (FTCFormer), which incorporates a novel clustering-based
downsampling module to dynamically generate vision tokens based on the semantic
meanings instead of spatial positions. It allocates fewer tokens to less
informative regions and more to represent semantically important regions,
regardless of their spatial adjacency or shape irregularity. To further enhance
feature extraction and representation, we propose a Density Peak
Clustering-Fuzzy K-Nearest Neighbor (DPC-FKNN) mechanism for clustering center
determination, a Spatial Connectivity Score (SCS) for token assignment, and a
channel-wise merging (Cmerge) strategy for token merging. Extensive experiments
on 32 datasets across diverse domains validate the effectiveness of FTCFormer
on image classification, showing consistent improvements over the TCFormer
baseline, achieving gains of 1.43% on five fine-grained datasets, 1.09% on six
natural image datasets, 0.97% on three medical datasets, and 0.55%
on four remote sensing datasets. The code is available at:
https://github.com/BaoBao0926/FTCFormer/tree/main.
☆ Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures ICCV 2025
Camera pose estimation is a fundamental computer vision task that is
essential for applications like visual localization and multi-view stereo
reconstruction. In the object-centric scenarios with sparse inputs, the
accuracy of pose estimation can be significantly influenced by background
textures that occupy major portions of the images across different viewpoints.
In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which
uses identical segments to form discs with multi-fold radial symmetry. These
discs maintain high similarity across different viewpoints, enabling effective
attacks on pose estimation models even with natural texture segments.
Additionally, a projected orientation consistency loss is proposed to optimize
the kaleidoscopic segments, leading to significant enhancement in the attack
effectiveness. Experimental results show that optimized adversarial
kaleidoscopic backgrounds can effectively attack various camera pose estimation
models.
comment: Accepted at ICCV 2025. Project page is available at
https://wakuwu.github.io/KBA
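A multi-fold radially symmetric disc of the kind described above can be generated by replicating a single angular wedge of a texture around the center. The NumPy construction below is a plain, non-adversarial illustration, not the optimized attack textures from the paper.
```python
import numpy as np

def kaleidoscopic_disc(texture, folds=8):
    """Build a disc with `folds`-fold radial symmetry from a square texture patch.

    Every pixel's angle is wrapped into a single wedge of width 2*pi/folds, so the
    same texture segment repeats around the center and the disc looks nearly
    identical from viewpoints related by that rotation.
    """
    h, w = texture.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    dy, dx = yy - cy, xx - cx
    r = np.sqrt(dx ** 2 + dy ** 2)
    theta = np.arctan2(dy, dx) % (2 * np.pi / folds)     # wrap angle into one wedge
    # Sample the source texture at the wrapped polar coordinates.
    src_y = np.clip(np.round(cy + r * np.sin(theta)).astype(int), 0, h - 1)
    src_x = np.clip(np.round(cx + r * np.cos(theta)).astype(int), 0, w - 1)
    disc = texture[src_y, src_x]
    disc[r > min(cx, cy)] = 0                            # keep only the disc
    return disc

# disc = kaleidoscopic_disc(np.random.rand(256, 256, 3), folds=6)
```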
☆ DepViT-CAD: Deployable Vision Transformer-Based Cancer Diagnosis in Histopathology
Accurate and timely cancer diagnosis from histopathological slides is vital
for effective clinical decision-making. This paper introduces DepViT-CAD, a
deployable AI system for multi-class cancer diagnosis in histopathology. At its
core is MAViT, a novel Multi-Attention Vision Transformer designed to capture
fine-grained morphological patterns across diverse tumor types. MAViT was
trained on expert-annotated patches from 1008 whole-slide images, covering 11
diagnostic categories, including 10 major cancers and non-tumor tissue.
DepViT-CAD was validated on two independent cohorts: 275 WSIs from The Cancer
Genome Atlas and 50 routine clinical cases from pathology labs, achieving
diagnostic sensitivities of 94.11% and 92%, respectively. By combining
state-of-the-art transformer architecture with large-scale real-world
validation, DepViT-CAD offers a robust and scalable approach for AI-assisted
cancer diagnostics. To support transparency and reproducibility, software and
code will be made publicly available on GitHub.
comment: 25 pages, 15 figures
☆ Transferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks ECAI 2025
Recent research has investigated the shape and texture biases of deep neural
networks (DNNs) in image classification which influence their generalization
capabilities and robustness. It has been shown that, in comparison to regular
DNN training, training with stylized images reduces texture biases in image
classification and improves robustness with respect to image corruptions. In an
effort to advance this line of research, we examine whether style transfer can
likewise deliver these two effects in semantic segmentation. To this end, we
perform style transfer with style varying across artificial image areas. Those
random areas are formed by a chosen number of Voronoi cells. The resulting
style-transferred data is then used to train semantic segmentation DNNs with
the objective of reducing their dependence on texture cues while enhancing
their reliance on shape-based features. In our experiments, it turns out that
in semantic segmentation, style transfer augmentation reduces texture bias and
strongly increases robustness with respect to common image corruptions as well
as adversarial attacks. These observations hold for convolutional neural
networks and transformer architectures on the Cityscapes dataset as well as on
PASCAL Context, showing the generality of the proposed method.
comment: accepted at ECAI 2025
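The random Voronoi areas used above to vary style across the image can be created by sampling seed points and assigning each pixel to its nearest seed; a minimal NumPy version of that partitioning step (style transfer itself is then applied per cell) is shown below. Function names and the number of cells are illustrative.
```python
import numpy as np

def voronoi_cell_map(height, width, num_cells, seed=0):
    """Label map of shape (H, W) assigning every pixel to its nearest random seed.

    Each label then marks one artificial image area that receives its own style
    during style-transfer augmentation.
    """
    rng = np.random.default_rng(seed)
    seeds = np.stack([rng.uniform(0, height, num_cells),
                      rng.uniform(0, width, num_cells)], axis=1)   # (N, 2)
    yy, xx = np.mgrid[0:height, 0:width]
    coords = np.stack([yy, xx], axis=-1).astype(float)             # (H, W, 2)
    dists = np.linalg.norm(coords[:, :, None, :] - seeds[None, None], axis=-1)
    return dists.argmin(axis=-1)                                   # (H, W) cell ids

# labels = voronoi_cell_map(512, 1024, num_cells=8)
# Pixels with labels == k would then be stylized with the k-th randomly drawn style.
```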
☆ Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?
Despina Konstantinidou, Dimitrios Karageorgiou, Christos Koutlis, Olga Papadopoulou, Emmanouil Schinas, Symeon Papadopoulos
The rapid advancement of generative technologies presents both unprecedented
creative opportunities and significant challenges, particularly in maintaining
social trust and ensuring the integrity of digital information. Following these
concerns, the challenge of AI-Generated Image Detection (AID) becomes
increasingly critical. As these technologies become more sophisticated, the
quality of AI-generated images has reached a level that can easily deceive even
the most discerning observers. Our systematic evaluation highlights a critical
weakness in current AI-Generated Image Detection models: while they perform
exceptionally well on controlled benchmark datasets, they struggle
significantly with real-world variations. To assess this, we introduce ITW-SM,
a new dataset of real and AI-generated images collected from major social media
platforms. In this paper, we identify four key factors that influence AID
performance in real-world scenarios: backbone architecture, training data
composition, pre-processing strategies and data augmentation combinations. By
systematically analyzing these components, we shed light on their impact on
detection efficacy. Our modifications result in an average AUC improvement of
26.87% across various AID models under real-world conditions.
comment: 35 pages, 4 figures
☆ Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection
Pre-trained vision-language models have exhibited remarkable abilities in
detecting out-of-distribution (OOD) samples. However, some challenging OOD
samples, which lie close to in-distribution (InD) data in image feature space,
can still lead to misclassification. The emergence of foundation models like
diffusion models and multimodal large language models (MLLMs) offers a
potential solution to this issue. In this work, we propose SynOOD, a novel
approach that harnesses foundation models to generate synthetic, challenging
OOD data for fine-tuning CLIP models, thereby enhancing boundary-level
discrimination between InD and OOD samples. Our method uses an iterative
in-painting process guided by contextual prompts from MLLMs to produce nuanced,
boundary-aligned OOD samples. These samples are refined through noise
adjustments based on gradients from OOD scores like the energy score,
effectively sampling from the InD/OOD boundary. With these carefully
synthesized images, we fine-tune the CLIP image encoder and negative label
features derived from the text encoder to strengthen connections between
near-boundary OOD samples and a set of negative labels. Finally, SynOOD
achieves state-of-the-art performance on the large-scale ImageNet benchmark,
with minimal increases in parameters and runtime. Our approach significantly
surpasses existing methods, improving AUROC by 2.80% and reducing FPR95 by
11.13%. Code is available at https://github.com/Jarvisgivemeasuit/SynOOD.
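The energy score referenced above, used to steer samples toward the InD/OOD boundary, is a standard quantity computed from logits; a minimal version is shown below. How SynOOD backpropagates it into the in-painting noise is specific to the paper and not reproduced here.
```python
import torch

def energy_score(logits, temperature=1.0):
    """Energy-based OOD score: lower energy indicates more in-distribution.

    logits: (B, K) similarity logits, e.g., CLIP image-text similarities over K
    in-distribution class prompts.
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

# Because the score is differentiable, its gradient w.r.t. a synthesized image's
# latent noise can be used to nudge samples toward the InD/OOD boundary, e.g.
# grad = torch.autograd.grad(energy_score(logits).sum(), noise)[0]
```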
☆ ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users ICCV'25
Prosthetic legs play a pivotal role in clinical rehabilitation, allowing
individuals with lower-limb amputations the ability to regain mobility and
improve their quality of life. Gait analysis is fundamental for optimizing
prosthesis design and alignment, directly impacting the mobility and life
quality of individuals with lower-limb amputations. Vision-based machine
learning (ML) methods offer a scalable and non-invasive solution to gait
analysis, but face challenges in correctly detecting and analyzing prostheses,
due to their unique appearances and new movement patterns. In this paper, we
aim to bridge this gap by introducing a multi-purpose dataset, namely ProGait,
to support multiple vision tasks including Video Object Segmentation, 2D Human
Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from
four above-knee amputees when testing multiple newly-fitted prosthetic legs
through walking trials, and depicts the presence, contours, poses, and gait
patterns of human subjects with transfemoral prosthetic legs. Alongside the
dataset itself, we also present benchmark tasks and fine-tuned baseline models
to illustrate the practical application and performance of the ProGait dataset.
We compared our baseline models against pre-trained vision models,
demonstrating improved generalizability when applying the ProGait dataset for
prosthesis-specific tasks. Our code is available at
https://github.com/pittisl/ProGait and dataset at
https://huggingface.co/datasets/ericyxy98/ProGait.
comment: Accepted by ICCV'25
☆ Spatial Lifting for Dense Prediction
We present Spatial Lifting (SL), a novel methodology for dense prediction
tasks. SL operates by lifting standard inputs, such as 2D images, into a
higher-dimensional space and subsequently processing them using networks
designed for that higher dimension, such as a 3D U-Net. Counterintuitively,
this dimensionality lifting allows us to achieve good performance on benchmark
tasks compared to conventional approaches, while reducing inference costs and
significantly lowering the number of model parameters. The SL framework
produces intrinsically structured outputs along the lifted dimension. This
emergent structure facilitates dense supervision during training and enables
robust, near-zero-additional-cost prediction quality assessment at test time.
We validate our approach across 19 benchmark datasets (13 for semantic
segmentation and 6 for depth estimation), demonstrating competitive dense
prediction performance while reducing the model parameter count by over 98% (in
the U-Net case) and lowering inference costs. Spatial Lifting introduces a new
vision modeling paradigm that offers a promising path toward more efficient,
accurate, and reliable deep networks for dense prediction tasks in vision.
comment: Preprint. Under review
☆ Straighten Viscous Rectified Flow via Noise Optimization
The Reflow operation aims to straighten the inference trajectories of the
rectified flow during training by constructing deterministic couplings between
noises and images, thereby improving the quality of generated images in
single-step or few-step generation. However, we identify critical limitations
in Reflow, particularly its inability to rapidly generate high-quality images
due to a distribution gap between images in its constructed deterministic
couplings and real images. To address these shortcomings, we propose a novel
alternative called Straighten Viscous Rectified Flow via Noise Optimization
(VRFNO), which is a joint training framework integrating an encoder and a
neural velocity field. VRFNO introduces two key innovations: (1) a historical
velocity term that enhances trajectory distinction, enabling the model to more
accurately predict the velocity of the current trajectory, and (2) noise
optimization through reparameterization to form optimized couplings with real
images, which are then utilized for training, effectively mitigating errors
caused by Reflow's limitations. Comprehensive experiments on synthetic data and
real datasets with varying resolutions show that VRFNO significantly mitigates
the limitations of Reflow, achieving state-of-the-art performance in both
one-step and few-step generation tasks.
☆ From Wardrobe to Canvas: Wardrobe Polyptych LoRA for Part-level Controllable Human Image Generation
Recent diffusion models achieve personalization by learning specific
subjects, allowing learned attributes to be integrated into generated images.
However, personalized human image generation remains challenging due to the
need for precise and consistent attribute preservation (e.g., identity,
clothing details). Existing subject-driven image generation methods often
require either (1) inference-time fine-tuning with few images for each new
subject or (2) large-scale dataset training for generalization. Both approaches
are computationally expensive and impractical for real-time applications. To
address these limitations, we present Wardrobe Polyptych LoRA, a novel
part-level controllable model for personalized human image generation. By
training only LoRA layers, our method removes the computational burden at
inference while ensuring high-fidelity synthesis of unseen subjects. Our key
idea is to condition the generation on the subject's wardrobe and leverage
spatial references to reduce information loss, thereby improving fidelity and
consistency. Additionally, we introduce a selective subject region loss, which
encourages the model to disregard some of the reference images during training. Our
loss ensures that generated images better align with text prompts while
maintaining subject integrity. Notably, our Wardrobe Polyptych LoRA requires no
additional parameters at the inference stage and performs generation using a
single model trained on a few training samples. We construct a new dataset and
benchmark tailored for personalized human image generation. Extensive
experiments show that our approach significantly outperforms existing
techniques in fidelity and consistency, enabling realistic and
identity-preserving full-body synthesis.
comment: 10 pages, 8 figures
☆ Boosting Multimodal Learning via Disentangled Gradient Learning ICCV2025
Multimodal learning often encounters the under-optimized problem and may have
worse performance than unimodal learning. Existing methods attribute this
problem to the imbalanced learning between modalities and rebalance them
through gradient modulation. However, they fail to explain why the dominant
modality in multimodal models also underperforms that in unimodal learning. In
this work, we reveal the optimization conflict between the modality encoder and
modality fusion module in multimodal models. Specifically, we prove that the
cross-modal fusion in multimodal models decreases the gradient passed back to
each modality encoder compared with unimodal models. Consequently, the
performance of each modality in the multimodal model is inferior to that in the
unimodal model. To this end, we propose a disentangled gradient learning (DGL)
framework to decouple the optimization of the modality encoder and modality
fusion module in the multimodal model. DGL truncates the gradient
back-propagated from the multimodal loss to the modality encoder and replaces
it with the gradient from unimodal loss. Besides, DGL removes the gradient
back-propagated from the unimodal loss to the modality fusion module. This
helps eliminate the gradient interference between the modality encoder and
modality fusion module while ensuring their respective optimization processes.
Finally, extensive experiments on multiple types of modalities, tasks, and
frameworks with dense cross-modal interaction demonstrate the effectiveness and
versatility of the proposed DGL. Code is available at
\href{https://github.com/shicaiwei123/ICCV2025-GDL}{https://github.com/shicaiwei123/ICCV2025-GDL}
comment: Accepted to ICCV2025
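The gradient decoupling described above can be sketched in a few lines of PyTorch: the fusion module sees detached encoder features, so the multimodal loss cannot back-propagate into the encoders, which are instead driven by their unimodal losses. The two-modality toy model below is an assumption-level illustration, not the released code.
```python
import torch
import torch.nn as nn

class ToyDGLModel(nn.Module):
    def __init__(self, dim=128, num_classes=10):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(64, dim), nn.ReLU())   # audio-like encoder
        self.enc_v = nn.Sequential(nn.Linear(64, dim), nn.ReLU())   # visual-like encoder
        self.head_a = nn.Linear(dim, num_classes)                   # unimodal heads
        self.head_v = nn.Linear(dim, num_classes)
        self.fusion = nn.Linear(2 * dim, num_classes)               # fusion module

    def forward(self, xa, xv):
        fa, fv = self.enc_a(xa), self.enc_v(xv)
        la, lv = self.head_a(fa), self.head_v(fv)                   # gradients reach encoders
        # detach() truncates the multimodal gradient before it reaches the encoders
        lm = self.fusion(torch.cat([fa.detach(), fv.detach()], dim=-1))
        return la, lv, lm

model, ce = ToyDGLModel(), nn.CrossEntropyLoss()
xa, xv, y = torch.randn(8, 64), torch.randn(8, 64), torch.randint(0, 10, (8,))
la, lv, lm = model(xa, xv)
loss = ce(la, y) + ce(lv, y) + ce(lm, y)   # encoders are updated by unimodal terms only
loss.backward()
```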
☆ Is Micro-expression Ethnic Leaning?
How much of a role does ethnicity play in emotional expression? Emotional
expression and micro-expression research probe human psychological responses to
emotional stimuli, thereby revealing substantial hidden yet authentic emotions
that can be useful in diagnosis and interviews. While increased attention has
been paid to micro-expression analysis, these studies were conducted under
Ekman's assumption of emotion
universality, where emotional expressions are identical across cultures and
social contexts. Our computational study uncovers some of the influences of
ethnic background in expression analysis, leading to an argument that the
emotional universality hypothesis is an overgeneralization from the perspective
of manual psychological analysis. In this research, we propose to investigate
the level of influence of ethnicity in a simulated micro-expression scenario.
We construct a cross-cultural micro-expression database and algorithmically
annotate the ethnic labels to facilitate the investigation. With the ethnically
annotated dataset, we perform a prima facie study to compare mono-ethnicity and
stereo-ethnicity in a controlled environment, which experimentally uncovers a
certain influence of ethnic bias. Building on this finding, we
propose a framework that integrates ethnic context into the emotional feature
learning process, yielding an ethnically aware framework that recognises
ethnicity differences in micro-expression recognition. For improved
understanding, qualitative analyses have been done to solidify the preliminary
investigation into this new realm of research. Code is publicly available at
https://github.com/IcedDoggie/ICMEW2025_EthnicMER
☆ Improving Multimodal Learning via Imbalanced Learning ICCV2025
Multimodal learning often encounters the under-optimized problem and may
perform worse than unimodal learning. Existing approaches attribute this issue
to imbalanced learning across modalities and tend to address it through
gradient balancing. However, this paper argues that balanced learning is not
the optimal setting for multimodal learning. With bias-variance analysis, we
prove that imbalanced dependency on each modality obeying the inverse ratio of
their variances contributes to optimal performance. To this end, we propose the
Asymmetric Representation Learning (ARL) strategy to assist multimodal learning
via imbalanced optimization. ARL introduces auxiliary regularizers for each
modality encoder to calculate their prediction variance. ARL then calculates
coefficients via the unimodal variance to re-weight the optimization of each
modality, forcing the modality dependence ratio to be inversely proportional to
the modality variance ratio. Moreover, to minimize the generalization error,
ARL further introduces the prediction bias of each modality and jointly
optimizes them with multimodal loss. Notably, all auxiliary regularizers share
parameters with the multimodal model and rely only on the modality
representation. Thus the proposed ARL strategy introduces no extra parameters
and is independent of the structures and fusion methods of the multimodal
model. Finally, extensive experiments on various datasets validate the
effectiveness and versatility of ARL. Code is available at
\href{https://github.com/shicaiwei123/ICCV2025-ARL}{https://github.com/shicaiwei123/ICCV2025-ARL}
comment: Accepted to ICCV2025
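A minimal sketch of the inverse-variance re-weighting idea follows, assuming a simple per-modality variance proxy computed from softmax predictions; the actual ARL regularizers, coefficients, and bias terms are more elaborate than shown here.
```python
import torch
import torch.nn.functional as F

def modality_weights(var_a, var_v, eps=1e-6):
    inv = torch.stack([1.0 / (var_a + eps), 1.0 / (var_v + eps)])
    return inv / inv.sum()   # dependence inversely proportional to variance

# Stand-in unimodal predictions; in ARL these come from auxiliary regularizers
probs_a = F.softmax(torch.randn(8, 10), dim=-1)
probs_v = F.softmax(torch.randn(8, 10), dim=-1)
var_a = probs_a.var(dim=-1).mean()     # crude prediction-variance proxy per modality
var_v = probs_v.var(dim=-1).mean()

w_a, w_v = modality_weights(var_a, var_v)
y = torch.randint(0, 10, (8,))
loss = w_a * F.nll_loss(probs_a.log(), y) + w_v * F.nll_loss(probs_v.log(), y)
```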
☆ A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images CVPR 2025
Multimodal Large Language Models (MLLMs) have demonstrated remarkable
capabilities in vision-language understanding, reasoning, and generation.
However, they struggle with tasks requiring fine-grained localization and
reasoning in high-resolution images. This constraint stems from the fact that
MLLMs are fine-tuned with fixed image resolution to align with the pre-trained
image encoder used in MLLM. Consequently, feeding high-resolution images
directly into MLLMs leads to poor generalization due to a train-test resolution
discrepancy, while downsampling these images, although ensuring consistency,
compromises fine-grained visual details and ultimately degrades
performance. To address this challenge, we propose Extract Candidate then
Predict (ECP), a novel training-free, task-agnostic two-stage framework
designed to enhance MLLM performance on high-resolution images. The key
intuition behind ECP is that while MLLMs struggle with high-resolution images,
their predictions on downsampled images still contain implicit localization
cues. By first identifying a candidate region using the coarse prediction and
then predicting the final output based on that region, ECP effectively
preserves fine-grained details while mitigating the challenges posed by
high-resolution data. We validate our framework on 4K GUI grounding and 4K, 8K
MLLM perception, achieving +21.3%, +5.8%, and +5.2% absolute improvements over
the baseline, respectively, demonstrating its effectiveness. Code is available at
https://github.com/yenncye/ECP.
comment: Accepted at CVPR 2025 Workshop on Emergent Visual Abilities and
Limits of Foundation Models
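The two-stage coarse-to-fine procedure can be sketched as plain bookkeeping around a generic MLLM call. In the snippet below, `mllm_predict` is a hypothetical callable that returns an (x, y) location in the coordinates of whatever image it receives; resolutions and crop size are illustrative assumptions.
```python
from PIL import Image

def extract_candidate_then_predict(image, query, mllm_predict,
                                   low_res=(1024, 1024), crop_size=512):
    W, H = image.size
    small = image.resize(low_res)
    # Stage 1: coarse prediction on the downsampled image provides a location cue.
    cx, cy = mllm_predict(small, query)
    cx, cy = cx * W / low_res[0], cy * H / low_res[1]       # map cue to full resolution
    # Stage 2: crop a full-resolution candidate region around the cue and re-predict.
    left = max(0, min(W - crop_size, int(cx - crop_size / 2)))
    top = max(0, min(H - crop_size, int(cy - crop_size / 2)))
    crop = image.crop((left, top, left + crop_size, top + crop_size))
    px, py = mllm_predict(crop, query)
    return px + left, py + top                              # refined global coordinates

dummy = lambda img, q: (img.size[0] / 2, img.size[1] / 2)   # placeholder predictor
print(extract_candidate_then_predict(Image.new("RGB", (4096, 2160)), "icon", dummy))
```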
☆ Minimizing the Pretraining Gap: Domain-aligned Text-Based Person Retrieval
In this work, we focus on text-based person retrieval, which aims to identify
individuals based on textual descriptions. Given the significant privacy issues
and the high cost associated with manual annotation, synthetic data has become
a popular choice for pretraining models, leading to notable advancements.
However, the considerable domain gap between synthetic pretraining datasets and
real-world target datasets, characterized by differences in lighting, color,
and viewpoint, remains a critical obstacle that hinders the effectiveness of
the pretrain-finetune paradigm. To bridge this gap, we introduce a unified
text-based person retrieval pipeline considering domain adaptation at both
image and region levels. In particular, it contains two primary components,
i.e., Domain-aware Diffusion (DaD) for image-level adaptation and
Multi-granularity Relation Alignment (MRA) for region-level adaptation. As the
name implies, Domain-aware Diffusion migrates the distribution of images
from the pretraining dataset domain to the target real-world dataset domain,
e.g., CUHK-PEDES. Subsequently, MRA performs a meticulous region-level
alignment by establishing correspondences between visual regions and their
descriptive sentences, thereby addressing disparities at a finer granularity.
Extensive experiments show that our dual-level adaptation method has achieved
state-of-the-art results on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets,
outperforming existing methodologies. The dataset, model, and code are
available at https://github.com/Shuyu-XJTU/MRA.
☆ Learning Private Representations through Entropy-based Adversarial Training
How can we learn a representation with high predictive power while preserving
user privacy? We present an adversarial representation learning method for
sanitizing sensitive content from the learned representation. Specifically, we
introduce a variant of entropy, termed focal entropy, which mitigates the
potential information leakage of existing entropy-based approaches. We showcase
feasibility on multiple benchmarks. The results suggest high target utility at
moderate privacy leakage.
☆ SlumpGuard: An AI-Powered Real-Time System for Automated Concrete Slump Prediction via Video Analysis
Concrete workability is essential for construction quality, with the slump
test being the most common on-site method for its assessment. However,
traditional slump testing is manual, time-consuming, and prone to
inconsistency, limiting its applicability for real-time monitoring. To address
these challenges, we propose SlumpGuard, an AI-powered, video-based system that
automatically analyzes concrete flow from the truck chute to assess workability
in real time. Our system enables full-batch inspection without manual
intervention, improving both the accuracy and efficiency of quality control. We
present the system design, the construction of a dedicated dataset, and
empirical results from real-world deployment, demonstrating the effectiveness
of SlumpGuard as a practical solution for modern concrete quality assurance.
☆ Deep Recurrence for Dynamical Segmentation Models
While biological vision systems rely heavily on feedback connections to
iteratively refine perception, most artificial neural networks remain purely
feedforward, processing input in a single static pass. In this work, we propose
a predictive coding inspired feedback mechanism that introduces a recurrent
loop from output to input, allowing the model to refine its internal state over
time. We implement this mechanism within a standard U-Net architecture and
introduce two biologically motivated operations, softmax projection and
exponential decay, to ensure stability of the feedback loop. Through controlled
experiments on a synthetic segmentation task, we show that the feedback model
significantly outperforms its feedforward counterpart in noisy conditions and
generalizes more effectively with limited supervision. Notably, feedback
achieves above random performance with just two training examples, while the
feedforward model requires at least four. Our findings demonstrate that
feedback enhances robustness and data efficiency, and offer a path toward more
adaptive and biologically inspired neural architectures. Code is available at:
github.com/DCalhas/feedback_segmentation.
comment: 12 pages
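A rough sketch of the output-to-input feedback loop, with softmax projection and exponential decay, might look as follows; the toy convolution stands in for the U-Net, and the step count and decay rate are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feedback_inference(net, image, num_classes, steps=5, decay=0.7):
    B, _, H, W = image.shape
    state = torch.zeros(B, num_classes, H, W)               # initial feedback state
    for _ in range(steps):
        logits = net(torch.cat([image, state], dim=1))      # prediction fed back as input
        projected = F.softmax(logits, dim=1)                # softmax projection bounds the state
        state = decay * state + (1 - decay) * projected     # exponential decay stabilizes the loop
    return state

net = nn.Conv2d(3 + 2, 2, kernel_size=3, padding=1)         # toy stand-in for a U-Net
out = feedback_inference(net, torch.randn(1, 3, 64, 64), num_classes=2)
```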
☆ Probabilistic Human Intent Prediction for Mobile Manipulation: An Evaluation with Human-Inspired Constraints
Accurate inference of human intent enables human-robot collaboration without
constraining human control or causing conflicts between humans and robots. We
present GUIDER (Global User Intent Dual-phase Estimation for Robots), a
probabilistic framework that enables a robot to estimate the intent of human
operators. GUIDER maintains two coupled belief layers, one tracking navigation
goals and the other manipulation goals. In the Navigation phase, a Synergy Map
blends controller velocity with an occupancy grid to rank interaction areas.
Upon arrival at a goal, an autonomous multi-view scan builds a local 3D cloud.
The Manipulation phase combines U2Net saliency, FastSAM instance saliency, and
three geometric grasp-feasibility tests, with an end-effector kinematics-aware
update rule that evolves object probabilities in real time. GUIDER can
recognize areas and objects of intent without predefined goals. We evaluated
GUIDER on 25 trials (five participants x five task variants) in Isaac Sim, and
compared it with two baselines, one for navigation and one for manipulation.
Across the 25 trials, GUIDER achieved a median stability of 93-100% during
navigation, compared with 60-100% for the BOIR baseline, with an improvement of
39.5% in a redirection scenario (T5). During manipulation, stability reached
94-100% (versus 69-100% for Trajectron), with a 31.4% difference in a
redirection task (T3). In geometry-constrained trials (manipulation), GUIDER
recognized the object intent three times earlier than Trajectron (median
remaining time to confident prediction 23.6 s vs 7.8 s). These results validate
our dual-phase framework and show improvements in intent inference in both
phases of mobile manipulation tasks.
comment: Submitted to Journal of Intelligent & Robotic Systems (Under Review)
☆ Taming Modern Point Tracking for Speckle Tracking Echocardiography via Impartial Motion ICCV 2025
Accurate motion estimation for tracking deformable tissues in
echocardiography is essential for precise cardiac function measurements. While
traditional methods like block matching or optical flow struggle with intricate
cardiac motion, modern point tracking approaches remain largely underexplored
in this domain. This work investigates the potential of state-of-the-art (SOTA)
point tracking methods for ultrasound, with a focus on echocardiography.
Although these novel approaches demonstrate strong performance in general
videos, their effectiveness and generalizability in echocardiography remain
limited. By analyzing cardiac motion throughout the heart cycle in real B-mode
ultrasound videos, we identify that a directional motion bias across different
views is affecting the existing training strategies. To mitigate this, we
refine the training procedure and incorporate a set of tailored augmentations
to reduce the bias and enhance tracking robustness and generalization through
impartial cardiac motion. We also propose a lightweight network leveraging
multi-scale cost volumes from spatial context alone to challenge the advanced
spatiotemporal point tracking models. Experiments demonstrate that fine-tuning
with our strategies significantly improves models' performances over their
baselines, even for out-of-distribution (OOD) cases. For instance, EchoTracker
boosts overall position accuracy by 60.7% and reduces median trajectory error
by 61.5% across heart cycle phases. Interestingly, several point tracking
models fail to outperform our proposed simple model in terms of tracking
accuracy and generalization, reflecting their limitations when applied to
echocardiography. Nevertheless, clinical evaluation reveals that these methods
improve GLS measurements, aligning more closely with expert-validated,
semi-automated tools and thus demonstrating better reproducibility in
real-world applications.
comment: Accepted to CVAMD workshop at ICCV 2025
☆ DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation ICCV 2025
Pixel-level annotation is expensive and time-consuming. Semi-supervised
segmentation methods address this challenge by learning models on few labeled
images alongside a large corpus of unlabeled images. Although foundation models
could further account for label scarcity, effective mechanisms for their
exploitation remain underexplored. We address this by devising a novel
semi-supervised panoptic approach fueled by two dedicated foundation models. We
enhance recognition by complementing unsupervised mask-transformer consistency
with zero-shot classification of CLIP features. We enhance localization by
class-agnostic decoder warm-up with respect to SAM pseudo-labels. The resulting
decoupled enhancement of recognition and localization (DEARLi) particularly
excels in the most challenging semi-supervised scenarios with large taxonomies
and limited labeled data. Moreover, DEARLi outperforms the state of the art in
semi-supervised semantic segmentation by a large margin while requiring 8x less
GPU memory, in spite of being trained only for the panoptic objective. We
observe 29.9 PQ and 38.9 mIoU on ADE20K with only 158 labeled images. The
source code is available at https://github.com/helen1c/DEARLi.
comment: ICCV 2025 Findings Workshop
☆ Glance-MCMT: A General MCMT Framework with Glance Initialization and Progressive Association
We propose a multi-camera multi-target (MCMT) tracking framework that ensures
consistent global identity assignment across views using trajectory and
appearance cues. The pipeline starts with BoT-SORT-based single-camera
tracking, followed by an initial glance phase to initialize global IDs via
trajectory-feature matching. In later frames, new tracklets are matched to
existing global identities through a prioritized global matching strategy. New
global IDs are only introduced when no sufficiently similar trajectory or
feature match is found. 3D positions are estimated using depth maps and
calibration for spatial validation.
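The prioritized global matching step could be sketched with a standard Hungarian assignment over a combined trajectory/appearance cost matrix; the cost construction and the new-ID threshold below are illustrative assumptions.
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracklets(cost, new_id_threshold=0.6):
    rows, cols = linear_sum_assignment(cost)       # Hungarian assignment over combined costs
    assignments, next_new = {}, cost.shape[1]
    for r, c in zip(rows, cols):
        if cost[r, c] < new_id_threshold:
            assignments[r] = c                     # sufficiently similar: reuse global ID
        else:
            assignments[r] = next_new              # otherwise introduce a new global ID
            next_new += 1
    return assignments

cost = np.random.rand(3, 4)                        # 3 new tracklets vs. 4 existing global IDs
print(match_tracklets(cost))
```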
☆ FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
CLIP has shown promising performance across many short-text tasks in a
zero-shot manner. However, limited by the input length of the text encoder,
CLIP struggles on downstream tasks with long-text inputs (>77 tokens). To
remedy this issue, we propose FIX-CLIP which includes three novel modules: (1)
A dual-branch training pipeline that aligns short and long texts with masked
and raw images respectively, which boosts the long-text representation while
preserving the short-text ability. (2) Multiple learnable regional prompts with
unidirectional masks in Transformer layers for regional information extraction.
(3) A hierarchical feature alignment module in the intermediate encoder layers
to promote the consistency of multi-scale features. Furthermore, we collect 30M
images and utilize existing MLLMs to synthesize long-text captions for
training. Extensive experiments show that FIX-CLIP achieves state-of-the-art
performance on both long-text and short-text retrieval benchmarks. For
downstream applications, we reveal that FIX-CLIP's text encoder delivers
promising performance in a plug-and-play manner for diffusion models with
long-text input.
☆ A Transfer Learning-Based Method for Water Body Segmentation in Remote Sensing Imagery: A Case Study of the Zhada Tulin Area
To address the prevalent challenges of domain shift and small sample sizes in
remote sensing image water body segmentation, this study proposes and validates
a two-stage transfer learning strategy based on the SegFormer model. The
approach begins by training a foundational segmentation model on a diverse
source domain, where it achieves an Intersection over Union (IoU) of 68.80% on
its validation set, followed by fine-tuning on data from the distinct target
domain. Focusing on the Zhada Tulin area in Tibet -- a region characterized by
highly complex topography and spectral features -- the experimental results
demonstrate that this strategy significantly boosts the IoU for the water body
segmentation task from 25.50% (for direct transfer) to 64.84%. This not only
effectively resolves the model performance degradation caused by domain
discrepancy but also provides an effective technical paradigm for
high-precision thematic information extraction in data-scarce and
environmentally unique remote sensing scenarios.
comment: 13 pages, 6 figures, 2 tables
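A hedged sketch of the second-stage fine-tuning, using the Hugging Face SegFormer implementation, is shown below; the checkpoint name, class count, learning rate, and synthetic batch are assumptions rather than the study's exact configuration.
```python
import torch
from transformers import SegformerForSemanticSegmentation

# Assumed setup: a SegFormer backbone (here the public "nvidia/mit-b0" weights)
# is fine-tuned on a small target-domain set with binary water/background labels.
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b0", num_labels=2, ignore_mismatched_sizes=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)
pixel_values = torch.randn(2, 3, 512, 512)                  # stand-in target-domain batch
labels = torch.randint(0, 2, (2, 512, 512))

outputs = model(pixel_values=pixel_values, labels=labels)   # loss computed internally
outputs.loss.backward()
optimizer.step()
```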
☆ Frequency Regulation for Exposure Bias Mitigation in Diffusion Models
Diffusion models exhibit impressive generative capabilities but are
significantly impacted by exposure bias. In this paper, we make a key
observation: the energy of the predicted noisy images decreases during the
diffusion process. Building on this, we identify two important findings: 1) The
reduction in energy follows distinct patterns in the low-frequency and
high-frequency subbands; 2) This energy reduction results in amplitude
variations between the network-reconstructed clean data and the real clean
data. Based on the first finding, we introduce a frequency-domain regulation
mechanism utilizing wavelet transforms, which separately adjusts the low- and
high-frequency subbands. Leveraging the second insight, we provide a more
accurate analysis of exposure bias in the two subbands. Our method is
training-free and plug-and-play, significantly improving the generative quality
of various diffusion models and providing a robust solution to exposure bias
across different model architectures. The source code is available at
https://github.com/kunzhan/wpp.
comment: ACM Multimedia 2025 accepted!
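The per-subband adjustment can be illustrated with a single 2D wavelet decomposition and reconstruction; the scale factors below are placeholders, not the schedule derived in the paper.
```python
import numpy as np
import pywt

def regulate_subbands(x0_pred, low_scale=1.02, high_scale=0.98, wavelet="haar"):
    cA, (cH, cV, cD) = pywt.dwt2(x0_pred, wavelet)
    cA = low_scale * cA                                     # adjust the low-frequency subband
    cH, cV, cD = (high_scale * c for c in (cH, cV, cD))     # adjust the high-frequency subbands
    return pywt.idwt2((cA, (cH, cV, cD)), wavelet)

img = np.random.rand(64, 64)                                # stand-in predicted clean image
out = regulate_subbands(img)
```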
☆ LayLens: Improving Deepfake Understanding through Simplified Explanations
This demonstration paper presents $\mathbf{LayLens}$, a tool that aims to make
deepfake understanding easier for users of all educational backgrounds. While
prior works often rely on outputs containing technical jargon, LayLens bridges
the gap between model reasoning and human understanding through a three-stage
pipeline: (1) explainable deepfake detection using a state-of-the-art forgery
localization model, (2) natural language simplification of technical
explanations using a vision-language model, and (3) visual reconstruction of a
plausible original image via guided image editing. The interface presents both
technical and layperson-friendly explanations in addition to a side-by-side
comparison of the uploaded and reconstructed images. A user study with 15
participants shows that simplified explanations significantly improve clarity
and reduce cognitive load, with most users expressing increased confidence in
identifying deepfakes. LayLens offers a step toward transparent, trustworthy,
and user-centric deepfake forensics.
☆ MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
We present MoVieS, a novel feed-forward model that synthesizes 4D dynamic
novel views from monocular videos in one second. MoVieS represents dynamic 3D
scenes using pixel-aligned grids of Gaussian primitives, explicitly supervising
their time-varying motion. This allows, for the first time, the unified
modeling of appearance, geometry and motion, and enables view synthesis,
reconstruction and 3D point tracking within a single learning-based framework.
By bridging novel view synthesis with dynamic geometry reconstruction, MoVieS
enables large-scale training on diverse datasets with minimal dependence on
task-specific supervision. As a result, it also naturally supports a wide range
of zero-shot applications, such as scene flow estimation and moving object
segmentation. Extensive experiments validate the effectiveness and efficiency
of MoVieS across multiple tasks, achieving competitive performance while
offering several orders of magnitude speedups.
comment: Project page: https://chenguolin.github.io/projects/MoVieS
☆ Lightweight Model for Poultry Disease Detection from Fecal Images Using Multi-Color Space Feature Optimization and Machine Learning
Poultry farming is a vital component of the global food supply chain, yet it
remains highly vulnerable to infectious diseases such as coccidiosis,
salmonellosis, and Newcastle disease. This study proposes a lightweight machine
learning-based approach to detect these diseases by analyzing poultry fecal
images. We utilize multi-color space feature extraction (RGB, HSV, LAB) and
explore a wide range of color, texture, and shape-based descriptors, including
color histograms, local binary patterns (LBP), wavelet transforms, and edge
detectors. Through a systematic ablation study and dimensionality reduction
using PCA and XGBoost feature selection, we identify a compact global feature
set that balances accuracy and computational efficiency. An artificial neural
network (ANN) classifier trained on these features achieved 95.85% accuracy
while requiring no GPU and only 638 seconds of execution time in Google Colab.
Compared to deep learning models such as Xception and MobileNetV3, our proposed
model offers comparable accuracy with drastically lower resource usage. This
work demonstrates a cost-effective, interpretable, and scalable alternative to
deep learning for real-time poultry disease detection in low-resource
agricultural settings.
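A compact sketch of such a handcrafted pipeline, combining multi-color-space histograms, LBP texture features, PCA, and a small neural classifier, is given below; descriptor choices, bin counts, and model sizes are assumptions for illustration.
```python
import numpy as np
import cv2
from skimage.feature import local_binary_pattern
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def extract_features(bgr):
    feats = []
    for space in (cv2.COLOR_BGR2RGB, cv2.COLOR_BGR2HSV, cv2.COLOR_BGR2LAB):
        img = cv2.cvtColor(bgr, space)
        for ch in range(3):
            hist = cv2.calcHist([img], [ch], None, [16], [0, 256]).ravel()
            feats.append(hist / (hist.sum() + 1e-8))        # per-channel color histogram
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    feats.append(lbp_hist)                                  # texture descriptor
    return np.concatenate(feats)

X = np.stack([extract_features(np.random.randint(0, 256, (128, 128, 3), np.uint8))
              for _ in range(20)])                          # toy images in place of fecal photos
y = np.random.randint(0, 4, 20)                             # toy labels for 4 classes
X_red = PCA(n_components=10).fit_transform(X)               # dimensionality reduction
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X_red, y)
```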
☆ CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books
This paper introduces CoSMo, a novel multimodal Transformer for Page Stream
Segmentation (PSS) in comic books, a critical task for automated content
understanding, as it is a necessary first stage for many downstream tasks like
character analysis, story indexing, or metadata enrichment. We formalize PSS
for this unique medium and curate a new 20,800-page annotated dataset. CoSMo,
developed in vision-only and multimodal variants, consistently outperforms
traditional baselines and significantly larger general-purpose vision-language
models across F1-Macro, Panoptic Quality, and stream-level metrics. Our
findings highlight the dominance of visual features for comic PSS
macro-structure, yet demonstrate multimodal benefits in resolving challenging
ambiguities. CoSMo establishes a new state-of-the-art, paving the way for
scalable comic book analysis.
☆ LifelongPR: Lifelong knowledge fusion for point cloud place recognition based on replay and prompt learning
Point cloud place recognition (PCPR) plays a crucial role in photogrammetry
and robotics applications such as autonomous driving, intelligent
transportation, and augmented reality. In real-world large-scale deployments of
a positioning system, PCPR models must continuously acquire, update, and
accumulate knowledge to adapt to diverse and dynamic environments, i.e., the
ability known as continual learning (CL). However, existing PCPR models often
suffer from catastrophic forgetting, leading to significant performance
degradation in previously learned scenes when adapting to new environments or
sensor types. This results in poor model scalability, increased maintenance
costs, and system deployment difficulties, undermining the practicality of
PCPR. To address these issues, we propose LifelongPR, a novel continual
learning framework for PCPR, which effectively extracts and fuses knowledge
from sequential point cloud data. First, to alleviate the knowledge loss, we
propose a replay sample selection method that dynamically allocates sample
sizes according to each dataset's information quantity and selects spatially
diverse samples for maximal representativeness. Second, to handle domain
shifts, we design a prompt learning-based CL framework with a lightweight
prompt module and a two-stage training strategy, enabling domain-specific
feature adaptation while minimizing forgetting. Comprehensive experiments on
large-scale public and self-collected datasets are conducted to validate the
effectiveness of the proposed method. Compared with state-of-the-art (SOTA)
methods, our method achieves 6.50% improvement in mIR@1, 7.96% improvement in
mR@1, and an 8.95% reduction in F. The code and pre-trained models are publicly
available at https://github.com/zouxianghong/LifelongPR.
☆ Memory-Efficient Personalization of Text-to-Image Diffusion Models via Selective Optimization Strategies
Memory-efficient personalization is critical for adapting text-to-image
diffusion models while preserving user privacy and operating within the limited
computational resources of edge devices. To this end, we propose a selective
optimization framework that adaptively chooses between backpropagation on
low-resolution images (BP-low) and zeroth-order optimization on high-resolution
images (ZO-high), guided by the characteristics of the diffusion process. As
observed in our experiments, BP-low efficiently adapts the model to
target-specific features, but suffers from structural distortions due to
resolution mismatch. Conversely, ZO-high refines high-resolution details with
minimal memory overhead but faces slow convergence when applied without prior
adaptation. By complementing both methods, our framework leverages BP-low for
effective personalization while using ZO-high to maintain structural
consistency, achieving memory-efficient and high-quality fine-tuning. To
maximize the efficacy of both BP-low and ZO-high, we introduce a timestep-aware
probabilistic function that dynamically selects the appropriate optimization
strategy based on diffusion timesteps. This function mitigates the overfitting
from BP-low at high timesteps, where structural information is critical, while
ensuring ZO-high is applied more effectively as training progresses.
Experimental results demonstrate that our method achieves competitive
performance while significantly reducing memory consumption, enabling scalable,
high-quality on-device personalization without increasing inference latency.
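A toy illustration of a timestep-aware selection rule is shown below; the sigmoid schedule and its sharpness are assumptions consistent with the described behaviour (preferring ZO-high at high timesteps, where structure matters, and BP-low at low timesteps), not the paper's exact probabilistic function.
```python
import math
import random

def choose_strategy(t, t_max=1000, sharpness=6.0):
    # Probability of BP-low decays with timestep, so ZO-high dominates at high t.
    p_bp_low = 1.0 / (1.0 + math.exp(sharpness * (t / t_max - 0.5)))
    return "BP-low" if random.random() < p_bp_low else "ZO-high"

counts = {"BP-low": 0, "ZO-high": 0}
for t in range(0, 1000, 10):
    counts[choose_strategy(t)] += 1
print(counts)   # roughly balanced overall, skewed by timestep
```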
☆ (Almost) Free Modality Stitching of Foundation Models
Foundation multi-modal models are often designed by stitching together multiple
existing pretrained uni-modal models: for example, an image classifier with an
autoregressive text model. This stitching process is performed by training a
connector module that aims to align the representation-representation or
representation-input spaces of these uni-modal models. However, given the
complexity of training such connectors on large scale web-based datasets
coupled with the ever-increasing number of available pretrained uni-modal
models, the task of uni-modal model selection and subsequent connector module
training becomes computationally demanding. To address this under-studied
critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel
all-in-one solution for optimal uni-modal model selection and connector
training by leveraging hypernetworks. Specifically, our framework utilizes the
parameter prediction capability of a hypernetwork to obtain jointly trained
connector modules for $N \times M$ combinations of uni-modal models. In our
experiments, Hyma reduces the optimal uni-modal model pair search cost by
$10\times$ (averaged across all experiments), while matching the ranking and
trained connector performance obtained via grid search across a suite of
diverse multi-modal benchmarks.
comment: Pre-print
☆ Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect
Recent advances in multimodal models have raised questions about whether
vision-and-language models (VLMs) integrate cross-modal information in ways
that reflect human cognition. One well-studied test case in this domain is the
bouba-kiki effect, where humans reliably associate pseudowords like "bouba"
with round shapes and "kiki" with jagged ones. Given the mixed evidence found
in prior studies for this effect in VLMs, we present a comprehensive
re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer
(ViT), given their centrality in many state-of-the-art VLMs. We apply two
complementary methods closely modelled after human experiments: a prompt-based
evaluation that uses output probabilities as a measure of model preference, and
a Grad-CAM-based analysis as a novel way to interpret visual attention in
shape-word matching tasks. Our
findings show that these models do not consistently exhibit the bouba-kiki
effect. While ResNet shows a preference for round shapes, overall performance
across both models lacks the expected associations. Moreover, direct comparison
with prior human data on the same task shows that the models' responses fall
markedly short of the robust, modality-integrated behaviour characteristic of
human cognition. These results contribute to the ongoing debate about the
extent to which VLMs truly understand cross-modal concepts, highlighting
limitations in their internal representations and alignment with human
intuitions.
☆ Binomial Self-Compensation: Mechanism and Suppression of Motion Error in Phase-Shifting Profilometry
Phase shifting profilometry (PSP) is widely used in high-precision 3D
scanning due to its high accuracy, robustness, and pixel-wise handling.
However, a fundamental assumption of PSP that the object should remain static
does not hold in dynamic measurement, making PSP susceptible to object motion.
To address this challenge, our proposed solution, phase-sequential binomial
self-compensation (P-BSC), sums successive motion-affected phase frames
weighted by binomial coefficients. This approach exponentially reduces the
motion error in a pixel-wise and frame-wise loopable manner. Despite its
efficacy, P-BSC suffers from high computational overhead and error accumulation
due to its reliance on multi-frame phase calculations and weighted summations.
Inspired by P-BSC, we propose an image-sequential binomial self-compensation
(I-BSC) that computes a weighted sum of the homogeneous fringe images instead of
successive phase frames, which generalizes the BSC concept from phase sequences to image
sequences. I-BSC computes the arctangent function only once, resolving both
limitations in P-BSC. Extensive analysis, simulations, and experiments show
that 1) the proposed BSC outperforms existing methods in reducing motion error
while achieving a quasi-single-shot frame rate, i.e., depth map frame rate
equal to the camera's acquisition rate, enabling 3D reconstruction with high
pixel-depth-temporal resolution; 2) compared to P-BSC, our I-BSC reduces the
computational complexity by one polynomial order, thereby accelerating the
computational frame rate by several to a dozen times, while also achieving faster
motion error convergence.
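The core weighted-summation step can be sketched numerically: successive frames are combined with normalized binomial coefficients. The synthetic arrays below stand in for phase or fringe data.
```python
import numpy as np
from math import comb

def binomial_self_compensate(frames):
    n = len(frames) - 1
    weights = np.array([comb(n, k) for k in range(n + 1)], dtype=float)
    weights /= weights.sum()                          # normalized binomial coefficients
    return np.tensordot(weights, np.stack(frames), axes=1)

frames = [np.random.rand(4, 4) for _ in range(4)]     # stand-in for successive frames
compensated = binomial_self_compensate(frames)
```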
☆ Vision-Based Anti Unmanned Aerial Technology: Opportunities and Challenges
With the rapid advancement of UAV technology and its extensive application in
various fields such as military reconnaissance, environmental monitoring, and
logistics, achieving efficient and accurate Anti-UAV tracking has become
essential. The importance of Anti-UAV tracking is increasingly prominent,
especially in scenarios such as public safety, border patrol, search and
rescue, and agricultural monitoring, where operations in complex environments
can provide enhanced security. Current mainstream Anti-UAV tracking
technologies are primarily centered around computer vision techniques,
particularly those that integrate multi-sensor data fusion with advanced
detection and tracking algorithms. This paper first reviews the characteristics
and current challenges of Anti-UAV detection and tracking technologies. Next,
it investigates and compiles several publicly available datasets, providing
accessible links to support researchers in efficiently addressing related
challenges. Furthermore, the paper analyzes the major vision-based and
vision-fusion-based Anti-UAV detection and tracking algorithms proposed in
recent years. Finally, based on the above research, this paper outlines future
research directions, aiming to provide valuable insights for advancing the
field.
☆ Leveraging Swin Transformer for enhanced diagnosis of Alzheimer's disease using multi-shell diffusion MRI
Objective: This study aims to support early diagnosis of Alzheimer's disease
and detection of amyloid accumulation by leveraging the microstructural
information available in multi-shell diffusion MRI (dMRI) data, using a vision
transformer-based deep learning framework.
Methods: We present a classification pipeline that employs the Swin
Transformer, a hierarchical vision transformer model, on multi-shell dMRI data
for the classification of Alzheimer's disease and amyloid presence. Key metrics
from DTI and NODDI were extracted and projected onto 2D planes to enable
transfer learning with ImageNet-pretrained models. To efficiently adapt the
transformer to limited labeled neuroimaging data, we integrated Low-Rank
Adaptation. We assessed the framework on diagnostic group prediction
(cognitively normal, mild cognitive impairment, Alzheimer's disease dementia)
and amyloid status classification.
Results: The framework achieved competitive classification results within the
scope of multi-shell dMRI-based features, with the best balanced accuracy of
95.2% for distinguishing cognitively normal individuals from those with
Alzheimer's disease dementia using NODDI metrics. For amyloid detection, it
reached 77.2% balanced accuracy in distinguishing amyloid-positive mild
cognitive impairment/Alzheimer's disease dementia subjects from
amyloid-negative cognitively normal subjects, and 67.9% for identifying
amyloid-positive individuals among cognitively normal subjects. Grad-CAM-based
explainability analysis identified clinically relevant brain regions, including
the parahippocampal gyrus and hippocampus, as key contributors to model
predictions.
Conclusion: This study demonstrates the promise of diffusion MRI and
transformer-based architectures for early detection of Alzheimer's disease and
amyloid pathology, supporting biomarker-driven diagnostics in data-limited
biomedical settings.
☆ Graph-based Multi-Modal Interaction Lightweight Network for Brain Tumor Segmentation (GMLN-BTS) in Edge Iterative MRI Lesion Localization System (EdgeIMLocSys)
Brain tumor segmentation plays a critical role in clinical diagnosis and
treatment planning, yet the variability in imaging quality across different MRI
scanners presents significant challenges to model generalization. To address
this, we propose the Edge Iterative MRI Lesion Localization System
(EdgeIMLocSys), which integrates Continuous Learning from Human Feedback to
adaptively fine-tune segmentation models based on clinician feedback, thereby
enhancing robustness to scanner-specific imaging characteristics. Central to
this system is the Graph-based Multi-Modal Interaction Lightweight Network for
Brain Tumor Segmentation (GMLN-BTS), which employs a Modality-Aware Adaptive
Encoder (M2AE) to extract multi-scale semantic features efficiently, and a
Graph-based Multi-Modal Collaborative Interaction Module (G2MCIM) to model
complementary cross-modal relationships via graph structures. Additionally, we
introduce a novel Voxel Refinement UpSampling Module (VRUM) that
synergistically combines linear interpolation and multi-scale transposed
convolutions to suppress artifacts while preserving high-frequency details,
improving segmentation boundary accuracy. Our proposed GMLN-BTS model achieves
a Dice score of 85.1% on the BraTS2017 dataset with only 4.58 million
parameters, representing a 98% reduction compared to mainstream 3D Transformer
models, and significantly outperforms existing lightweight approaches. This
work demonstrates a synergistic breakthrough in achieving high-accuracy,
resource-efficient brain tumor segmentation suitable for deployment in
resource-constrained clinical environments.
☆ 3DGAA: Realistic and Robust 3D Gaussian-based Adversarial Attack for Autonomous Driving WACV 2026
Camera-based object detection systems play a vital role in autonomous
driving, yet they remain vulnerable to adversarial threats in real-world
environments. While existing 2D and 3D physical attacks typically optimize
texture, they often struggle to balance physical realism and attack robustness.
In this work, we propose 3D Gaussian-based Adversarial Attack (3DGAA), a novel
adversarial object generation framework that leverages the full 14-dimensional
parameterization of 3D Gaussian Splatting (3DGS) to jointly optimize geometry
and appearance in physically realizable ways. Unlike prior works that rely on
patches or texture, 3DGAA jointly perturbs both geometric attributes (shape,
scale, rotation) and appearance attributes (color, opacity) to produce
physically realistic and transferable adversarial objects. We further introduce
a physical filtering module to preserve geometric fidelity, and a physical
augmentation module to simulate complex physical scenarios, thus enhancing
attack generalization under real-world conditions. We evaluate 3DGAA on both
virtual benchmarks and physical-world setups using miniature vehicle models.
Experimental results show that 3DGAA reduces the detection mAP from
87.21% to 7.38%, significantly outperforming existing 3D physical attacks.
Moreover, our method maintains high transferability across different physical
conditions, demonstrating a new state-of-the-art in physically realizable
adversarial attacks. These results validate 3DGAA as a practical attack
framework for evaluating the safety of perception systems in autonomous
driving.
comment: Submitted to WACV 2026
☆ Latent Diffusion Models with Masked AutoEncoders
In spite of the remarkable potential of Latent Diffusion Models (LDMs) in
image generation, the desired properties and optimal design of the autoencoders
have been underexplored. In this work, we analyze the role of autoencoders in
LDMs and identify three key properties: latent smoothness, perceptual
compression quality, and reconstruction quality. We demonstrate that existing
autoencoders fail to simultaneously satisfy all three properties, and propose
Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical
features maintained by Masked AutoEncoder. We integrate VMAEs into the LDM
framework, introducing Latent Diffusion Models with Masked AutoEncoders
(LDMAEs). Through comprehensive experiments, we demonstrate significantly
enhanced image generation quality and computational efficiency.
☆ Uncertainty Quantification for Incomplete Multi-View Data Using Divergence Measures
Existing multi-view classification and clustering methods typically improve
task accuracy by leveraging and fusing information from different views.
However, ensuring the reliability of multi-view integration and final decisions
is crucial, particularly when dealing with noisy or corrupted data. Current
methods often rely on Kullback-Leibler (KL) divergence to estimate uncertainty
of network predictions, ignoring domain gaps between different modalities. To
address this issue, KPHD-Net, based on H\"older divergence, is proposed for
multi-view classification and clustering tasks. Generally, our KPHD-Net employs
a variational Dirichlet distribution to represent class probability
distributions, models evidence from different views, and then integrates it
with Dempster-Shafer evidence theory (DST) to improve uncertainty estimation
effects. Our theoretical analysis demonstrates that Proper H\"older divergence
offers a more effective measure of distribution discrepancies, ensuring
enhanced performance in multi-view learning. Moreover, Dempster-Shafer evidence
theory, recognized for its superior performance in multi-view fusion tasks, is
introduced and combined with the Kalman filter to provide future state
estimations. This integration further enhances the reliability of the final
fusion results. Extensive experiments show that the proposed KPHD-Net
outperforms the current state-of-the-art methods in both classification and
clustering tasks regarding accuracy, robustness, and reliability, with
theoretical guarantees.
☆ A Brain Tumor Segmentation Method Based on CLIP and 3D U-Net with Cross-Modal Semantic Guidance and Multi-Level Feature Fusion
Precise segmentation of brain tumors from magnetic resonance imaging (MRI) is
essential for neuro-oncology diagnosis and treatment planning. Despite advances
in deep learning methods, automatic segmentation remains challenging due to
tumor morphological heterogeneity and complex three-dimensional spatial
relationships. Current techniques primarily rely on visual features extracted
from MRI sequences while underutilizing semantic knowledge embedded in medical
reports. This research presents a multi-level fusion architecture that
integrates pixel-level, feature-level, and semantic-level information,
facilitating comprehensive processing from low-level data to high-level
concepts. The semantic-level fusion pathway combines the semantic understanding
capabilities of Contrastive Language-Image Pre-training (CLIP) models with the
spatial feature extraction advantages of 3D U-Net through three mechanisms:
3D-2D semantic bridging, cross-modal semantic guidance, and semantic-based
attention mechanisms. Experimental validation on the BraTS 2020 dataset
demonstrates that the proposed model achieves an overall Dice coefficient of
0.8567, representing a 4.8% improvement compared to traditional 3D U-Net, with
a 7.3% Dice coefficient increase in the clinically important enhancing tumor
(ET) region.
comment: 13 pages,6 figures
☆ 4D-MISR: A unified model for low-dose super-resolution imaging via feature fusion
Zifei Wang, Zian Mao, Xiaoya He, Xi Huang, Haoran Zhang, Chun Cheng, Shufen Chu, Tingzheng Hou, Xiaoqin Zeng, Yujun Xie
While electron microscopy offers crucial atomic-resolution insights into
structure-property relationships, radiation damage severely limits its use on
beam-sensitive materials like proteins and 2D materials. To overcome this
challenge, we push beyond the electron dose limits of conventional electron
microscopy by adapting principles from multi-image super-resolution (MISR) that
have been widely used in remote sensing. Our method fuses multiple
low-resolution, sub-pixel-shifted views and enhances the reconstruction with a
convolutional neural network (CNN) that integrates features from synthetic,
multi-angle observations. We developed a dual-path, attention-guided network
for 4D-STEM that achieves atomic-scale super-resolution from ultra-low-dose
data. This provides robust atomic-scale visualization across amorphous,
semi-crystalline, and crystalline beam-sensitive specimens. Systematic
evaluations on representative materials demonstrate comparable spatial
resolution to conventional ptychography under ultra-low-dose conditions. Our
work expands the capabilities of 4D-STEM, offering a new and generalizable
method for the structural analysis of radiation-vulnerable materials.
☆ Can GPT-4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis
The fashion retail business is centered around the capacity to comprehend
products. Product attribution helps in comprehending products depending on the
business process. Quality attribution improves the customer experience as they
navigate through millions of products offered by a retail website. It leads to
well-organized product catalogs. In the end, product attribution directly
impacts the 'discovery experience' of the customer. Although large language
models (LLMs) have shown remarkable capabilities in understanding multimodal
data, their performance on fine-grained fashion attribute recognition remains
under-explored. This paper presents a zero-shot evaluation of state-of-the-art
LLMs that balance performance with speed and cost efficiency, mainly
GPT-4o-mini and Gemini 2.0 Flash. We have used the dataset
DeepFashion-MultiModal (https://github.com/yumingj/DeepFashion-MultiModal) to
evaluate these models in the attribution tasks of fashion products. Our study
evaluates these models across 18 categories of fashion attributes, offering
insight into where these models excel. We use images as the sole input for
product information to create a constrained environment. Our analysis shows
that Gemini 2.0 Flash demonstrates the strongest overall performance with a
macro F1 score of 56.79% across all attributes, while GPT-4o-mini scored a
macro F1 score of 43.28%. Through detailed error analysis, our findings provide
practical insights for deploying these LLMs in production e-commerce product
attribution-related tasks and highlight the need for domain-specific
fine-tuning approaches. This work also lays the groundwork for future research
in fashion AI and multimodal attribute extraction.
comment: 11 pages, 2 figures
☆ ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization
Dense audio-visual event localization (DAVE) aims to identify event
categories and locate the temporal boundaries in untrimmed videos. Most studies
only employ event-related semantic constraints on the final outputs, lacking
cross-modal semantic bridging in intermediate layers. This causes a modality
semantic gap in subsequent fusion, making it difficult to distinguish between
event-related content and irrelevant background content. Moreover, they rarely
consider the correlations between events, which limits the model to infer
concurrent events among complex scenarios. In this paper, we incorporate
multi-stage semantic guidance and multi-event relationship modeling, which
respectively enable hierarchical semantic understanding of audio-visual events
and adaptive extraction of event dependencies, thereby better focusing on
event-related information. Specifically, our event-aware semantic guided network
(ESG-Net) includes an early semantics interaction (ESI) module and a mixture of
dependency experts (MoDE) module. ESI applies multi-stage semantic guidance to
explicitly constrain the model in learning semantic information through
multi-modal early fusion and several classification loss functions, ensuring
hierarchical understanding of event-related content. MoDE promotes the
extraction of multi-event dependencies through multiple serial mixture of
experts with adaptive weight allocation. Extensive experiments demonstrate that
our method significantly surpasses the state-of-the-art methods, while greatly
reducing parameters and computational load. Our code will be released on
https://github.com/uchiha99999/ESG-Net.
☆ IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution ICCV 2025
Super-resolution (SR) has been a pivotal task in image processing, aimed at
enhancing image resolution across various applications. Recently, look-up table
(LUT)-based approaches have attracted interest due to their efficiency and
performance. However, these methods are typically designed for fixed scale
factors, making them unsuitable for arbitrary-scale image SR (ASISR). Existing
ASISR techniques often employ implicit neural representations, which come with
considerable computational cost and memory demands. To address these
limitations, we propose Interpolation Mixing LUT (IM-LUT), a novel framework
that operates ASISR by learning to blend multiple interpolation functions to
maximize their representational capacity. Specifically, we introduce IM-Net, a
network trained to predict mixing weights for interpolation functions based on
local image patterns and the target scale factor. To enhance efficiency of
interpolation-based methods, IM-Net is transformed into IM-LUT, where LUTs are
employed to replace computationally expensive operations, enabling lightweight
and fast inference on CPUs while preserving reconstruction quality.
Experimental results on several benchmark datasets demonstrate that IM-LUT
consistently achieves a superior balance between image quality and efficiency
compared to existing methods, highlighting its potential as a promising
solution for resource-constrained applications.
comment: ICCV 2025
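The interpolation-mixing idea (before the LUT conversion) can be sketched as predicting per-pixel softmax weights over a few standard interpolation modes and blending the upsampled candidates; the tiny weight predictor and the chosen modes below are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpolationMixer(nn.Module):
    def __init__(self, modes=("nearest", "bilinear", "bicubic")):
        super().__init__()
        self.modes = modes
        self.weight_net = nn.Conv2d(3, len(modes), kernel_size=3, padding=1)

    def forward(self, lr, scale=2.0):
        size = (int(lr.shape[-2] * scale), int(lr.shape[-1] * scale))
        candidates = [F.interpolate(lr, size=size, mode=m) for m in self.modes]
        weights = F.softmax(self.weight_net(lr), dim=1)            # per-pixel mixing weights
        weights = F.interpolate(weights, size=size, mode="nearest")
        return sum(w.unsqueeze(1) * c for w, c in
                   zip(weights.unbind(dim=1), candidates))         # blended SR output

sr = InterpolationMixer()(torch.randn(1, 3, 32, 32), scale=2.0)
```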
☆ Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios
The scarcity of data in various scenarios, such as medical, industry and
autonomous driving, leads to model overfitting and dataset imbalance, thus
hindering effective detection and segmentation performance. Existing studies
employ the generative models to synthesize more training samples to mitigate
data scarcity. However, these synthetic samples are repetitive or simplistic
and fail to provide "crucial information" that targets the downstream model's
weaknesses. Additionally, these methods typically require separate training for
different objects, leading to computational inefficiencies. To address these
issues, we propose Crucial-Diff, a domain-agnostic framework designed to
synthesize crucial samples. Our method integrates two key modules. The Scene
Agnostic Feature Extractor (SAFE) utilizes a unified feature extractor to
capture target information. The Weakness Aware Sample Miner (WASM) generates
hard-to-detect samples using feedback from the detection results of the
downstream model, which is then fused with the output of the SAFE module. Together, our
Crucial-Diff framework generates diverse, high-quality training data, achieving
a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec. On the polyp dataset,
Crucial-Diff reaches an mIoU of 81.64% and an mDice of 87.69%. Code will be
released after acceptance.
☆ IGD: Instructional Graphic Design with Multimodal Layer Generation ICCV 2025
Graphic design visually conveys information and data by creating and
combining text, images and graphics. Two-stage methods that rely primarily on
layout generation lack creativity and intelligence, leaving graphic design
labor-intensive. Existing diffusion-based methods generate non-editable graphic
design files at image level with poor legibility in visual text rendering,
which prevents them from achieving satisfactory and practical automated graphic
design. In this paper, we propose Instructional Graphic Designer (IGD) to
swiftly generate multimodal layers with editable flexibility from only natural
language instructions. IGD adopts a new paradigm that leverages parametric
rendering and image asset generation. First, we develop a design platform and
establish a standardized format for multi-scenario design files, thus laying
the foundation for scaling up data. Second, IGD utilizes the multimodal
understanding and reasoning capabilities of MLLM to accomplish attribute
prediction, sequencing and layout of layers. It also employs a diffusion model
to generate image content for assets. By enabling end-to-end training, IGD
architecturally supports scalability and extensibility in complex graphic
design tasks. The superior experimental results demonstrate that IGD offers a
new solution for graphic design.
comment: ICCV 2025
☆ Advanced U-Net Architectures with CNN Backbones for Automated Lung Cancer Detection and Segmentation in Chest CT Images
This study investigates the effectiveness of U-Net architectures integrated
with various convolutional neural network (CNN) backbones for automated lung
cancer detection and segmentation in chest CT images, addressing the critical
need for accurate diagnostic tools in clinical settings. A balanced dataset of
832 chest CT images (416 cancerous and 416 non-cancerous) was preprocessed
using Contrast Limited Adaptive Histogram Equalization (CLAHE) and resized to
128x128 pixels. U-Net models were developed with three CNN backbones: ResNet50,
VGG16, and Xception, to segment lung regions. After segmentation, CNN-based
classifiers and hybrid models combining CNN feature extraction with traditional
machine learning classifiers (Support Vector Machine, Random Forest, and
Gradient Boosting) were evaluated using 5-fold cross-validation. Metrics
included accuracy, precision, recall, F1-score, Dice coefficient, and ROC-AUC.
U-Net with ResNet50 achieved the best performance for cancerous lungs (Dice:
0.9495, Accuracy: 0.9735), while U-Net with VGG16 performed best for
non-cancerous segmentation (Dice: 0.9532, Accuracy: 0.9513). For
classification, the CNN model using U-Net with Xception achieved 99.1 percent
accuracy, 99.74 percent recall, and 99.42 percent F1-score. The hybrid
CNN-SVM-Xception model achieved 96.7 percent accuracy and 97.88 percent
F1-score. Compared to prior methods, our framework consistently outperformed
existing models. In conclusion, combining U-Net with advanced CNN backbones
provides a powerful method for both segmentation and classification of lung
cancer in CT scans, supporting early diagnosis and clinical decision-making.
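For the hybrid models, the evaluation step amounts to fitting a classical
classifier on deep features under 5-fold cross-validation; a minimal sklearn
sketch follows, with randomly generated features standing in for the pooled CNN
activations.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Placeholder deep features (e.g., pooled Xception/U-Net encoder activations)
    # and binary labels; shapes follow the balanced 832-image dataset.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(832, 2048))
    labels = rng.integers(0, 2, size=832)

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    scores = cross_val_score(clf, features, labels, cv=5, scoring="f1")
    print(f"5-fold F1: {scores.mean():.3f} +/- {scores.std():.3f}")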
comment: This manuscript has 20 pages and 10 figures. It is submitted to the
Journal 'Scientific Reports'
☆ Measuring the Impact of Rotation Equivariance on Aerial Object Detection ICCV 2025
Due to the arbitrary orientation of objects in aerial images, rotation
equivariance is a critical property for aerial object detectors. However,
recent studies on rotation-equivariant aerial object detection remain scarce.
Most detectors rely on data augmentation to enable models to learn
approximately rotation-equivariant features. A few detectors have constructed
rotation-equivariant networks, but because typical downsampling operations break
strict rotation equivariance, these networks achieve only approximately
rotation-equivariant backbones. Whether strict rotation
equivariance is necessary for aerial image object detection remains an open
question. In this paper, we implement a strictly rotation-equivariant backbone
and neck network with a more advanced network structure and compare it with
approximately rotation-equivariant networks to quantitatively measure the
impact of rotation equivariance on the performance of aerial image detectors.
Additionally, leveraging the inherently grouped nature of rotation-equivariant
features, we propose a multi-branch head network that reduces the parameter
count while improving detection accuracy. Based on the aforementioned
improvements, this study proposes the Multi-branch head rotation-equivariant
single-stage Detector (MessDet), which achieves state-of-the-art performance on
the challenging aerial image datasets DOTA-v1.0, DOTA-v1.5 and DIOR-R with an
exceptionally low parameter count.
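One simple way to measure how far a backbone is from strict rotation equivariance
is to compare rotated features of an image against features of the rotated image;
the check below is our illustration with an off-the-shelf ResNet-18 as a stand-in
backbone, not the paper's protocol.

    import torch
    import torchvision.models as models

    # Any convolutional backbone works here; ResNet-18 is only a stand-in.
    backbone = models.resnet18(weights=None).eval()
    feat = torch.nn.Sequential(*list(backbone.children())[:-2])   # spatial feature map

    x = torch.rand(1, 3, 256, 256)
    with torch.no_grad():
        rot_of_feat = torch.rot90(feat(x), k=1, dims=(-2, -1))    # rotate the features
        feat_of_rot = feat(torch.rot90(x, k=1, dims=(-2, -1)))    # rotate the input first

    # Zero for a strictly rotation-equivariant backbone (up to any accompanying
    # permutation of group channels); a nonzero value quantifies the violation.
    err = (rot_of_feat - feat_of_rot).abs().mean() / feat_of_rot.abs().mean()
    print(f"relative 90-degree equivariance error: {err.item():.3f}")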
comment: Accepted by ICCV 2025
☆ MCGA: Mixture of Codebooks Hyperspectral Reconstruction via Grayscale-Aware Attention
Reconstructing hyperspectral images (HSI) from RGB images is a cost-effective
solution for various vision-based applications. However, most existing
learning-based hyperspectral reconstruction methods directly learn the
RGB-to-HSI mapping using complex attention mechanisms, neglecting the inherent
challenge of transitioning from low-dimensional to high-dimensional
information. To address this limitation, we propose a two-stage approach, MCGA,
which first learns spectral patterns before estimating the mapping. In the
first stage, a multi-scale VQ-VAE learns representations from heterogeneous HSI
datasets, extracting a Mixture of Codebooks (MoC). In the second stage, the
RGB-to-HSI mapping is refined by querying features from the MoC to replace
latent HSI representations, incorporating prior knowledge rather than forcing a
direct high-dimensional transformation. To further enhance reconstruction
quality, we introduce Grayscale-Aware Attention and Quantized Self-Attention,
which adaptively adjust feature map intensities to meet hyperspectral
reconstruction requirements. This physically motivated attention mechanism
ensures lightweight and efficient HSI recovery. Moreover, we propose an
entropy-based Test-Time Adaptation strategy to improve robustness in real-world
scenarios. Extensive experiments demonstrate that our method, MCGA, achieves
state-of-the-art performance. The code and models will be released at
https://github.com/Fibonaccirabbit/MCGA
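The "query features from the MoC" step corresponds to a nearest-codeword lookup in
a learned codebook, as in a standard VQ layer; a minimal sketch follows, with
tensor sizes as assumptions.

    import torch

    def query_codebook(z, codebook):
        """Replace each latent vector with its nearest codeword.

        z:        (B, N, D) latent features predicted from the RGB branch
        codebook: (K, D)    learned spectral codewords (a Mixture of Codebooks would
                            concatenate codebooks learned from several HSI datasets)
        """
        d = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))  # (B, N, K)
        return codebook[d.argmin(dim=-1)]                                    # (B, N, D)

    z_q = query_codebook(torch.randn(2, 64, 128), torch.randn(512, 128))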
☆ Counterfactual Visual Explanation via Causally-Guided Adversarial Steering
Recent work on counterfactual visual explanations has contributed to making
artificial intelligence models more explainable by providing visual
perturbation to flip the prediction. However, these approaches neglect the
causal relationships and the spurious correlations behind the image generation
process, which often leads to unintended alterations in the counterfactual
images and renders the explanations with limited quality. To address this
challenge, we introduce a novel framework CECAS, which first leverages a
causally-guided adversarial method to generate counterfactual explanations. It
innovatively integrates a causal perspective to avoid unwanted perturbations on
spurious factors in the counterfactuals. Extensive experiments demonstrate that
our method outperforms existing state-of-the-art approaches across multiple
benchmark datasets and ultimately achieves a balanced trade-off among various
aspects of validity, sparsity, proximity, and realism.
☆ OpenHuman4D: Open-Vocabulary 4D Human Parsing
Understanding dynamic 3D human representation has become increasingly
critical in virtual and extended reality applications. However, existing human
part segmentation methods are constrained by reliance on closed-set datasets
and prolonged inference times, which significantly restrict their
applicability. In this paper, we introduce the first 4D human parsing framework
that simultaneously addresses these challenges by reducing the inference time
and introducing open-vocabulary capabilities. Building upon state-of-the-art
open-vocabulary 3D human parsing techniques, our approach extends the support
to 4D human-centric video with three key innovations: 1) We adopt mask-based
video object tracking to efficiently establish spatial and temporal
correspondences, avoiding the necessity of segmenting all frames. 2) A novel
Mask Validation module is designed to manage new target identification and
mitigate tracking failures. 3) We propose a 4D Mask Fusion module, integrating
memory-conditioned attention and logits equalization for robust embedding
fusion. Extensive experiments demonstrate the effectiveness and flexibility of
the proposed method on 4D human-centric parsing tasks, achieving up to 93.3%
acceleration compared to the previous state-of-the-art method, which was
limited to parsing fixed classes.
☆ ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models ACM MM 2025
Video understanding plays a vital role in bridging low-level visual signals
with high-level cognitive reasoning, and is fundamental to applications such as
autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid
development of large language models (LLMs), particularly those utilizing
Chain-of-Thought (CoT) technology, has significantly advanced video reasoning
capabilities. However, current approaches primarily depend on textual
information for reasoning, overlooking the visual modality in the actual video
reasoning process. In contrast, humans naturally re-examine visual content
while reasoning. Motivated by this, we introduce a novel video reasoning
paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive
and cognitively aligned reasoning. To this end, we first construct the
Video-Text Interleaved Benchmark (ViTIB), which is created using MLLMs for
key-video selection and manually verified. Furthermore, we extensively explore
the potential of the ViTCoT paradigm in the video understanding field.
Extensive experiments demonstrate that ViTCoT significantly enhances
performance compared to the traditional text-only CoT paradigm and effectively
activates more neurons in MLLMs.
comment: Accepted by ACM MM 2025
☆ Resolution Revolution: A Physics-Guided Deep Learning Framework for Spatiotemporal Temperature Reconstruction ICCV 2025
Central to Earth observation is the trade-off between spatial and temporal
resolution. For temperature, this is especially critical because real-world
applications require high spatiotemporal resolution data. Current technology
allows for hourly temperature observations at 2 km, but only every 16 days at
100 m, a gap further exacerbated by cloud cover. Earth system models offer
continuous hourly temperature data, but at a much coarser spatial resolution
(9-31 km). Here, we present a physics-guided deep learning framework for
temperature data reconstruction that integrates these two data sources. The
proposed framework uses a convolutional neural network that incorporates the
annual temperature cycle and includes a linear term to map the coarse Earth
system model output to the fine-scale temperature values observed from
satellites. We evaluated this framework using data from two satellites, GOES-16
(2 km, hourly) and Landsat (100 m, every 16 days), and demonstrated effective
temperature reconstruction with hold-out and in situ data across four datasets.
This physics-guided deep learning framework opens new possibilities for
generating high-resolution temperature data across spatial and temporal scales,
under all weather conditions and globally.
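The described structure (a linear term amplifying the coarse model output plus a
CNN conditioned on the annual temperature cycle) can be sketched roughly as
follows; layer sizes and input encodings are our assumptions.

    import math
    import torch
    import torch.nn as nn

    class PhysicsGuidedDownscaler(nn.Module):
        """Fine-scale temperature = linear term on the coarse input + CNN correction
        conditioned on the annual temperature cycle."""
        def __init__(self):
            super().__init__()
            self.alpha = nn.Parameter(torch.ones(1))   # linear amplification of coarse T
            self.beta = nn.Parameter(torch.zeros(1))
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 1, 3, padding=1))

        def forward(self, coarse_t, day_of_year):      # coarse_t: (B, 1, H, W)
            phase = 2 * math.pi * day_of_year / 365.25
            cyc = torch.stack([torch.sin(phase), torch.cos(phase)], dim=1)    # (B, 2)
            cyc = cyc[:, :, None, None].expand(-1, -1, *coarse_t.shape[-2:])  # (B, 2, H, W)
            correction = self.cnn(torch.cat([coarse_t, cyc], dim=1))
            return self.alpha * coarse_t + self.beta + correction

    t_fine = PhysicsGuidedDownscaler()(torch.rand(1, 1, 128, 128), torch.tensor([180.0]))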
comment: ICCV 2025 Workshop SEA -- International Conference on Computer Vision
2025 Workshop on Sustainability with Earth Observation and AI
☆ SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li
The rapid development of large-scale models has catalyzed significant
breakthroughs in the digital human domain. These advanced methodologies offer
high-fidelity solutions for avatar driving and rendering, leading academia to
focus on the next major challenge: audio-visual dyadic interactive virtual
human. To facilitate research in this emerging area, we present SpeakerVid-5M
dataset, the first large-scale, high-quality dataset designed for audio-visual
dyadic interactive virtual human generation. Totaling over 8,743 hours,
SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It
covers diverse scales and interaction types, including monadic talking,
listening, and dyadic conversations. Crucially, the dataset is structured along
two key dimensions: interaction type and data quality. First, it is categorized
into four types (dialogue branch, single branch, listening branch and
multi-turn branch) based on the interaction scenario. Second, it is stratified
into a large-scale pre-training subset and a curated, high-quality subset for
Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of
2D virtual human tasks. In addition, we provide an autoregressive (AR)-based
video chat baseline trained on this data, accompanied by a dedicated set of
metrics and test data to serve as a benchmark VidChatBench for future work.
Both the dataset and the corresponding data processing code will be publicly
released. Project page: https://dorniwang.github.io/SpeakerVid-5M/
☆ A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
Visually-Rich Document Understanding (VRDU) has emerged as a critical field,
driven by the need to automatically process documents containing complex
visual, textual, and layout information. Recently, Multimodal Large Language
Models (MLLMs) have shown remarkable potential in this domain, leveraging both
Optical Character Recognition (OCR)-dependent and OCR-free frameworks to
extract and interpret information in document images. This survey reviews
recent advancements in MLLM-based VRDU, highlighting three core components: (1)
methods for encoding and fusing textual, visual, and layout features; (2)
training paradigms, including pretraining strategies, instruction-response
tuning, and the trainability of different model modules; and (3) datasets
utilized for pretraining, instruction-tuning, and supervised fine-tuning.
Finally, we discuss the challenges and opportunities in this evolving field and
propose future directions to advance the efficiency, generalizability, and
robustness of VRDU systems.
comment: Work in progress
☆ Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction ICML 2025
Shu-wen Yang, Byeonggeun Kim, Kuan-Po Huang, Qingming Tang, Huy Phan, Bo-Ru Lu, Harsha Sundar, Shalini Ghosh, Hung-yi Lee, Chieh-Chi Kao, Chao Wang
Autoregressive next-token prediction with the Transformer decoder has become
a de facto standard in large language models (LLMs), achieving remarkable
success in Natural Language Processing (NLP) at scale. Extending this paradigm
to audio poses unique challenges due to its inherently continuous nature. We
research audio generation with a causal language model (LM) without discrete
tokens. We leverage token-wise diffusion to model the continuous distribution
of the next continuous-valued token. Our approach delivers significant
improvements over the previous discrete-token solution, AudioGen, achieving 20% and 40%
relative gains on AudioCaps in Frechet Audio Distance (FAD) and
Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a
novel masked next-token prediction task that incorporates masked prediction
into the causal LM framework. On AudioCaps, the innovation yields 41% and 33%
relative FAD improvements over AudioGen Base (285M) and AudioGen Large (1B)
models, respectively, and is on par with the state-of-the-art (SOTA) diffusion
models. Furthermore, we achieve these results with significantly fewer
parameters -- 193M for our Base and 462M for our Large models.
comment: Accepted by ICML 2025. Project website: https://audiomntp.github.io/
♻ ☆ Visual Test-time Scaling for GUI Agent Grounding ICCV2025
We introduce RegionFocus, a visual test-time scaling approach for Vision
Language Model Agents. Understanding webpages is challenging due to the visual
complexity of GUI images and the large number of interface elements, making
accurate action selection difficult. Our approach dynamically zooms in on
relevant regions, reducing background clutter and improving grounding accuracy.
To support this process, we propose an image-as-map mechanism that visualizes
key landmarks at each step, providing a transparent action record and enabling
the agent to choose effectively among action candidates. Even with a simple
region selection strategy, we observe significant performance gains of 28+% on
ScreenSpot-Pro and 24+% on WebVoyager benchmarks on top of two
state-of-the-art open vision language model agents, UI-TARS and Qwen2.5-VL,
highlighting the effectiveness of visual test-time scaling in interactive
settings. We achieve a new state-of-the-art grounding performance of 61.6% on
the ScreenSpot-Pro benchmark by applying RegionFocus to a Qwen2.5-VL-72B model.
Our code will be released publicly at https://github.com/tiangeluo/RegionFocus.
comment: ICCV2025, https://github.com/tiangeluo/RegionFocus
♻ ☆ UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment
Hantao Zhou, Longxiang Tang, Rui Yang, Guanyi Qin, Yan Zhang, Yutao Li, Xiu Li, Runze Hu, Guangtao Zhai
Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to
simulate human subjective perception of image visual quality and aesthetic
appeal. Despite distinct learning objectives, they have underlying
interconnectedness due to consistent human assessment perception. In this
paper, we propose Unified vision-language pre-training of Quality and
Aesthetics (UniQA), to extract useful and common representations from two
tasks, thereby benefiting them simultaneously. However, the lack of text in the
IQA datasets and the textual noise in the IAA datasets pose severe challenges
for multimodal pre-training. To address this, we (1) utilize multimodal large
language models (MLLMs) to generate high-quality text descriptions; (2) use the
generated text for IAA as metadata to purify noisy IAA data. To effectively
adapt the pre-trained UniQA to downstream tasks, we further propose a
lightweight adapter that utilizes versatile cues to fully exploit the extensive
knowledge of the pre-trained model. UniQA demonstrates high competitiveness in
various image assessment tasks, including classical IQA and IAA tasks,
few-label IQA, and other downstream tasks, showing promise as a foundational
assessment model. Codes are available at https://github.com/zht8506/UniQA.
♻ ☆ Gamma: Toward Generic Image Assessment with Mixture of Assessment Experts
Image assessment aims to evaluate the quality and aesthetics of images and
has been applied across various scenarios, such as natural and AIGC scenes.
Existing methods mostly address these sub-tasks or scenes individually. While
some works attempt to develop unified image assessment models, they have
struggled to achieve satisfactory performance or cover a broad spectrum of
assessment scenarios. In this paper, we present Gamma, a Generic imAge
assessMent model using Mixture of Assessment Experts, which can effectively assess
images from diverse scenes through mixed-dataset training. Achieving unified
training in image assessment presents significant challenges due to annotation
biases across different datasets. To address this issue, we first propose a
Mixture of Assessment Experts (MoAE) module, which employs shared and adaptive
experts to dynamically learn common and specific knowledge for different
datasets, respectively. In addition, we introduce a Scene-based Differential
Prompt (SDP) strategy, which uses scene-specific prompts to provide prior
knowledge and guidance during the learning process, further boosting adaptation
for various scenes. Our Gamma model is trained and evaluated on 12 datasets
spanning 6 image assessment scenarios. Extensive experiments show that our
unified Gamma outperforms other state-of-the-art mixed-training methods by
significant margins while covering more scenes. Codes are available at
https://github.com/zht8506/Gamma.
comment: Accepted to ACMMM 2025
♻ ☆ GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
We introduce GaussianOcc, a systematic method that investigates the two
usages of Gaussian splatting for fully self-supervised and efficient 3D
occupancy estimation in surround views. First, traditional methods for
self-supervised 3D occupancy estimation still require ground truth 6D poses
from sensors during training. To address this limitation, we propose a Gaussian
Splatting for Projection (GSP) module to provide accurate scale information for
fully self-supervised training from adjacent view projection. Additionally,
existing methods rely on volume rendering for final 3D voxel representation
learning using 2D signals (depth maps, semantic maps), which is both
time-consuming and less effective. We propose Gaussian Splatting from Voxel
space (GSV) to leverage the fast rendering properties of Gaussian splatting. As
a result, the proposed GaussianOcc method enables fully self-supervised (no
ground truth pose) 3D occupancy estimation in competitive performance with low
computational cost (2.7 times faster in training and 5 times faster in
rendering). The relevant code is available in
https://github.com/GANWANSHUI/GaussianOcc.git.
comment: Project page: https://ganwanshui.github.io/GaussianOcc/
♻ ☆ Alignment and Adversarial Robustness: Are More Human-Like Models More Secure? SP
A small but growing body of work has shown that machine learning models which
better align with human vision have also exhibited higher robustness to
adversarial examples, raising the question: can human-like perception make
models more secure? If true generally, such mechanisms would offer new avenues
toward robustness. In this work, we conduct a large-scale empirical analysis to
systematically investigate the relationship between representational alignment
and adversarial robustness. We evaluate 114 models spanning diverse
architectures and training paradigms, measuring their neural and behavioral
alignment and engineering task performance across 105 benchmarks as well as
their adversarial robustness via AutoAttack. Our findings reveal that while
average alignment and robustness exhibit a weak overall correlation, specific
alignment benchmarks serve as strong predictors of adversarial robustness,
particularly those that measure selectivity toward texture or shape. These
results suggest that different forms of alignment play distinct roles in model
robustness, motivating further investigation into how alignment-driven
approaches can be leveraged to build more secure and perceptually-grounded
vision models.
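The benchmark-level analysis described above boils down to correlating each
alignment score with robust accuracy across models; a sketch with synthetic
placeholder scores is given below.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    alignment = rng.uniform(size=(114, 105))   # models x alignment benchmarks (placeholders)
    robust_acc = rng.uniform(size=114)         # AutoAttack robust accuracy (placeholder)

    # Overall correlation with the average alignment score vs. per-benchmark predictors.
    overall, _ = spearmanr(alignment.mean(axis=1), robust_acc)
    per_benchmark = np.array([spearmanr(alignment[:, j], robust_acc)[0]
                              for j in range(alignment.shape[1])])
    print(overall, np.argsort(-np.abs(per_benchmark))[:5])   # strongest predictors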
comment: Accepted to International Workshop on Security and Privacy-Preserving
AI/ML (SPAIML) 2025
♻ ☆ On the Robustness Tradeoff in Fine-Tuning ICCV 2025
Kunyang Li, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Blaine Hoak, Yohan Beugin, Eric Pauley, Patrick McDaniel
Fine-tuning has become the standard practice for adapting pre-trained models
to downstream tasks. However, the impact on model robustness is not well
understood. In this work, we characterize the robustness-accuracy trade-off in
fine-tuning. We evaluate the robustness and accuracy of fine-tuned models over
6 benchmark datasets and 7 different fine-tuning strategies. We observe a
consistent trade-off between adversarial robustness and accuracy. Peripheral
updates such as BitFit are more effective for simple tasks -- over 75% above
the average measured by the area under the Pareto frontiers on CIFAR-10 and
CIFAR-100. In contrast, fine-tuning information-heavy layers, such as attention
layers via Compacter, achieves a better Pareto frontier on more complex tasks
-- 57.5% and 34.6% above the average on Caltech-256 and CUB-200, respectively.
Lastly, we observe that the robustness of fine-tuning against
out-of-distribution data closely tracks accuracy. These insights emphasize the
need for robustness-aware fine-tuning to ensure reliable real-world
deployments.
comment: Accepted to International Conference on Computer Vision, ICCV 2025
♻ ☆ Enabling Advanced Land Cover Analytics: An Integrated Data Extraction Pipeline for Predictive Modeling with the Dynamic World Dataset
Victor Radermecker, Andrea Zanon, Nancy Thomas, Annita Vapsi, Saba Rahimi, Rama Ramakrishnan, Daniel Borrajo
Understanding land cover holds considerable potential for a myriad of
practical applications, particularly as data accessibility transitions from
being exclusive to governmental and commercial entities to now including the
broader research community. Nevertheless, although the data is accessible to
any community member interested in exploration, there exists a formidable
learning curve and no standardized process for accessing, pre-processing, and
leveraging the data for subsequent tasks. In this study, we democratize this
data by presenting a flexible and efficient end-to-end pipeline for working
with the Dynamic World dataset, a cutting-edge near-real-time land use/land
cover (LULC) dataset. This includes a pre-processing and representation
framework which tackles noise removal, efficient extraction of large amounts of
data, and re-representation of LULC data in a format well suited for several
downstream tasks. To demonstrate the power of our pipeline, we use it to
extract data for an urbanization prediction problem and build a suite of
machine learning models with excellent performance. This task is easily
generalizable to the prediction of any type of land cover and our pipeline is
also compatible with a series of other downstream tasks.
♻ ☆ Random Erasing vs. Model Inversion: A Promising Defense or a False Hope?
Viet-Hung Tran, Ngoc-Bao Nguyen, Son T. Mai, Hans Vandierendonck, Ira Assent, Alex Kot, Ngai-Man Cheung
Model Inversion (MI) attacks pose a significant privacy threat by
reconstructing private training data from machine learning models. While
existing defenses primarily concentrate on model-centric approaches, the impact
of data on MI robustness remains largely unexplored. In this work, we explore
Random Erasing (RE), a technique traditionally used for improving model
generalization under occlusion, and uncover its surprising effectiveness as a
defense against MI attacks. Specifically, our novel feature space analysis
shows that models trained with RE-images introduce a significant discrepancy
between the features of MI-reconstructed images and those of the private data.
At the same time, features of private images remain distinct from other classes
and well-separated from different classification regions. These effects
collectively degrade MI reconstruction quality and attack accuracy while
maintaining reasonable natural accuracy. Furthermore, we explore two critical
properties of RE including Partial Erasure and Random Location. Partial Erasure
prevents the model from observing entire objects during training. We find this
has a significant impact on MI, which aims to reconstruct the entire objects.
Random Location of erasure plays a crucial role in achieving a strong
privacy-utility trade-off. Our findings highlight RE as a simple yet effective
defense mechanism that can be easily integrated with existing
privacy-preserving techniques. Extensive experiments across 37 setups
demonstrate that our method achieves state-of-the-art (SOTA) performance in the
privacy-utility trade-off. The results consistently demonstrate the superiority
of our defense over existing methods across different MI attacks, network
architectures, and attack configurations. For the first time, we achieve a
significant degradation in attack accuracy without a decrease in utility for
some configurations.
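Random Erasing itself is available off the shelf in torchvision; a minimal sketch
of adding it to a training transform (partial erasure at a random location)
follows, with parameter values as assumptions rather than the paper's settings.

    import torch
    from PIL import Image
    from torchvision import transforms

    # Training-time augmentation: each image may get a random rectangle erased, so the
    # model rarely observes whole objects (Partial Erasure at a Random Location).
    train_tf = transforms.Compose([
        transforms.RandomResizedCrop(64),
        transforms.ToTensor(),
        transforms.RandomErasing(p=0.5, scale=(0.02, 0.33), ratio=(0.3, 3.3), value=0),
    ])

    x = train_tf(Image.new("RGB", (128, 128), (120, 120, 120)))  # tensor, possibly with a zeroed patch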
comment: Accepted in Transactions on Machine Learning Research (TMLR). First
two authors contributed equally
♻ ☆ Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal ICCV2025
We present a diffusion-based portrait shadow removal approach that can
robustly produce high-fidelity results. Unlike previous methods, we cast shadow
removal as diffusion-based inpainting. To this end, we first train a
shadow-independent structure extraction network on a real-world portrait
dataset with various synthetic lighting conditions, which allows to generate a
shadow-independent structure map including facial details while excluding the
unwanted shadow boundaries. The structure map is then used as condition to
train a structure-guided inpainting diffusion model for removing shadows in a
generative manner. Finally, to restore the fine-scale details (e.g., eyelashes,
moles and spots) that may not be captured by the structure map, we take the
gradients inside the shadow regions as guidance and train a detail restoration
diffusion model to refine the shadow removal result. Extensive experiments on
the benchmark datasets show that our method clearly outperforms existing
methods, and is effective to avoid previously common issues such as facial
identity tampering, shadow residual, color distortion, structure blurring, and
loss of details. Our code is available at
https://github.com/wanchang-yu/Structure-Guided-Diffusion-for-Portrait-Shadow-Removal.
comment: Accepted by ICCV2025
♻ ☆ MGA-Net: A Novel Mask-Guided Attention Neural Network for Precision Neonatal Brain Imaging
Bahram Jafrasteh, Simon Pedro Lubian-Lopez, Emiliano Trimarco, Macarena Roman Ruiz, Carmen Rodriguez Barrios, Yolanda Marin Almagro, Isabel Benavente-Fernandez
In this study, we introduce MGA-Net, a novel mask-guided attention neural
network, which extends the U-net model for precision neonatal brain imaging.
MGA-Net is designed to extract the brain from other structures and reconstruct
high-quality brain images. The network employs a common encoder and two
decoders: one for brain mask extraction and the other for brain region
reconstruction. A key feature of MGA-Net is its high-level mask-guided
attention module, which leverages features from the brain mask decoder to
enhance image reconstruction. To enable the same encoder and decoder to process
both MRI and ultrasound (US) images, MGA-Net integrates sinusoidal positional
encoding. This encoding assigns distinct positional values to MRI and US
images, allowing the model to effectively learn from both modalities.
Consequently, features learned from a single modality can aid in learning a
modality with less available data, such as US. We extensively validated the
proposed MGA-Net on diverse and independent datasets from varied clinical
settings and neonatal age groups. The metrics used for assessment included the
DICE similarity coefficient, recall, and accuracy for image segmentation;
structural similarity for image reconstruction; and root mean squared error for
total brain volume estimation from 3D ultrasound images. Our results
demonstrate that MGA-Net significantly outperforms traditional methods,
offering superior performance in brain extraction and segmentation while
achieving high precision in image reconstruction and volumetric analysis. Thus,
MGA-Net represents a robust and effective preprocessing tool for MRI and 3D
ultrasound images, marking a significant advance in neuroimaging that enhances
both research and clinical diagnostics in the neonatal period and beyond. Our
code is available at https://github.com/BahramJafrasteh/MGA-Net
♻ ☆ WASABI: A Metric for Evaluating Morphometric Plausibility of Synthetic Brain MRIs
Generative models enhance neuroimaging through data augmentation, quality
improvement, and rare condition studies. Despite advances in realistic
synthetic MRIs, evaluations focus on texture and perception, lacking
sensitivity to crucial anatomical fidelity. This study proposes a new metric,
called WASABI (Wasserstein-Based Anatomical Brain Index), to assess the
anatomical realism of synthetic brain MRIs. WASABI leverages SynthSeg,
a deep learning-based brain parcellation tool, to derive volumetric measures of
brain regions in each MRI and uses the multivariate Wasserstein distance to
compare distributions between real and synthetic anatomies. Based on controlled
experiments on two real datasets and synthetic MRIs from five generative
models, WASABI demonstrates higher sensitivity in quantifying anatomical
discrepancies compared to traditional image-level metrics, even when synthetic
images achieve near-perfect visual quality. Our findings advocate for shifting
the evaluation paradigm beyond visual inspection and conventional metrics,
emphasizing anatomical fidelity as a crucial benchmark for clinically
meaningful brain MRI synthesis. Our code is available at
https://github.com/BahramJafrasteh/wasabi-mri.
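A simplified sketch of the idea: compare distributions of regional brain volumes
(e.g., from SynthSeg) between real and synthetic cohorts. For brevity we average
per-region 1-D Wasserstein distances, whereas WASABI uses a multivariate distance;
the volume arrays are placeholders.

    import numpy as np
    from scipy.stats import wasserstein_distance

    # Placeholder regional brain volumes (subjects x regions), e.g. from SynthSeg.
    rng = np.random.default_rng(0)
    real_vols = rng.normal(loc=1000, scale=50, size=(200, 32))
    synth_vols = rng.normal(loc=990, scale=60, size=(200, 32))

    # Simplified anatomical score: mean per-region 1-D Wasserstein distance
    # (WASABI itself compares the joint distributions with a multivariate distance).
    score = np.mean([wasserstein_distance(real_vols[:, r], synth_vols[:, r])
                     for r in range(real_vols.shape[1])])
    print(f"anatomical discrepancy: {score:.2f} (lower = more plausible)")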
♻ ☆ PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation
Ao Wang, Hui Chen, Jiaxin Li, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Zijia Lin, Jungong Han, Guiguang Ding
Recently, large vision-language models (LVLMs) have rapidly gained popularity
for their strong generation and reasoning capabilities given diverse multimodal
inputs. However, these models incur significant computational and memory
overhead during inference, which greatly hinders the efficient deployment in
practical scenarios. The extensive key-value (KV) cache, necessitated by the
lengthy input and output sequences, notably contributes to the high inference
cost. Based on this, recent works have investigated ways to reduce the KV cache
size for higher efficiency. Although effective, they generally overlook the
distinct importance distributions of KV vectors across layers and maintain the
same cache size for each layer during the next token prediction. This results
in significant contextual information loss for certain layers, leading to a
notable performance decline. To address this, we present PrefixKV. It reframes
the challenge of determining KV cache sizes for all layers into the task of
searching for the optimal global prefix configuration. With an adaptive
layer-wise KV retention recipe based on binary search, the maximum contextual
information can thus be preserved in each layer, facilitating the generation.
Extensive experiments demonstrate that our method achieves the state-of-the-art
performance compared with others. It exhibits superior inference efficiency and
generation quality trade-offs, showing promising potential for practical
applications. Code is available at https://github.com/THU-MIG/PrefixKV.
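As a loose illustration of searching for a global prefix configuration by binary
search, the snippet below tunes a shared importance quantile so that the per-layer
retained KV entries fit a total budget; this is our sketch under stated
assumptions, not the authors' recipe.

    import numpy as np

    def layer_prefix_sizes(importance_profiles, budget, iters=30):
        """Binary-search a shared importance quantile so that keeping, per layer,
        the positions above it fits a total KV-cache budget (rough sketch only)."""
        lo, hi = 0.0, 1.0                       # lo: over budget, hi: within budget
        for _ in range(iters):
            mid = (lo + hi) / 2
            kept = sum(int((p >= np.quantile(p, mid)).sum()) for p in importance_profiles)
            lo, hi = (mid, hi) if kept > budget else (lo, mid)
        return [int((p >= np.quantile(p, hi)).sum()) for p in importance_profiles]

    rng = np.random.default_rng(0)
    profiles = [rng.random(1024) for _ in range(32)]   # synthetic per-layer importance
    sizes = layer_prefix_sizes(profiles, budget=8192)
    print(sum(sizes), min(sizes), max(sizes))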
comment: 12 pages, 5 figures;
♻ ☆ BayesSDF: Surface-Based Laplacian Uncertainty Estimation for 3D Geometry with Neural Signed Distance Fields ICCV 2025
Quantifying uncertainty in neural implicit 3D representations, particularly
those utilizing Signed Distance Functions (SDFs), remains a substantial
challenge due to computational inefficiencies, scalability issues, and
geometric inconsistencies. Existing methods typically neglect direct geometric
integration, leading to poorly calibrated uncertainty maps. We introduce
BayesSDF, a novel probabilistic framework for uncertainty quantification in
neural implicit SDF models, motivated by scientific simulation applications
with 3D environments (e.g., forests) such as modeling fluid flow through
forests, where precise surface geometry and reliable uncertainty estimates are
essential. Unlike radiance-based models such as Neural Radiance Fields (NeRF)
or 3D Gaussian splatting, which lack explicit surface formulations, Signed
Distance Functions (SDFs) define continuous and differentiable geometry, making
them better suited for physical modeling and analysis. BayesSDF leverages a
Laplace approximation to quantify local surface instability using Hessian-based
metrics, enabling efficient, surface-aware uncertainty estimation. Our method
shows that uncertainty predictions correspond closely with poorly reconstructed
geometry, providing actionable confidence measures for downstream use.
Extensive evaluations on synthetic and real-world datasets demonstrate that
BayesSDF outperforms existing methods in both calibration and geometric
consistency, establishing a strong foundation for uncertainty-aware 3D scene
reconstruction, simulation, and robotic decision-making.
comment: ICCV 2025 Workshops (8 Pages, 6 Figures, 2 Tables)
♻ ☆ Adapting OpenAI's CLIP Model for Few-Shot Image Inspection in Manufacturing Quality Control: An Expository Case Study with Multiple Application Examples
Fadel M. Megahed, Ying-Ju Chen, Bianca Maria Colosimo, Marco Luigi Giuseppe Grasso, L. Allison Jones-Farmer, Sven Knoth, Hongyue Sun, Inez Zwetsloot
This expository paper introduces a simplified approach to image-based quality
inspection in manufacturing using OpenAI's CLIP (Contrastive Language-Image
Pretraining) model adapted for few-shot learning. While CLIP has demonstrated
impressive capabilities in general computer vision tasks, its direct
application to manufacturing inspection presents challenges due to the domain
gap between its training data and industrial applications. We evaluate CLIP's
effectiveness through five case studies: metallic pan surface inspection, 3D
printing extrusion profile analysis, stochastic textured surface evaluation,
automotive assembly inspection, and microstructure image classification. Our
results show that CLIP can achieve high classification accuracy with relatively
small learning sets (50-100 examples per class) for single-component and
texture-based applications. However, the performance degrades with complex
multi-component scenes. We provide a practical implementation framework that
enables quality engineers to quickly assess CLIP's suitability for their
specific applications before pursuing more complex solutions. This work
establishes CLIP-based few-shot learning as an effective baseline approach that
balances implementation simplicity with robust performance, demonstrated in
several manufacturing quality control applications.
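A minimal sketch of few-shot inspection with CLIP image embeddings and a
nearest-class-centroid rule, using the interface of OpenAI's open-source clip
package; the synthetic images below stand in for labelled inspection photos.

    import torch
    import clip                      # OpenAI's CLIP package
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def embed(images):               # images: list of PIL images
        batch = torch.stack([preprocess(im) for im in images]).to(device)
        with torch.no_grad():
            feats = model.encode_image(batch)
        return feats / feats.norm(dim=-1, keepdim=True)

    # Few-shot "training": one centroid per class from a small labelled set
    # (synthetic images here stand in for 50-100 inspection photos per class).
    good = embed([Image.new("RGB", (224, 224), (190, 190, 190)) for _ in range(4)])
    defect = embed([Image.new("RGB", (224, 224), (60, 60, 60)) for _ in range(4)])
    centroids = torch.stack([good.mean(0), defect.mean(0)])

    # Inspection: assign a new image to the nearest class centroid.
    query = embed([Image.new("RGB", (224, 224), (185, 185, 185))])
    label = ["good", "defect"][int((query @ centroids.T).argmax())]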
comment: 36 pages, 13 figures
♻ ☆ Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method ICCV 2025
Detecting and tracking ground objects using earth observation imagery remains
a significant challenge in the field of remote sensing. Continuous maritime
ship tracking is crucial for applications such as maritime search and rescue,
law enforcement, and shipping analysis. However, most current ship tracking
methods rely on geostationary satellites or video satellites. The former offer
low resolution and are susceptible to weather conditions, while the latter have
short filming durations and limited coverage areas, making them less suitable
for the real-world requirements of ship tracking. To address these limitations,
we present the Hybrid Optical and Synthetic Aperture Radar (SAR) Ship
Re-Identification Dataset (HOSS ReID dataset), designed to evaluate the
effectiveness of ship tracking using low-Earth orbit constellations of optical
and SAR sensors. This approach ensures shorter re-imaging cycles and enables
all-weather tracking. The HOSS ReID dataset includes images of the same ship
captured over extended periods under diverse conditions, using different
satellites of different modalities at varying times and angles. Furthermore, we
propose a baseline method for cross-modal ship re-identification, TransOSS,
which is built on the Vision Transformer architecture. It refines the patch
embedding structure to better accommodate cross-modal tasks, incorporates
additional embeddings to introduce more reference information, and employs
contrastive learning to pre-train on large-scale optical-SAR image pairs,
ensuring the model's ability to extract modality-invariant features. Our
dataset and baseline method are publicly available on
https://github.com/Alioth2000/Hoss-ReID.
comment: Accepted to ICCV 2025
♻ ☆ SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples
Dren Fazlija, Monty-Maximilian Zühlke, Johanna Schrader, Arkadij Orlov, Clara Stein, Iyiola E. Olatunji, Daniel Kudenko
Unrestricted adversarial attacks aim to fool computer vision models without
being constrained by $\ell_p$-norm bounds to remain imperceptible to humans,
for example, by changing an object's color. This allows attackers to circumvent
traditional, norm-bounded defense strategies such as adversarial training or
certified defense strategies. However, due to their unrestricted nature, there
are also no guarantees of norm-based imperceptibility, necessitating human
evaluations to verify just how authentic these adversarial examples look. While
some related work assesses this vital quality of adversarial attacks, none
provide statistically significant insights. This issue necessitates a unified
framework that supports and streamlines such an assessment for evaluating and
comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an
open-source, statistically powered framework for evaluating unrestricted
adversarial examples. Our contributions are: $(i)$ best-practice guidelines for
crowd-study power, compensation, and Likert equivalence bounds to measure
imperceptibility; $(ii)$ the first large-scale human vs. model comparison
across 346 human participants showing that three color-space attacks and three
diffusion-based attacks fail to produce imperceptible images. Furthermore, we
found that GPT-4o can serve as a preliminary test for imperceptibility, but it
only consistently detects adversarial examples for four out of six tested
attacks; $(iii)$ open-source software tools, including a browser-based task
template to collect annotations and analysis scripts in Python and R; $(iv)$ an
ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial
examples, and over 34K human ratings. Our findings demonstrate that automated
vision systems do not align with human perception, reinforcing the need for a
ground-truth SCOOTER benchmark.
comment: 42 pages, 16 figures, 11 tables, Under Review, Code:
https://github.com/DrenFazlija/Scooter, Data:
https://doi.org/10.5281/zenodo.15771501
♻ ☆ AI-driven visual monitoring of industrial assembly tasks
Visual monitoring of industrial assembly tasks is critical for preventing
equipment damage due to procedural errors and ensuring worker safety. Although
commercial solutions exist, they typically require rigid workspace setups or
the application of visual markers to simplify the problem. We introduce ViMAT,
a novel AI-driven system for real-time visual monitoring of assembly tasks that
operates without these constraints. ViMAT combines a perception module that
extracts visual observations from multi-view video streams with a reasoning
module that infers the most likely action being performed based on the observed
assembly state and prior task knowledge. We validate ViMAT on two assembly
tasks, involving the replacement of LEGO components and the reconfiguration of
hydraulic press molds, demonstrating its effectiveness through quantitative and
qualitative analysis in challenging real-world scenarios characterized by
partial and uncertain visual observations. Project page:
https://tev-fbk.github.io/ViMAT
♻ ☆ Average Calibration Error: A Differentiable Loss for Improved Reliability in Image Segmentation
Deep neural networks for medical image segmentation often produce
overconfident results misaligned with empirical observations. Such
miscalibration challenges their clinical translation. We propose to use
marginal L1 average calibration error (mL1-ACE) as a novel auxiliary loss
function to improve pixel-wise calibration without compromising segmentation
quality. We show that this loss, despite using hard binning, is directly
differentiable, bypassing the need for approximate but differentiable surrogate
or soft binning approaches. Our work also introduces the concept of dataset
reliability histograms, which generalise standard reliability diagrams for
refined visual assessment of calibration in semantic segmentation aggregated at
the dataset level. Using mL1-ACE, we reduce average and maximum calibration
error by 45% and 55% respectively, maintaining a Dice score of 87% on the BraTS
2021 dataset. We share our code here: https://github.com/cai4cai/ACE-DLIRIS
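Our reading of the loss (not the authors' implementation, which is linked above):
a marginal, hard-binned L1 average calibration error that stays differentiable
with respect to the predicted probabilities because gradients flow through the
per-bin mean confidences.

    import torch

    def ml1_ace(probs, targets, n_bins=10):
        """Marginal L1 average calibration error with hard binning (a sketch).

        probs:   (B, C, H, W) softmax probabilities
        targets: (B, H, W)    integer labels
        """
        n_classes = probs.shape[1]
        onehot = torch.nn.functional.one_hot(targets, n_classes).permute(0, 3, 1, 2).float()
        edges = torch.linspace(0, 1, n_bins + 1, device=probs.device)
        loss = probs.new_zeros(())
        for c in range(n_classes):                  # "marginal": per-class calibration
            p, y = probs[:, c].reshape(-1), onehot[:, c].reshape(-1)
            ace, used = probs.new_zeros(()), 0
            for b in range(n_bins):
                in_bin = (p > edges[b]) & (p <= edges[b + 1])
                if in_bin.any():                    # hard bin assignment
                    ace = ace + (p[in_bin].mean() - y[in_bin].mean()).abs()
                    used += 1
            loss = loss + ace / max(used, 1)
        return loss / n_classes

    logits = torch.randn(2, 4, 32, 32, requires_grad=True)
    loss = ml1_ace(torch.softmax(logits, dim=1), torch.randint(0, 4, (2, 32, 32)))
    loss.backward()        # gradients reach the logits despite the hard binning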
comment: Camera ready version as in 10.1007/978-3-031-72114-4_14
♻ ☆ Screen Them All: High-Throughput Pan-Cancer Genetic and Phenotypic Biomarker Screening from H&E Whole Slide Images
Yi Kan Wang, Ludmila Tydlitatova, Jeremy D. Kunz, Gerard Oakley, Bonnie Kar Bo Chow, Ran A. Godrich, Matthew C. H. Lee, Hamed Aghdam, Alican Bozkurt, Michal Zelechowski, Chad Vanderbilt, Christopher Kanan, Juan A. Retamero, Peter Hamilton, Razik Yousfi, Thomas J. Fuchs, David S. Klimstra, Siqi Liu
Molecular assays are standard of care for detecting genomic alterations in
cancer prognosis and therapy selection but are costly, tissue-destructive and
time-consuming. Artificial intelligence (AI) applied to routine hematoxylin and
eosin (H&E)-stained whole slide images (WSIs) offers a fast and economical
alternative for screening molecular biomarkers. We introduce OmniScreen, a
high-throughput AI-based system leveraging Virchow2 embeddings extracted from
60,529 cancer patients with paired 489-gene MSK-IMPACT targeted biomarker panel
and WSIs. Unlike conventional approaches that train separate models for each
biomarker, OmniScreen employs a unified model to predict a broad range of
clinically relevant biomarkers across cancers, including low-prevalence targets
impractical to model individually. OmniScreen reliably identifies therapeutic
targets and shared phenotypic features across common and rare tumors. We
investigate the biomarker prediction probabilities and accuracies of OmniScreen
in relation to tumor area, cohort size, histologic subtype alignment, and
pathway-level morphological patterns. These findings underscore the potential
of OmniScreen for routine clinical screening.
♻ ☆ Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention
Blending visual and textual concepts into a new visual concept is a unique
and powerful trait of human beings that can fuel creativity. However, in
practice, cross-modal conceptual blending for humans is prone to cognitive
biases, like design fixation, which leads to local minima in the design space.
In this paper, we propose a T2I diffusion adapter "IT-Blender" that can
automate the blending process to enhance human creativity. Prior works related
to cross-modal conceptual blending are limited in encoding a real image without
loss of details or in disentangling the image and text inputs. To address these
gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend
the latent representations of a clean reference image with those of the noisy
generated image. Combined with our novel blended attention, IT-Blender encodes
the real reference image without loss of details and blends the visual concept
with the object specified by the text in a disentangled way. Our experiment
results show that IT-Blender outperforms the baselines by a large margin in
blending visual and textual concepts, shedding light on the new application of
image generative models to augment human creativity.
comment: Project website is available at https://imagineforme.github.io/
♻ ☆ RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction ICCV 2025
We introduce RIPE, an innovative reinforcement learning-based framework for
weakly-supervised training of a keypoint extractor that excels in both
detection and description tasks. In contrast to conventional training regimes
that depend heavily on artificial transformations, pre-generated models, or 3D
data, RIPE requires only a binary label indicating whether paired images
represent the same scene. This minimal supervision significantly expands the
pool of training data, enabling the creation of a highly generalized and robust
keypoint extractor.
RIPE utilizes the encoder's intermediate layers for the description of the
keypoints with a hyper-column approach to integrate information from different
scales. Additionally, we propose an auxiliary loss to enhance the
discriminative capability of the learned descriptors.
Comprehensive evaluations on standard benchmarks demonstrate that RIPE
simplifies data preparation while achieving competitive performance compared to
state-of-the-art techniques, marking a significant advancement in robust
keypoint extraction and description. To support further research, we have made
our code publicly available at https://github.com/fraunhoferhhi/RIPE.
comment: ICCV 2025
♻ ☆ On the development of an AI performance and behavioural measures for teaching and classroom management
Andreea I. Niculescu, Jochen Ehnes, Chen Yi, Du Jiawei, Tay Chiat Pin, Joey Tianyi Zhou, Vigneshwaran Subbaraju, Teh Kah Kuan, Tran Huy Dat, John Komar, Gi Soong Chee, Kenneth Kwok
This paper presents a two-year research project focused on developing
AI-driven measures to analyze classroom dynamics, with particular emphasis on
teacher actions captured through multimodal sensor data. We applied real-time
data from classroom sensors and AI techniques to extract meaningful insights
and support teacher development. Key outcomes include a curated audio-visual
dataset, novel behavioral measures, and a proof-of-concept teaching review
dashboard. An initial evaluation with eight researchers from the National
Institute for Education (NIE) highlighted the system's clarity, usability, and
its non-judgmental, automated analysis approach -- which reduces manual
workloads and encourages constructive reflection. Although the current version
does not assign performance ratings, it provides an objective snapshot of
in-class interactions, helping teachers recognize and improve their
instructional strategies. Designed and tested in an Asian educational context,
this work also contributes a culturally grounded methodology to the growing
field of AI-based educational analytics.
comment: 7 pages, 10 figures, A video demonstration of the teacher trainer
dashboard can be accessed here: https://vimeo.com/1076482827
♻ ☆ Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Recent advances in DUSt3R have enabled robust estimation of dense point
clouds and camera parameters of static scenes, leveraging Transformer network
architectures and direct supervision on large-scale 3D datasets. In contrast,
the limited scale and diversity of available 4D datasets present a major
bottleneck for training a highly generalizable 4D model. This constraint has
driven conventional 4D methods to fine-tune 3D models on scalable dynamic video
data with additional geometric priors such as optical flow and depths. In this
work, we take an opposite path and introduce Easi3R, a simple yet efficient
training-free method for 4D reconstruction. Our approach applies attention
adaptation during inference, eliminating the need for from-scratch pre-training
or network fine-tuning. We find that the attention layers in DUSt3R inherently
encode rich information about camera and object motion. By carefully
disentangling these attention maps, we achieve accurate dynamic region
segmentation, camera pose estimation, and 4D dense point map reconstruction.
Extensive experiments on real-world dynamic videos demonstrate that our
lightweight attention adaptation significantly outperforms previous
state-of-the-art methods that are trained or finetuned on extensive dynamic
datasets. Our code is publicly available for research purpose at
https://easi3r.github.io/
comment: Page: https://easi3r.github.io/ Code:
https://github.com/Inception3D/Easi3R
♻ ☆ SLGaussian: Fast Language Gaussian Splatting in Sparse Views ACM MM 2025
3D semantic field learning is crucial for applications like autonomous
navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from
limited viewpoints is essential. Existing methods struggle under sparse view
conditions, relying on inefficient per-scene multi-view optimizations, which
are impractical for many real-world tasks. To address this, we propose
SLGaussian, a feed-forward method for constructing 3D semantic fields from
sparse viewpoints, allowing direct inference of 3DGS-based scenes. By ensuring
consistent SAM segmentations through video tracking and using low-dimensional
indexing for high-dimensional CLIP features, SLGaussian efficiently embeds
language information in 3D space, offering a robust solution for accurate 3D
scene understanding under sparse view conditions. In experiments on two-view
sparse 3D object querying and segmentation in the LERF and 3D-OVS datasets,
SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy,
and mIoU. Moreover, our model achieves scene inference in under 30 seconds and
open-vocabulary querying in just 0.011 seconds per query.
comment: Accepted by ACM MM 2025. Project page:
https://chenkangjie1123.github.io/SLGaussian.github.io/
♻ ☆ CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images
Jungho Lee, Donghyeong Kim, Dogyoon Lee, Suhwan Cho, Minhyeok Lee, Wonjoon Lee, Taeoh Kim, Dongyoon Wee, Sangyoun Lee
3D Gaussian Splatting (3DGS) has gained significant attention due to its
high-quality novel view rendering, motivating research to address real-world
challenges. A critical issue is the camera motion blur caused by movement
during exposure, which hinders accurate 3D scene reconstruction. In this study,
we propose CoMoGaussian, a Continuous Motion-Aware Gaussian Splatting that
reconstructs precise 3D scenes from motion-blurred images while maintaining
real-time rendering speed. Considering the complex motion patterns inherent in
real-world camera movements, we predict continuous camera trajectories using
neural ordinary differential equations (ODEs). To ensure accurate modeling, we
employ rigid body transformations, preserving the shape and size of the object
but relying on discrete integration of sampled frames. To better approximate
the continuous nature of motion blur, we introduce a continuous motion
refinement (CMR) transformation that refines rigid transformations by
incorporating additional learnable parameters. By revisiting fundamental camera
theory and leveraging advanced neural ODE techniques, we achieve precise
modeling of continuous camera trajectories, leading to improved reconstruction
accuracy. Extensive experiments demonstrate state-of-the-art performance both
quantitatively and qualitatively on benchmark datasets, which include a wide
range of motion blur scenarios, from moderate to extreme blur.
comment: Revised Version of CRiM-GS, Project Page:
https://Jho-Yonsei.github.io/CoMoGaussian
♻ ☆ Sparfels: Fast Reconstruction from Sparse Unposed Imagery ICCV 2025
We present a method for Sparse view reconstruction with surface element
splatting that runs within 3 minutes on a consumer grade GPU. While few methods
address sparse radiance field learning from noisy or unposed sparse cameras,
shape recovery remains relatively underexplored in this setting. Several
radiance and shape learning test-time optimization methods address the sparse
posed setting by learning data priors or using combinations of external
monocular geometry priors. Differently, we propose an efficient and simple
pipeline harnessing a single recent 3D foundation model. We leverage its
various task heads, notably point maps and camera initializations to
instantiate a bundle adjusting 2D Gaussian Splatting (2DGS) model, and image
correspondences to guide camera optimization during 2DGS training. Key to our
contribution is a novel formulation of splatted color variance along rays,
which can be computed efficiently. Reducing this moment in training leads to
more accurate shape reconstructions. We demonstrate state-of-the-art
performances in the sparse uncalibrated setting in reconstruction and novel
view benchmarks based on established multi-view datasets.
comment: ICCV 2025. Project page :
https://shubhendu-jena.github.io/Sparfels-web/
♻ ☆ Multispectral Detection Transformer with Infrared-Centric Feature Fusion
Multispectral object detection aims to leverage complementary information
from visible (RGB) and infrared (IR) modalities to enable robust performance
under diverse environmental conditions. Our key insight, derived from wavelet
analysis and empirical observations, is that IR images contain structurally
rich high-frequency information critical for object detection, making an
infrared-centric approach highly effective. To capitalize on this finding, we
propose Infrared-Centric Fusion (IC-Fusion), a lightweight and modality-aware
sensor fusion method that prioritizes infrared features while effectively
integrating complementary RGB semantic context. IC-Fusion adopts a compact RGB
backbone and designs a novel fusion module comprising a Multi-Scale Feature
Distillation (MSFD) block to enhance RGB features and a three-stage fusion
block with a Cross-Modal Channel Shuffle Gate (CCSG), a Cross-Modal Large
Kernel Gate (CLKG), and a Channel Shuffle Projection (CSP) to facilitate
effective cross-modal interaction. Experiments on the FLIR and LLVIP benchmarks
demonstrate the superior effectiveness and efficiency of our IR-centric fusion
strategy, further validating its benefits. Our code is available at
https://github.com/smin-hwang/IC-Fusion.
comment: Under Review
♻ ☆ Advancing Automatic Photovoltaic Defect Detection using Semi-Supervised Semantic Segmentation of Electroluminescence Images
Photovoltaic (PV) systems allow us to tap into abundant solar energy; however,
they require regular maintenance to sustain high efficiency and prevent
degradation. Traditional manual health checks using Electroluminescence (EL)
imaging are expensive and logistically challenging, which makes automated defect
detection essential. Current automation approaches require extensive manual
expert labeling, which is time-consuming, expensive, and prone to errors. We
propose PV-S3 (Photovoltaic-Semi-supervised Semantic Segmentation), a
Semi-Supervised Learning approach for semantic segmentation of defects in EL
images that reduces reliance on extensive labeling. PV-S3 is an artificial
intelligence (AI) model trained using a few labeled images along with numerous
unlabeled images. We introduce a novel Semi Cross-Entropy loss function to deal
with class imbalance. We evaluate PV-S3 on multiple datasets and demonstrate
its effectiveness and adaptability. With merely 20% labeled samples, we achieve
an absolute improvement of 9.7% in mean Intersection-over-Union (mIoU), 13.5%
in Precision, 29.15% in Recall, and 20.42% in F1-Score over the prior
state-of-the-art supervised method (which uses 100% labeled samples) on the
University of Central Florida-Electroluminescence (UCF-EL) dataset (the largest
dataset available for semantic segmentation of EL images), improving performance
while reducing annotation costs by 80%. For more details,
visit our GitHub repository: https://github.com/abj247/PV-S3.
comment: 19 pages, 10 figures
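A generic sketch of a semi-supervised segmentation objective in the spirit described above; it is NOT the paper's exact Semi Cross-Entropy loss. It combines class-weighted cross-entropy on labeled pixels with a confidence-thresholded pseudo-label term on unlabeled pixels; the threshold and weighting are assumptions.
```python
# Sketch: class-weighted supervised CE + pseudo-label CE on confident unlabeled pixels.
import torch
import torch.nn.functional as F

def semi_supervised_seg_loss(logits_l, labels, logits_u, class_weights,
                             conf_thresh=0.9, unlabeled_weight=0.5):
    # Supervised term with per-class weights to counter class imbalance.
    sup = F.cross_entropy(logits_l, labels, weight=class_weights)

    # Pseudo-labels from the model's own predictions on unlabeled images.
    probs = F.softmax(logits_u.detach(), dim=1)
    conf, pseudo = probs.max(dim=1)
    mask = conf > conf_thresh                            # keep only confident pixels
    if mask.any():
        unsup = F.cross_entropy(logits_u, pseudo, weight=class_weights,
                                reduction="none")
        unsup = (unsup * mask.float()).sum() / mask.float().sum()
    else:
        unsup = logits_u.sum() * 0.0                     # no confident pixels this batch
    return sup + unlabeled_weight * unsup

num_classes = 4
logits_l = torch.randn(2, num_classes, 32, 32)
labels = torch.randint(0, num_classes, (2, 32, 32))
logits_u = torch.randn(2, num_classes, 32, 32)
w = torch.tensor([0.2, 1.0, 2.0, 3.0])
print(semi_supervised_seg_loss(logits_l, labels, logits_u, w))
```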
♻ ☆ M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation
Audio-driven talking head generation holds significant potential for film
production. While existing 3D methods have advanced motion modeling and content
synthesis, they often produce rendering artifacts, such as motion blur,
temporal jitter, and local penetration, due to limitations in representing
stable, fine-grained motion fields. Through systematic analysis, we reformulate
talking head generation into a unified framework comprising three steps: video
preprocessing, motion representation, and rendering reconstruction. This
framework underpins our proposed M2DAO-Talker, which addresses current
limitations via multi-granular motion decoupling and alternating optimization.
Specifically, we devise a novel 2D portrait preprocessing pipeline to extract
frame-wise deformation control conditions (motion region segmentation masks and
camera parameters) to facilitate motion representation. To ameliorate
motion modeling, we elaborate a multi-granular motion decoupling strategy,
which independently models non-rigid (oral and facial) and rigid (head) motions
for improved reconstruction accuracy. Meanwhile, a motion consistency
constraint is developed to ensure head-torso kinematic consistency, thereby
mitigating penetration artifacts caused by motion aliasing. In addition, an
alternating optimization strategy is designed to iteratively refine facial and
oral motion parameters, enabling more realistic video generation. Experiments
across multiple datasets show that M2DAO-Talker achieves state-of-the-art
performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64
gain in user-evaluated video realness versus TalkingGaussian, while running at
150 FPS inference speed. Our project homepage is
https://m2dao-talker.github.io/M2DAO-Talk.github.io.
♻ ☆ Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights
Medical Visual Question Answering (MedVQA) is a promising tool to assist
radiologists by automating medical image interpretation through question
answering. Despite advances in models and datasets, MedVQA's integration into
clinical workflows remains limited. This study systematically reviews 68
publications (2018-2024) and surveys 50 clinicians from India and Thailand to
examine MedVQA's practical utility, challenges, and gaps. Following the Arksey
and O'Malley scoping review framework, we used a two-pronged approach: (1)
reviewing studies to identify key concepts, advancements, and research gaps in
radiology workflows, and (2) surveying clinicians to capture their perspectives
on MedVQA's clinical relevance. Our review reveals that nearly 60% of QA pairs
are non-diagnostic and lack clinical relevance. Most datasets and models do not
support multi-view, multi-resolution imaging, EHR integration, or domain
knowledge, features essential for clinical diagnosis. Furthermore, there is a
clear mismatch between current evaluation metrics and clinical needs. The
clinician survey confirms this disconnect: only 29.8% consider MedVQA systems
highly useful. Key concerns include the absence of patient history or domain
knowledge (87.2%), preference for manually curated datasets (51.1%), and the
need for multi-view image support (78.7%). Additionally, 66% favor models
focused on specific anatomical regions, and 89.4% prefer dialogue-based
interactive systems. While MedVQA shows strong potential, challenges such as
limited multimodal analysis, lack of patient context, and misaligned evaluation
approaches must be addressed for effective clinical integration.
comment: 29 pages, 5 figures (1 in supplementary), 3 tables (1 in main text, 2
in supplementary). Scoping review and clinician survey
♻ ☆ LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance ICCV 2025
Zhang Li, Biao Yang, Qiang Liu, Shuo Zhang, Zhiyin Ma, Shuo Zhang, Liang Yin, Linger Deng, Yabo Sun, Yuliang Liu, Xiang Bai
While large multi-modal models (LMMs) demonstrate promising capabilities in
segmentation and comprehension, they still struggle with two limitations:
inaccurate segmentation and hallucinated comprehension. These challenges stem
primarily from constraints in weak visual comprehension and a lack of
fine-grained perception. To alleviate these limitations, we propose LIRA, a
framework that capitalizes on the complementary relationship between visual
comprehension and segmentation via two key components: (1) Semantic-Enhanced
Feature Extractor (SEFE) improves object attribute inference by fusing semantic
and pixel-level features, leading to more accurate segmentation; (2)
Interleaved Local Visual Coupling (ILVC) autoregressively generates local
descriptions after extracting local features based on segmentation masks,
offering fine-grained supervision to mitigate hallucinations. Furthermore, we
find that the precision of object segmentation is positively correlated with
the latent semantics associated with the token. To quantify this relationship
and the model's potential semantic inference ability, we introduce the
Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA
achieves state-of-the-art performance in both segmentation and comprehension
tasks. Code will be available at https://github.com/echo840/LIRA.
comment: ICCV 2025
♻ ☆ A Survey on Future Frame Synthesis: Bridging Deterministic and Generative Approaches
Ruibo Ming, Zhewei Huang, Jingwei Wu, Zhuoxuan Ju, Daxin Jiang, Jianming Hu, Lihui Peng, Shuchang Zhou
Future Frame Synthesis (FFS), the task of generating subsequent video frames
from context, represents a core challenge in machine intelligence and a
cornerstone for developing predictive world models. This survey provides a
comprehensive analysis of the FFS landscape, charting its critical evolution
from deterministic algorithms focused on pixel-level accuracy to modern
generative paradigms that prioritize semantic coherence and dynamic
plausibility. We introduce a novel taxonomy organized by algorithmic
stochasticity, which not only categorizes existing methods but also reveals the
fundamental drivers--advances in architectures, datasets, and computational
scale--behind this paradigm shift. Critically, our analysis identifies a
bifurcation in the field's trajectory: one path toward efficient, real-time
prediction, and another toward large-scale, generative world simulation. By
pinpointing key challenges and proposing concrete research questions for both
frontiers, this survey serves as an essential guide for researchers aiming to
advance the frontiers of visual dynamic modeling.
comment: TMLR 2025/07
♻ ☆ MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization
Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models
that compress continuous visual data into discrete tokens. Existing methods
have tried to improve the quantization strategy for better reconstruction
quality, however, there still exists a large gap between VQ-VAEs and VAEs. To
narrow this gap, we propose MGVQ, a novel method to augment the representation
capability of discrete codebooks, facilitating easier optimization for
codebooks and minimizing information loss, thereby enhancing reconstruction
quality. Specifically, we propose to retain the latent dimension to preserve
encoded features and incorporate a set of sub-codebooks for quantization.
Furthermore, we construct comprehensive zero-shot benchmarks featuring
resolutions of 512p and 2k to evaluate the reconstruction performance of
existing methods rigorously. MGVQ achieves the state-of-the-art performance on
both ImageNet and 8 zero-shot benchmarks across all VQ-VAEs. Notably, compared
with SD-VAE, MGVQ significantly outperforms it on ImageNet, with an rFID of 0.49
vs. 0.91, and achieves superior PSNR on all zero-shot benchmarks. These results
highlight the superiority of MGVQ in reconstruction and pave the way for
preserving fidelity in HD image processing tasks. Code will be publicly
available at https://github.com/MKJia/MGVQ.
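A minimal sketch of multi-group quantization as outlined above, with assumed details: the latent channels are split into G groups and each group is quantized against its own sub-codebook by nearest-neighbour lookup. Codebook sizes and shapes are illustrative only.
```python
# Sketch: split latent channels into groups, quantize each group with its own codebook.
import torch

def multi_group_quantize(z, codebooks):
    """z: (B, C, H, W); codebooks: list of G tensors, each (K, C // G)."""
    groups = torch.chunk(z, len(codebooks), dim=1)
    out = []
    for g, cb in zip(groups, codebooks):
        b, c, h, w = g.shape
        flat = g.permute(0, 2, 3, 1).reshape(-1, c)          # (B*H*W, C/G)
        dist = torch.cdist(flat, cb)                          # distances to code vectors
        idx = dist.argmin(dim=1)                              # nearest-neighbour indices
        q = cb[idx].reshape(b, h, w, c).permute(0, 3, 1, 2)
        out.append(q)
    return torch.cat(out, dim=1)                              # quantized latent

B, C, H, W, G, K = 2, 8, 4, 4, 4, 16
z = torch.randn(B, C, H, W)
codebooks = [torch.randn(K, C // G) for _ in range(G)]
print(multi_group_quantize(z, codebooks).shape)  # torch.Size([2, 8, 4, 4])
```
Keeping the full latent dimension while quantizing per group is what lets the set of sub-codebooks represent far more combinations than a single codebook of the same size.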
♻ ☆ Explaining the Impact of Training on Vision Models via Activation Clustering
This paper introduces Neuro-Activated Vision Explanations (NAVE), a method
for extracting and visualizing the internal representations of vision model
encoders. By clustering feature activations, NAVE provides insights into
learned semantics without fine-tuning. Using object localization, we show that
NAVE's concepts align with image semantics. Through extensive experiments, we
analyze the impact of training strategies and architectures on encoder
representation capabilities. Additionally, we apply NAVE to study training
artifacts in vision transformers and reveal how weak training strategies and
spurious correlations degrade model performance. Our findings establish NAVE as
a valuable tool for post-hoc model inspection and improving transparency in
vision models.
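A small sketch of the core operation, clustering encoder activations into a concept map; shapes and the choice of K-means settings are assumptions, not NAVE's exact pipeline.
```python
# Sketch: cluster patch-token features from a frozen encoder into a coarse concept map.
import numpy as np
from sklearn.cluster import KMeans

def activation_concept_map(patch_features, grid_hw, n_concepts=8, seed=0):
    """patch_features: (H*W, D) token features; grid_hw: (H, W) of the patch grid."""
    km = KMeans(n_clusters=n_concepts, n_init=10, random_state=seed)
    assignments = km.fit_predict(patch_features)
    return assignments.reshape(grid_hw)          # (H, W) map of concept indices

H, W, D = 14, 14, 384                            # e.g. a ViT-S/16 patch grid (assumed)
feats = np.random.randn(H * W, D).astype(np.float32)
concept_map = activation_concept_map(feats, (H, W))
print(concept_map.shape, concept_map.max())      # (14, 14) and an index < 8
```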
♻ ☆ HGSLoc: 3DGS-based Heuristic Camera Pose Refinement
Visual localization refers to the process of determining the camera position and
orientation within a known scene representation. This task is often complicated
by factors such as changes in illumination and variations in viewing angles. In
this paper, we propose HGSLoc, a novel lightweight plug-and-play pose
optimization framework, which integrates 3D reconstruction with a heuristic
refinement strategy to achieve higher pose estimation accuracy. Specifically,
we introduce an explicit geometric map for 3D representation and high-fidelity
rendering, allowing the generation of high-quality synthesized views to support
accurate visual localization. Our method demonstrates higher localization
accuracy compared to NeRF-based neural rendering localization approaches. We
introduce a heuristic refinement strategy that can quickly locate the target
node, together with a step-level optimization stage that enhances pose accuracy
in scenarios with small errors. With carefully designed heuristic functions, it offers efficient
optimization capabilities, enabling rapid error reduction in rough localization
estimations. Our method mitigates the dependence on complex neural network
models while demonstrating improved robustness against noise and higher
localization accuracy in challenging environments, as compared to neural
network joint optimization strategies. The optimization framework proposed in
this paper introduces novel approaches to visual localization by integrating
the advantages of 3D reconstruction and the heuristic refinement strategy,
which demonstrates strong performance across multiple benchmark datasets,
including the 7Scenes and Deep Blending datasets. The implementation of our method
has been released at https://github.com/anchang699/HGSLoc.
♻ ☆ Class-Aware PillarMix: Can Mixed Sample Data Augmentation Enhance 3D Object Detection with Radar Point Clouds? IROS 2025
Due to the significant effort required for data collection and annotation in
3D perception tasks, mixed sample data augmentation (MSDA) has been widely
studied to generate diverse training samples by mixing existing data. Recently,
many MSDA techniques have been developed for point clouds, but they mainly
target LiDAR data, leaving their application to radar point clouds largely
unexplored. In this paper, we examine the feasibility of applying existing MSDA
methods to radar point clouds and identify several challenges in adapting these
techniques. These obstacles stem from the radar's irregular angular
distribution, deviations from a single-sensor polar layout in multi-radar
setups, and point sparsity. To address these issues, we propose Class-Aware
PillarMix (CAPMix), a novel MSDA approach that applies MixUp at the pillar
level in 3D point clouds, guided by class labels. Unlike methods that apply a
single mix ratio to the entire sample, CAPMix assigns an independent ratio to
each pillar, boosting sample diversity. To account for the density of different
classes, we use class-specific distributions: for dense objects (e.g., large
vehicles), we skew ratios to favor points from another sample, while for sparse
objects (e.g., pedestrians), we sample more points from the original. This
class-aware mixing retains critical details and enriches each sample with new
information, ultimately generating more diverse training data. Experimental
results demonstrate that our method not only significantly boosts performance
but also outperforms existing MSDA approaches across two datasets (Bosch Street
and K-Radar). We believe that this straightforward yet effective approach will
spark further investigation into MSDA techniques for radar data.
comment: 8 pages, 6 figures, 4 tables, accepted to 2025 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS 2025)
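An illustrative sketch of class-aware, per-pillar mixing in the spirit described above. The pillar size, Beta parameters, and subsampling rule are assumptions, not the paper's exact distributions: each (x, y) pillar draws its own mix ratio from a class-dependent Beta distribution and then subsamples points from the two samples accordingly.
```python
# Sketch: per-pillar MixUp of two point clouds with class-dependent mix ratios.
import numpy as np

def pillar_ids(points, pillar_size=2.0):
    return np.floor(points[:, :2] / pillar_size).astype(np.int64)

def capmix(points_a, labels_a, points_b, beta_params, pillar_size=2.0, rng=None):
    """points_*: (N, 3+) arrays; labels_a: (N,) class id per point of sample A.
    beta_params: dict class_id -> (alpha, beta) controlling how much of A to keep."""
    rng = rng if rng is not None else np.random.default_rng(0)
    ids_a, ids_b = pillar_ids(points_a, pillar_size), pillar_ids(points_b, pillar_size)
    mixed = []
    for pid in np.unique(np.concatenate([ids_a, ids_b]), axis=0):
        in_a = np.all(ids_a == pid, axis=1)
        in_b = np.all(ids_b == pid, axis=1)
        if in_a.any():
            dominant = np.bincount(labels_a[in_a]).argmax()   # dominant class in pillar
            a, b = beta_params.get(int(dominant), (1.0, 1.0))
            r = rng.beta(a, b)                                # per-pillar mix ratio
        else:
            r = 0.0                                           # no A points: take B only
        keep_a = rng.random(in_a.sum()) < r
        keep_b = rng.random(in_b.sum()) < (1.0 - r)
        mixed.append(points_a[in_a][keep_a])
        mixed.append(points_b[in_b][keep_b])
    return np.concatenate(mixed, axis=0)

pa, pb = np.random.rand(200, 4) * 20, np.random.rand(180, 4) * 20
la = np.random.randint(0, 3, 200)
# Example: class 2 = sparse pedestrians, skewed towards keeping original points.
params = {0: (2.0, 2.0), 1: (1.0, 3.0), 2: (4.0, 1.0)}
print(capmix(pa, la, pb, params).shape)
```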
♻ ☆ Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence IEEE 28
The collection and release of street-level recordings as Open Data play a
vital role in advancing autonomous driving systems and AI research. However,
these datasets pose significant privacy risks, particularly for pedestrians,
due to the presence of Personally Identifiable Information (PII) that extends
beyond biometric traits such as faces. In this paper, we present cRID, a novel
cross-modal framework combining Large Vision-Language Models, Graph Attention
Networks, and representation learning to detect textual describable clues of
PII and enhance person re-identification (Re-ID). Our approach focuses on
identifying and leveraging interpretable features, enabling the detection of
semantically meaningful PII beyond low-level appearance cues. We conduct a
systematic evaluation of PII presence in person image datasets. Our experiments
show improved performance in practical cross-dataset Re-ID scenarios, notably
from Market-1501 to CUHK03-np (detected), highlighting the framework's
practical utility. Code is available at https://github.com/RAufschlaeger/cRID.
comment: accepted for publication at the 2025 IEEE 28th International
Conference on Intelligent Transportation Systems (ITSC 2025), taking place
during November 18-21, 2025 in Gold Coast, Australia
♻ ☆ CVVNet: A Cross-Vertical-View Network for Gait Recognition
Gait recognition enables contact-free, long-range person identification that
is robust to clothing variations and non-cooperative scenarios. While existing
methods perform well in controlled indoor environments, they struggle with
cross-vertical view scenarios, where surveillance angles vary significantly in
elevation. Our experiments show up to 60% accuracy degradation in low-to-high
vertical view settings due to severe deformations and self-occlusions of key
anatomical features. Current CNN and self-attention-based methods fail to
effectively handle these challenges, due to their reliance on single-scale
convolutions or simplistic attention mechanisms that lack effective
multi-frequency feature integration. To tackle this challenge, we propose
CVVNet (Cross-Vertical-View Network), a frequency aggregation architecture
specifically designed for robust cross-vertical-view gait recognition. CVVNet
employs a High-Low Frequency Extraction module (HLFE) that adopts parallel
multi-scale convolution/max-pooling path and self-attention path as high- and
low-frequency mixers for effective multi-frequency feature extraction from
input silhouettes. We also introduce the Dynamic Gated Aggregation (DGA)
mechanism to adaptively adjust the fusion ratio of high- and low-frequency
features. The integration of our core Multi-Scale Attention Gated Aggregation
(MSAGA) module, HLFE and DGA enables CVVNet to effectively handle distortions
from view changes, significantly improving the recognition robustness across
different vertical views. Experimental results show that our CVVNet achieves
state-of-the-art performance, with an 8.6% improvement on DroneGait and 2% on
Gait3D compared with the best existing methods.
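A minimal sketch of dynamic gated aggregation of high- and low-frequency feature maps; module and channel sizes are assumptions rather than the paper's exact DGA design. A learned gate decides, per location, how to blend the two branches.
```python
# Sketch: learned per-location gate blending high- and low-frequency features.
import torch
import torch.nn as nn

class DynamicGatedAggregation(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, high_freq, low_freq):
        g = self.gate(torch.cat([high_freq, low_freq], dim=1))  # (B, C, H, W) in [0, 1]
        return g * high_freq + (1.0 - g) * low_freq             # adaptive fusion

dga = DynamicGatedAggregation(channels=64)
hf, lf = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)
print(dga(hf, lf).shape)  # torch.Size([2, 64, 16, 16])
```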
♻ ☆ ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge
Edge computing enables data processing closer to the source, significantly
reducing latency, an essential requirement for real-time vision-based analytics
such as object detection in surveillance and smart city environments. However,
these tasks place substantial demands on resource-constrained edge devices,
making the joint optimization of energy consumption and detection accuracy
critical. To address this challenge, we propose ECORE, a framework that
integrates multiple dynamic routing strategies, including estimation-based
techniques and a greedy selection algorithm to direct image processing requests
to the most suitable edge device-model pair. ECORE dynamically balances energy
efficiency and detection performance based on object characteristics. We
evaluate our approach through extensive experiments on real-world datasets,
comparing the proposed routers against widely used baseline techniques. The
evaluation leverages established object detection models (YOLO, SSD,
EfficientDet) and diverse edge platforms, including Jetson Orin Nano, Raspberry
Pi 4 and 5, and TPU accelerators. Results demonstrate that our proposed
context-aware routing strategies can reduce energy consumption and latency by
45% and 49%, respectively, while incurring only a 2% loss in detection accuracy
compared to accuracy-centric methods.
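A schematic sketch of one plausible greedy routing rule in the spirit of the description above (not the paper's exact algorithm): among device-model pairs whose estimated accuracy stays within a tolerance of the best option, pick the one with the lowest estimated energy per request. The device and model profiles are made up for illustration.
```python
# Sketch: greedy energy-aware selection of an edge device-model pair.
from dataclasses import dataclass

@dataclass
class PairProfile:
    device: str
    model: str
    est_accuracy: float   # estimated detection accuracy for this request
    est_energy_j: float   # estimated energy per request in joules

def greedy_route(profiles, accuracy_tolerance=0.02):
    best_acc = max(p.est_accuracy for p in profiles)
    eligible = [p for p in profiles if p.est_accuracy >= best_acc - accuracy_tolerance]
    return min(eligible, key=lambda p: p.est_energy_j)   # cheapest among near-best

profiles = [
    PairProfile("jetson-orin-nano", "yolo", 0.81, 2.4),
    PairProfile("raspberry-pi-5", "ssd", 0.78, 1.1),
    PairProfile("tpu-accelerator", "efficientdet", 0.80, 0.9),
]
chosen = greedy_route(profiles)
print(chosen.device, chosen.model)  # tpu-accelerator efficientdet
```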
♻ ☆ Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation ICCV 2025
Video Instance Segmentation (VIS) fundamentally struggles with pervasive
challenges including object occlusions, motion blur, and appearance variations
during temporal association. To overcome these limitations, this work
introduces geometric awareness to enhance VIS robustness by strategically
leveraging monocular depth estimation. We systematically investigate three
distinct integration paradigms. The Expanding Depth Channel (EDC) method
concatenates the depth map as an additional input channel to segmentation networks; Sharing
ViT (SV) designs a uniform ViT backbone, shared between depth estimation and
segmentation branches; Depth Supervision (DS) makes use of depth prediction as
an auxiliary training guide for feature learning. Though DS exhibits limited
effectiveness, benchmark evaluations demonstrate that EDC and SV significantly
enhance the robustness of VIS. With a Swin-L backbone, our EDC method achieves
56.2 AP, which sets a new state-of-the-art result on the OVIS benchmark. This work
conclusively establishes depth cues as critical enablers for robust video
understanding.
comment: Accepted by ICCV 2025 Workshop LSVOS
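A sketch of the Expanding Depth Channel idea under simple assumptions: the RGB image and a monocular depth map are concatenated into a 4-channel input, and the first convolution of the backbone is widened to accept the extra channel while reusing the pretrained RGB weights. The weight-initialization choice is an assumption, not the paper's recipe.
```python
# Sketch: widen a backbone's first conv to accept an extra depth channel.
import torch
import torch.nn as nn

def expand_first_conv(conv, extra_channels=1):
    """Return a copy of `conv` accepting extra input channels, reusing RGB weights."""
    new = nn.Conv2d(conv.in_channels + extra_channels, conv.out_channels,
                    conv.kernel_size, conv.stride, conv.padding,
                    bias=conv.bias is not None)
    with torch.no_grad():
        new.weight[:, :conv.in_channels] = conv.weight
        # Initialize the depth channel as the mean of the RGB filters (assumption).
        new.weight[:, conv.in_channels:] = conv.weight.mean(dim=1, keepdim=True)
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new

first = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
first_rgbd = expand_first_conv(first)
rgb, depth = torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224)
print(first_rgbd(torch.cat([rgb, depth], dim=1)).shape)  # torch.Size([1, 64, 112, 112])
```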
♻ ☆ Information-Bottleneck Driven Binary Neural Network for Change Detection ICCV 2025
In this paper, we propose Binarized Change Detection (BiCD), the first binary
neural network (BNN) designed specifically for change detection. Conventional
network binarization approaches, which directly quantize both weights and
activations in change detection models, severely limit the network's ability to
represent input data and distinguish between changed and unchanged regions.
This results in significantly lower detection accuracy compared to real-valued
networks. To overcome these challenges, BiCD enhances both the representational
power and feature separability of BNNs, improving detection performance.
Specifically, we introduce an auxiliary objective based on the Information
Bottleneck (IB) principle, guiding the encoder to retain essential input
information while promoting better feature discrimination. Since directly
computing mutual information under the IB principle is intractable, we design a
compact, learnable auxiliary module as an approximation target, leading to a
simple yet effective optimization strategy that minimizes both reconstruction
loss and standard change detection loss. Extensive experiments on street-view
and remote sensing datasets demonstrate that BiCD establishes a new benchmark
for BNN-based change detection, achieving state-of-the-art performance in this
domain.
comment: ICCV 2025 Accepted
♻ ☆ Re-boosting Self-Collaboration Parallel Prompt GAN for Unsupervised Image Restoration IEEE
Unsupervised restoration approaches based on generative adversarial networks
(GANs) offer a promising solution without requiring paired datasets. Yet, these
GAN-based approaches struggle to surpass the performance of conventional
unsupervised GAN-based frameworks without significantly modifying model
structures or increasing the computational complexity. To address these issues,
we propose a self-collaboration (SC) strategy for existing restoration models.
This strategy utilizes information from the previous stage as feedback to guide
subsequent stages, achieving significant performance improvement without
increasing the framework's inference complexity. The SC strategy comprises a
prompt learning (PL) module and a restorer ($Res$). It iteratively replaces the
previous less powerful fixed restorer $\overline{Res}$ in the PL module with a
more powerful $Res$. The enhanced PL module generates better
pseudo-degraded/clean image pairs, leading to a more powerful $Res$ for the
next iteration. Our SC can significantly improve the $Res$'s performance by
over 1.5 dB without adding extra parameters or computational complexity during
inference. Meanwhile, existing self-ensemble (SE) and our SC strategies enhance
the performance of pre-trained restorers from different perspectives. As SE
increases computational complexity during inference, we propose a re-boosting
module for SC (Reb-SC) to further improve the SC strategy by incorporating
SE into SC without increasing inference time. This approach further enhances
the restorer's performance by approximately 0.3 dB. Extensive experimental
results on restoration tasks demonstrate that the proposed model performs
favorably against existing state-of-the-art unsupervised restoration methods.
Source code and trained models are publicly available at:
https://github.com/linxin0/RSCP2GAN.
comment: Accepted in IEEE T-PAMI
♻ ☆ Pathfinder for Low-altitude Aircraft with Binary Neural Network
A prior global topological map (e.g., the OpenStreetMap, OSM) can boost the
performance of autonomous mapping by a ground mobile robot. However, the prior
map is usually incomplete because some paths lack labels. To solve
this problem, this paper proposes an OSM maker using airborne sensors carried
by low-altitude aircraft, where the core of the OSM maker is a novel efficient
pathfinder approach based on LiDAR and camera data, i.e., a binary dual-stream
road segmentation model. Specifically, a multi-scale feature extraction based
on the UNet architecture is implemented for images and point clouds. To reduce
the effect caused by the sparsity of the point cloud, an attention-guided gated
block is designed to integrate image and point-cloud features. To optimize the
model for edge deployment that significantly reduces storage footprint and
computational demands, we propose a binarization streamline to each model
component, including a variant of vision transformer (ViT) architecture as the
encoder of the image branch, and new focal and perception losses to optimize
the model training. The experimental results on two datasets demonstrate that
our pathfinder method achieves SOTA accuracy with high efficiency in finding
paths from the low-level airborne sensors, and we can create complete OSM prior
maps based on the segmented road skeletons. Code and data are available at:
\href{https://github.com/IMRL/Pathfinder}{https://github.com/IMRL/Pathfinder}.
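A generic 1-bit convolution sketch with a straight-through estimator, a standard BNN building block rather than the paper's exact binarization streamline: weights and activations are binarized to {-1, +1} in the forward pass while gradients flow through (clipped) in the backward pass.
```python
# Sketch: binarized convolution with a straight-through estimator (STE).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)                      # forward: hard {-1, 0, +1} sign

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()  # pass gradients only inside [-1, 1]

class BinaryConv2d(nn.Conv2d):
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)
        x_bin = BinarizeSTE.apply(x)
        return F.conv2d(x_bin, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

layer = BinaryConv2d(16, 32, kernel_size=3, padding=1)
x = torch.randn(2, 16, 32, 32, requires_grad=True)
layer(x).sum().backward()
print(x.grad.shape)  # torch.Size([2, 16, 32, 32])
```
Binarizing both weights and activations is what yields the storage and compute savings the abstract targets for edge deployment.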
♻ ☆ VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing
Diffusion-based image editing models have made remarkable progress in recent
years. However, achieving high-quality video editing remains a significant
challenge. One major hurdle is the absence of open-source, large-scale video
editing datasets based on real-world data, as constructing such datasets is
both time-consuming and costly. Moreover, video data requires a significantly
larger number of tokens for representation, which substantially increases the
training costs for video editing models. Lastly, current video editing models
offer limited interactivity, often making it difficult for users to express
their editing requirements effectively in a single attempt. To address these
challenges, this paper introduces a dataset VIVID-10M and a baseline model
VIVID. VIVID-10M is the first large-scale hybrid image-video local editing
dataset aimed at reducing data construction and model training costs, which
comprises 9.7M samples that encompass a wide range of video editing tasks.
VIVID is a Versatile and Interactive VIdeo local eDiting model trained on
VIVID-10M, which supports entity addition, modification, and deletion. At its
core, a keyframe-guided interactive video editing mechanism is proposed,
enabling users to iteratively edit keyframes and propagate the edits to other frames,
thereby reducing latency in achieving desired outcomes. Extensive experimental
evaluations show that our approach achieves state-of-the-art performance in
video local editing, surpassing baseline methods in both automated metrics and
user studies. The VIVID-10M dataset is open-sourced at
https://kwaivgi.github.io/VIVID/.
comment: 10 pages, 10 figures
♻ ☆ SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs
Humans can directly imagine and manipulate visual images in their minds, a
capability known as spatial visualization. While multi-modal Large Language
Models (MLLMs) support imagination-based reasoning, spatial visualization
remains insufficiently evaluated, typically embedded within broader
mathematical and logical assessments. Existing evaluations often rely on IQ
tests or math competitions that may overlap with training data, compromising
assessment reliability. To this end, we introduce SpatialViz-Bench, a
comprehensive multi-modal benchmark for spatial visualization with 12 tasks
across 4 sub-abilities, comprising 1,180 automatically generated problems. Our
evaluation of 33 state-of-the-art MLLMs not only reveals wide performance
variations and demonstrates the benchmark's strong discriminative power, but
also uncovers counter-intuitive findings: models exhibit unexpected behaviors
by showing difficulty perception that misaligns with human intuition,
displaying dramatic 2D-to-3D performance cliffs, and defaulting to formula
derivation despite spatial tasks requiring visualization alone. SpatialViz-Bench
empirically demonstrates that state-of-the-art MLLMs continue to exhibit
deficiencies in spatial visualization tasks, thereby addressing a significant
lacuna in the field. The benchmark is publicly available.
♻ ☆ AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment
Many video-to-audio (VTA) methods have been proposed for dubbing silent
AI-generated videos. An efficient quality assessment method for AI-generated
audio-visual content (AGAV) is crucial for ensuring audio-visual quality.
Existing audio-visual quality assessment methods struggle with unique
distortions in AGAVs, such as unrealistic and inconsistent elements. To address
this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment
dataset, comprising 3,382 AGAVs from 16 VTA methods. AGAVQA-3k includes two
subsets: AGAVQA-MOS, which provides multi-dimensional scores for audio quality,
content consistency, and overall quality, and AGAVQA-Pair, designed for optimal
AGAV pair selection. We further propose AGAV-Rater, a LMM-based model that can
score AGAVs, as well as audio and music generated from text, across multiple
dimensions, and selects the best AGAV generated by VTA methods to present to
the user. AGAV-Rater achieves state-of-the-art performance on AGAVQA-3k,
Text-to-Audio, and Text-to-Music datasets. Subjective tests also confirm that
AGAV-Rater enhances VTA performance and user experience. The dataset and code
are available at https://github.com/charlotte9524/AGAV-Rater.
♻ ☆ Deflickering Vision-Based Occupancy Networks through Lightweight Spatio-Temporal Correlation
Vision-based occupancy networks (VONs) provide an end-to-end solution for
reconstructing 3D environments in autonomous driving. However, existing methods
often suffer from temporal inconsistencies, manifesting as flickering effects
that compromise visual experience and adversely affect decision-making. While
recent approaches have incorporated historical data to mitigate the issue, they
often incur high computational costs and may introduce noisy information that
interferes with object detection. We propose OccLinker, a novel plugin
framework designed to seamlessly integrate with existing VONs for boosting
performance. Our method efficiently consolidates historical static and motion
cues, learns sparse latent correlations with current features through a dual
cross-attention mechanism, and produces correction occupancy components to
refine the base network's predictions. We propose a new temporal consistency
metric to quantitatively identify flickering effects. Extensive experiments on
two benchmark datasets demonstrate that our method delivers superior
performance with negligible computational overhead, while effectively
eliminating flickering artifacts.
♻ ☆ MG-Gen: Single Image to Motion Graphics Generation
We introduce MG-Gen, a framework that generates motion graphics directly from
a single raster image. MG-Gen decomposes the input image into layered
structures represented as HTML, generates animation scripts for each layer, and
then renders them into a video. Experiments confirm that MG-Gen generates dynamic
motion graphics while preserving text readability and fidelity to the input
conditions, whereas state-of-the-art image-to-video generation methods struggle
with them. The code is available at https://github.com/CyberAgentAILab/MG-GEN.
♻ ☆ DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering ICCV2025
Recent methods have shown that pre-trained diffusion models can be fine-tuned
to enable generative inverse rendering by learning image-conditioned
noise-to-intrinsic mapping. Despite their remarkable progress, they struggle to
robustly produce high-quality results as the noise-to-intrinsic paradigm
essentially utilizes noisy images with deteriorated structure and appearance
for intrinsic prediction, while it is common knowledge that structure and
appearance information in an image are crucial for inverse rendering. To
address this issue, we present DNF-Intrinsic, a robust yet efficient inverse
rendering approach fine-tuned from a pre-trained diffusion model, where we
propose to take the source image rather than Gaussian noise as input to
directly predict deterministic intrinsic properties via flow matching.
Moreover, we design a generative renderer to constrain that the predicted
intrinsic properties are physically faithful to the source image. Experiments
on both synthetic and real-world datasets show that our method clearly
outperforms existing state-of-the-art methods.
comment: Accepted to ICCV2025
♻ ☆ De-Fake: Style based Anomaly Deepfake Detection
Detecting deepfakes involving face-swaps presents a significant challenge,
particularly in real-world scenarios where anyone can perform face-swapping
with freely available tools and apps without any technical knowledge. Existing
deepfake detection methods rely on facial landmarks or inconsistencies in
pixel-level features and often struggle with face-swap deepfakes, where the
source face is seamlessly blended into the target image or video. The
prevalence of face-swap is evident in everyday life, where it is used to spread
false information, damage reputations, manipulate political opinions, create
non-consensual intimate deepfakes (NCID), and exploit children by enabling the
creation of child sexual abuse material (CSAM). Even prominent public figures
are not immune to its impact, with numerous deepfakes of them circulating
widely across social media platforms. Another challenge faced by deepfake
detection methods is the creation of datasets that encompass a wide range of
variations, as training models require substantial amounts of data. This raises
privacy concerns, particularly regarding the processing and storage of personal
facial data, which could lead to unauthorized access or misuse. Our key idea is
to identify style discrepancies introduced by face-swapping to detect face-swapped
images effectively without accessing the real facial image. We perform comprehensive evaluations
using multiple datasets and face-swapping methods, which showcases the
effectiveness of SafeVision in detecting face-swap deepfakes across diverse
scenarios. SafeVision offers a reliable and scalable solution for detecting
face-swaps in a privacy preserving manner, making it particularly effective in
challenging real-world applications. To the best of our knowledge, SafeVision
is the first deepfake detection using style features while providing inherent
privacy protection.
♻ ☆ Guided Neural Schrödinger bridge for Brain MR image synthesis with Limited Data
Hanyeol Yang, Sunggyu Kim, Mi Kyung Kim, Yongseon Yoo, Yu-Mi Kim, Min-Ho Shin, Insung Chung, Sang Baek Koh, Hyeon Chang Kim, Jong-Min Lee
Multi-modal brain MRI provides essential complementary information for
clinical diagnosis. However, acquiring all modalities in practice is often
constrained by time and cost. To address this, various methods have been
proposed to generate missing modalities from available ones. Traditional
approaches can be broadly categorized into two main types: paired and unpaired
methods. While paired methods for synthesizing missing modalities achieve high
accuracy, obtaining large-scale paired datasets is typically impractical. In
contrast, unpaired methods, though scalable, often fail to preserve critical
anatomical features, such as lesions. In this paper, we propose Fully Guided
Schr\"odinger Bridge (FGSB), a novel framework designed to overcome these
limitations by enabling high-fidelity generation with extremely limited paired
data. Furthermore, when provided with lesion-specific information such as
expert annotations, segmentation tools, or simple intensity thresholds for
critical regions, FGSB can generate missing modalities while preserving these
significant lesions with reduced data requirements. Our model comprises two
stages: 1) a Generation Phase, which iteratively refines synthetic images using a
paired target image and Gaussian noise; and 2) a Training Phase, which learns optimal transformation
pathways from source to target modality by mapping all intermediate states,
ensuring consistent and high-fidelity synthesis. Experimental results across
multiple datasets demonstrate that FGSB achieved performance comparable to
large-data-trained models, while using only two subjects. Incorporating
lesion-specific priors further improves the preservation of clinical features.
comment: Single column, 28 pages, 7 figures
♻ ☆ Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable
Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, Shouhong Ding
Existing detectors are often trained on biased datasets, leading to the
possibility of overfitting on non-causal image attributes that are spuriously
correlated with real/synthetic labels. While these biased features enhance
performance on the training data, they result in substantial performance
degradation when applied to unbiased datasets. One common solution is to
perform dataset alignment through generative reconstruction, matching the
semantic content between real and synthetic images. However, we revisit this
approach and show that pixel-level alignment alone is insufficient. The
reconstructed images still suffer from frequency-level misalignment, which can
perpetuate spurious correlations. To illustrate, we observe that reconstruction
models tend to restore the high-frequency details lost in real images (possibly
due to JPEG compression), inadvertently creating a frequency-level
misalignment, where synthetic images appear to have richer high-frequency
content than real ones. This misalignment leads to models associating
high-frequency features with synthetic labels, further reinforcing biased cues.
To resolve this, we propose Dual Data Alignment (DDA), which aligns both the
pixel and frequency domains. Moreover, we introduce two new test sets:
DDA-COCO, containing DDA-aligned synthetic images for testing detector
performance on the most aligned dataset, and EvalGEN, featuring the latest
generative models for assessing detectors under new generative architectures
such as visual auto-regressive generators. Finally, our extensive evaluations
demonstrate that a detector trained exclusively on DDA-aligned MSCOCO could
improve across 8 diverse benchmarks by a non-trivial margin, showing a +7.2% gain on
in-the-wild benchmarks, highlighting the improved generalizability of unbiased
detectors.
comment: 12 Pages, 9 figures
♻ ☆ PyVision: Agentic Vision with Dynamic Tooling
LLMs are increasingly deployed as agents, systems capable of planning,
reasoning, and dynamically calling external tools. However, in visual
reasoning, prior approaches largely remain limited by predefined workflows and
static toolsets. In this report, we present PyVision, an interactive,
multi-turn framework that enables MLLMs to autonomously generate, execute, and
refine Python-based tools tailored to the task at hand, unlocking flexible and
interpretable problem-solving. We develop a taxonomy of the tools created by
PyVision and analyze their usage across a diverse set of benchmarks.
Quantitatively, PyVision achieves consistent performance gains, boosting
GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini.
These results point to a broader shift: dynamic tooling allows models not just
to use tools, but to invent them, advancing toward more agentic visual
reasoning.
comment: 26 Pages, 10 Figures, Technical report
♻ ☆ Democratizing High-Fidelity Co-Speech Gesture Video Generation ICCV 2025
Co-speech gesture video generation aims to synthesize realistic,
audio-aligned videos of speakers, complete with synchronized facial expressions
and body gestures. This task presents challenges due to the significant
one-to-many mapping between audio and visual content, further complicated by
the scarcity of large-scale public datasets and high computational demands. We
propose a lightweight framework that utilizes 2D full-body skeletons as an
efficient auxiliary condition to bridge audio signals with visual outputs. Our
approach introduces a diffusion model conditioned on fine-grained audio
segments and a skeleton extracted from the speaker's reference image,
predicting skeletal motions through skeleton-audio feature fusion to ensure
strict audio coordination and body shape consistency. The generated skeletons
are then fed into an off-the-shelf human video generation model with the
speaker's reference image to synthesize high-fidelity videos. To democratize
research, we present CSG-405, the first public dataset with 405 hours of
high-resolution videos across 71 speech types, annotated with 2D skeletons and
diverse speaker demographics. Experiments show that our method exceeds
state-of-the-art approaches in visual quality and synchronization while
generalizing across speakers and contexts. Code, models, and CSG-405 are
publicly released at https://mpi-lab.github.io/Democratizing-CSG/
comment: ICCV 2025
♻ ☆ Adversarial Augmentation Training Makes Action Recognition Models More Robust to Realistic Video Distribution Shifts ICPR
Despite recent advances in video action recognition achieving strong
performance on existing benchmarks, these models often lack robustness when
faced with natural distribution shifts between training and test data. We
propose two novel evaluation methods to assess model resilience to such
distribution disparity. One method uses two different datasets collected from
different sources and uses one for training and validation, and the other for
testing. More precisely, we created dataset splits of HMDB-51 or UCF-101 for
training, and Kinetics-400 for testing, using the subset of classes that
overlap between the train and test datasets. The other proposed method
extracts the feature mean of each class from the target evaluation dataset's
training data (i.e. class prototype) and estimates test video prediction as a
cosine similarity score between each sample to the class prototypes of each
target class. This procedure does not alter model weights using the target
dataset and it does not require aligning overlapping classes of two different
datasets, thus is a very efficient method to test the model robustness to
distribution shifts without prior knowledge of the target distribution. We
address the robustness problem by adversarial augmentation training -
generating augmented views of videos that are "hard" for the classification
model by applying gradient ascent on the augmentation parameters - as well as
"curriculum" scheduling the strength of the video augmentations. We
experimentally demonstrate the superior performance of the proposed adversarial
augmentation approach over baselines across three state-of-the-art action
recognition models - TSM, Video Swin Transformer, and Uniformer. The presented
work provides critical insight into model robustness to distribution shifts and
presents effective techniques to enhance video action recognition performance
in a real-world deployment.
comment: Accepted to ICPRAI 2024
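A minimal sketch of the class-prototype evaluation protocol described above: prototypes are feature means computed on the target dataset's training split, and a test clip is assigned to the class whose prototype has the highest cosine similarity. Feature dimensions and the toy data are assumptions.
```python
# Sketch: build class prototypes and classify test clips by cosine similarity.
import torch
import torch.nn.functional as F

def build_prototypes(features, labels, num_classes):
    """features: (N, D) clip features; labels: (N,) class ids."""
    protos = torch.stack([features[labels == c].mean(dim=0) for c in range(num_classes)])
    return F.normalize(protos, dim=1)                        # (C, D) unit prototypes

def prototype_predict(test_features, prototypes):
    sims = F.normalize(test_features, dim=1) @ prototypes.T  # cosine similarities
    return sims.argmax(dim=1)

N, D, C = 100, 256, 10
train_feats = torch.randn(N, D)
train_labels = torch.arange(N) % C                           # every class represented
protos = build_prototypes(train_feats, train_labels, C)
print(prototype_predict(torch.randn(5, D), protos))          # 5 predicted class ids
```
No model weights are updated and no class alignment across datasets is needed, which is exactly why the abstract describes this as an efficient robustness probe.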
♻ ☆ Unraveling the Connections between Flow Matching and Diffusion Probabilistic Models in Training-free Conditional Generation
Training-free conditional generation based on flow matching aims to leverage
pre-trained unconditional flow matching models to perform conditional
generation without retraining. Recently, a successful training-free conditional
generation approach incorporates conditions via posterior sampling, which
relies on the availability of a score function in the unconditional diffusion
model. However, flow matching models do not possess an explicit score function,
rendering such a strategy inapplicable. Approximate posterior sampling for flow
matching has been explored, but it is limited to linear inverse problems. In
this paper, we propose Flow Matching-based Posterior Sampling (FMPS) to expand
its application scope. We introduce a correction term by steering the velocity
field. This correction term can be reformulated to incorporate a surrogate
score function, thereby bridging the gap between flow matching models and
score-based posterior sampling. Hence, FMPS enables the posterior sampling to
be adjusted within the flow matching framework. Further, we propose two
practical implementations of the correction mechanism: one aimed at improving
generation quality, and the other focused on computational efficiency.
Experimental results on diverse conditional generation tasks demonstrate that
our method achieves superior generation quality compared to existing
state-of-the-art approaches, validating the effectiveness and generality of
FMPS.
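A schematic, heavily simplified sketch of steering a flow-matching sampler with a correction term; it does not reproduce the paper's surrogate-score construction. Euler integration of the velocity field is combined with the gradient of a differentiable reward evaluated on the current state, with the guidance weight as an assumed hyperparameter.
```python
# Sketch: Euler sampling of a flow model with a reward-gradient correction term.
import torch

def guided_flow_sampling(velocity_fn, reward_fn, x0, n_steps=50, guidance=1.0):
    """velocity_fn(x, t) -> velocity; reward_fn(x) -> per-sample scalar log-reward."""
    x = x0.clone()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt)
        with torch.enable_grad():
            x_req = x.detach().requires_grad_(True)
            grad = torch.autograd.grad(reward_fn(x_req).sum(), x_req)[0]
        x = x + dt * (velocity_fn(x, t) + guidance * grad)   # corrected velocity update
    return x

# Toy example: a contracting "velocity" and a reward pulling samples toward zero.
velocity_fn = lambda x, t: -x
reward_fn = lambda x: -(x ** 2).sum(dim=1)
print(guided_flow_sampling(velocity_fn, reward_fn, torch.randn(4, 2)).norm(dim=1))
```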
♻ ☆ Video Individual Counting for Moving Drones ICCV 2025
Video Individual Counting (VIC) has received increasing attention for its
importance in intelligent video surveillance. Existing works are limited in two
aspects, i.e., dataset and method. Previous datasets are captured with fixed or
rarely moving cameras and contain relatively sparse individuals, restricting
evaluation under highly varying viewpoints and over time in crowded scenes. Existing
methods rely on localization followed by association or classification, which
struggle under dense and dynamic conditions due to inaccurate localization of
small targets. To address these issues, we introduce the MovingDroneCrowd
Dataset, featuring videos captured by fast-moving drones in crowded scenes
under diverse illuminations, shooting heights and angles. We further propose a
Shared Density map-guided Network (SDNet) using a Depth-wise Cross-Frame
Attention (DCFA) module to directly estimate shared density maps between
consecutive frames, from which the inflow and outflow density maps are derived
by subtracting the shared density maps from the global density maps. The inflow
density maps across frames are summed up to obtain the number of unique
pedestrians in a video. Experiments on our datasets and publicly available ones
show the superiority of our method over the state of the art in highly dynamic
and complex crowded scenes. Our dataset and codes have been released publicly.
comment: This work has been accepted to ICCV 2025
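A sketch of the counting rule described above, with assumed shapes: given global density maps D_t and shared density maps S_{t-1,t} between consecutive frames, the inflow at frame t is D_t - S_{t-1,t}, and the video-level count accumulates the first frame's count plus all subsequent inflows.
```python
# Sketch: accumulate unique-pedestrian count from global and shared density maps.
import numpy as np

def video_individual_count(global_density, shared_density):
    """global_density: (T, H, W); shared_density: (T-1, H, W) for pairs (t-1, t)."""
    total = global_density[0].sum()                      # people visible in frame 0
    for t in range(1, global_density.shape[0]):
        inflow = np.clip(global_density[t] - shared_density[t - 1], 0.0, None)
        total += inflow.sum()                            # count newcomers only
    return float(total)

T, H, W = 5, 32, 32
D = np.abs(np.random.rand(T, H, W)) * 0.01
S = np.minimum(D[:-1], D[1:]) * 0.8                      # toy shared density maps
print(video_individual_count(D, S))
```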
♻ ☆ A review of advancements in low-light image enhancement using deep learning
In low-light environments, the performance of computer vision algorithms
often deteriorates significantly, adversely affecting key vision tasks such as
segmentation, detection, and classification. With the rapid advancement of deep
learning, its application to low-light image processing has attracted
widespread attention and seen significant progress in recent years. However,
there remains a lack of comprehensive surveys that systematically examine how
recent deep-learning-based low-light image enhancement methods function and
evaluate their effectiveness in enhancing downstream vision tasks. To address
this gap, this review provides detailed elaboration on how various recent
approaches (from 2020) operate and their enhancement mechanisms, supplemented
with clear illustrations. It also investigates the impact of different
enhancement techniques on subsequent vision tasks, critically analyzing their
strengths and limitations. Our review found that image enhancement improved the
performance of downstream vision tasks to varying degrees. Although supervised
methods often produced images with high perceptual quality, they typically
produced modest improvements in vision tasks. In contrast, zero-shot learning,
despite achieving lower scores in image quality metrics, showed consistently
boosted performance across various vision tasks. These findings suggest a disconnect
between image quality metrics and those evaluating vision task performance.
Additionally, unsupervised domain adaptation techniques demonstrated
significant gains in segmentation tasks, highlighting their potential in
practical low-light scenarios where labelled data is scarce. Observed
limitations of existing studies are analyzed, and directions for future
research are proposed. This review serves as a useful reference for determining
low-light image enhancement techniques and optimizing vision task performance
in low-light conditions.
♻ ☆ RealKeyMorph: Keypoints in Real-world Coordinates for Resolution-agnostic Image Registration
Many real-world settings require registration of a pair of medical images
that differ in spatial resolution, which may arise from differences in image
acquisition parameters like pixel spacing, slice thickness, and field-of-view.
However, all previous machine learning-based registration techniques resample
images onto a fixed resolution. This is suboptimal because resampling can
introduce artifacts due to interpolation. To address this, we present
RealKeyMorph (RKM), a resolution-agnostic method for image registration. RKM is
an extension of KeyMorph, a registration framework which works by training a
network to learn corresponding keypoints for a given pair of images, after
which a closed-form keypoint matching step is used to derive the transformation
that aligns them. To avoid resampling and enable operating on the raw data, RKM
outputs keypoints in real-world coordinates of the scanner. To do this, we
leverage the affine matrix produced by the scanner (e.g., MRI machine) that
encodes the mapping from voxel coordinates to real world coordinates. By
transforming keypoints into real-world space and integrating this into the
training process, RKM effectively enables the extracted keypoints to be
resolution-agnostic. In our experiments, we demonstrate the advantages of RKM
on the registration task for orthogonal 2D stacks of abdominal MRIs, as well as
3D volumes with varying resolutions in brain datasets.
comment: 23 pages, 8 figures
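A minimal sketch of the coordinate mapping at the heart of the approach: predicted voxel-space keypoints are mapped into real-world scanner coordinates with the image's affine matrix (as stored by common MRI formats), so keypoint matching can operate without resampling. The example affine values are illustrative.
```python
# Sketch: map voxel-space keypoints to real-world coordinates via the scanner affine.
import numpy as np

def voxel_to_world(keypoints_vox, affine):
    """keypoints_vox: (K, 3) voxel indices; affine: (4, 4) voxel-to-world matrix."""
    homo = np.concatenate([keypoints_vox, np.ones((keypoints_vox.shape[0], 1))], axis=1)
    return (affine @ homo.T).T[:, :3]                    # (K, 3) world-space points

# Example affine: 1.5 mm in-plane spacing, 3 mm slice thickness, shifted origin.
affine = np.diag([1.5, 1.5, 3.0, 1.0])
affine[:3, 3] = [-90.0, -120.0, -60.0]
kps = np.array([[10, 20, 5], [64, 64, 12]], dtype=np.float64)
print(voxel_to_world(kps, affine))
```
Because both images' keypoints end up in the same millimetre-scale world frame, the closed-form matching step can align scans with different spacings directly on the raw data.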
♻ ☆ HANDI: Hand-Centric Text-and-Image Conditioned Video Generation
Despite the recent strides in video generation, state-of-the-art methods
still struggle with elements of visual detail. One particularly challenging
case is the class of videos in which intricate hand motion, set against a mostly
stable yet distracting environment, must convey the execution of some complex
action and its effects. To address these
challenges, we introduce a new method for video generation that focuses on
hand-centric actions. Our diffusion-based method incorporates two distinct
innovations. First, we propose an automatic method to generate the motion area
-- the region in the video in which the detailed activities occur -- guided by
both the visual context and the action text prompt, rather than assuming this
region can be provided manually as is now commonplace. Second, we introduce a
critical Hand Refinement Loss to guide the diffusion model to focus on smooth
and consistent hand poses. We evaluate our method on challenging augmented
datasets based on EpicKitchens and Ego4D, demonstrating significant
improvements over state-of-the-art methods in terms of action clarity,
especially of the hand motion in the target region, across diverse environments
and actions. Video results can be found in
https://excitedbutter.github.io/project_page
comment: 16 pages, 7 figures and 4 tables
♻ ☆ LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents ICCV 2025
Existing MLLMs encounter significant challenges in modeling the temporal
context within long videos. Currently, mainstream Agent-based methods use
external tools to assist a single MLLM in answering long video questions.
Despite such tool-based support, a solitary MLLM still offers only a partial
understanding of long videos, resulting in limited performance. In order to
better address long video tasks, we introduce LVAgent, the first framework
enabling multi-round dynamic collaboration of MLLM agents in long video
understanding. Our method consists of four key steps: 1) Selection: We
pre-select appropriate agents from the model library to form optimal agent
teams based on different tasks. 2) Perception: We design an effective retrieval
scheme for long videos to improve the coverage of critical temporal segments
while maintaining computational efficiency. 3) Action: Agents answer long video
questions and exchange reasons. 4) Reflection: We evaluate each agent's
performance in each round of discussion and optimize the agent team for dynamic
collaboration. The agents iteratively refine their answers by multi-round
dynamical collaboration of MLLM agents. LVAgent is the first agent system
method that outperforms all closed-source models (like GPT-4o) and open-source
models (like InternVL-2.5 and Qwen2-VL) in the long video understanding tasks.
Our LVAgent achieves an accuracy of 80% on four mainstream long video
understanding tasks. Notably, LVAgent improves accuracy by 13.3% on
LongVideoBench. Code is available at https://github.com/64327069/LVAgent.
comment: accepted in ICCV 2025
♻ ☆ CRISP-SAM2: SAM2 with Cross-Modal Interaction and Semantic Prompting for Multi-Organ Segmentation
Xinlei Yu, Changmiao Wang, Hui Jin, Ahmed Elazab, Gangyong Jia, Xiang Wan, Changqing Zou, Ruiquan Ge
Multi-organ medical segmentation is a crucial component of medical image
processing, essential for doctors to make accurate diagnoses and develop
effective treatment plans. Despite significant progress in this field, current
multi-organ segmentation models often suffer from inaccurate details,
dependence on geometric prompts and loss of spatial information. Addressing
these challenges, we introduce a novel model named CRISP-SAM2 with CRoss-modal
Interaction and Semantic Prompting based on SAM2. This model represents a
promising approach to multi-organ medical segmentation guided by textual
descriptions of organs. Our method begins by converting visual and textual
inputs into cross-modal contextualized semantics using a progressive
cross-attention interaction mechanism. These semantics are then injected into
the image encoder to enhance the detailed understanding of visual information.
To eliminate reliance on geometric prompts, we use a semantic prompting
strategy, replacing the original prompt encoder to sharpen the perception of
challenging targets. In addition, a similarity-sorting self-updating strategy
for memory and a mask-refining process are applied to further adapt to medical
imaging and enhance localized details. Comparative experiments conducted on
seven public datasets indicate that CRISP-SAM2 outperforms existing models.
Extensive analysis also demonstrates the effectiveness of our method, thereby
confirming its superior performance, especially in addressing the limitations
mentioned earlier. Our code is available at:
https://github.com/YU-deep/CRISP_SAM2.git.
comment: Accepted By ACMMM25
♻ ☆ CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering
This paper proposes a neural rendering approach that represents a scene as
"compressed light-field tokens (CLiFTs)", retaining rich appearance and
geometric information of a scene. CLiFT enables compute-efficient rendering by
compressed tokens, while being capable of changing the number of tokens to
represent a scene or render a novel view with one trained network. Concretely,
given a set of images, a multi-view encoder tokenizes the images with the camera
poses. Latent-space K-means selects a reduced set of rays as cluster centroids
using the tokens. The multi-view "condenser" compresses the information of
all the tokens into the centroid tokens to construct CLiFTs. At test time,
given a target view and a compute budget (i.e., the number of CLiFTs), the
system collects the specified number of nearby tokens and synthesizes a novel
view using a compute-adaptive renderer. Extensive experiments on RealEstate10K
and DL3DV datasets quantitatively and qualitatively validate our approach,
achieving significant data reduction with comparable rendering quality and the
highest overall rendering score, while providing trade-offs of data size,
rendering quality, and rendering speed.
comment: Project page: https://clift-nvs.github.io
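A small NumPy sketch of the latent-space K-means step described above, which keeps a budgeted number of centroid tokens out of the full multi-view token set. The plain per-cluster mean here is only a stand-in for the learned condenser network, and the token shapes are illustrative.

    import numpy as np

    def compress_to_clifts(tokens: np.ndarray, num_clifts: int, iters: int = 10):
        """tokens: (N, D) latent tokens from the multi-view encoder."""
        rng = np.random.default_rng(0)
        centroids = tokens[rng.choice(len(tokens), num_clifts, replace=False)].copy()
        for _ in range(iters):
            # Assign each token to its nearest centroid (squared L2 in latent space).
            dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(axis=1)
            # "Condense" each cluster into its centroid token; a mean stands in
            # for the learned condenser that aggregates cluster members.
            for k in range(num_clifts):
                members = tokens[assign == k]
                if len(members):
                    centroids[k] = members.mean(axis=0)
        return centroids  # (num_clifts, D): the compressed light-field tokens

    tokens = np.random.randn(1024, 128).astype(np.float32)
    clifts = compress_to_clifts(tokens, num_clifts=64)  # compute budget of 64 tokens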
♻ ☆ Frenet-Serret Frame-based Decomposition for Part Segmentation of 3D Curvilinear Structures
Leslie Gu, Jason Ken Adhinarta, Mikhail Bessmeltsev, Jiancheng Yang, Yongjie Jessica Zhang, Wenjie Yin, Daniel Berger, Jeff Lichtman, Hanspeter Pfister, Donglai Wei
Accurately segmenting 3D curvilinear structures in medical imaging remains
challenging due to their complex geometry and the scarcity of diverse,
large-scale datasets for algorithm development and evaluation. In this paper,
we use dendritic spine segmentation as a case study and address these
challenges by introducing a novel Frenet--Serret Frame-based Decomposition,
which decomposes 3D curvilinear structures into a globally \( C^2 \) continuous
curve that captures the overall shape, and a cylindrical primitive that encodes
local geometric properties. This approach leverages Frenet--Serret Frames and
arc length parameterization to preserve essential geometric features while
reducing representational complexity, facilitating data-efficient learning,
improved segmentation accuracy, and generalization on 3D curvilinear
structures. To rigorously evaluate our method, we introduce two datasets:
CurviSeg, a synthetic dataset for 3D curvilinear structure segmentation that
validates our method's key properties, and DenSpineEM, a benchmark for
dendritic spine segmentation, which comprises 4,476 manually annotated spines
from 70 dendrites across three public electron microscopy datasets, covering
multiple brain regions and species. Our experiments on DenSpineEM demonstrate
exceptional cross-region and cross-species generalization: models trained on
the mouse somatosensory cortex subset achieve 91.9\% Dice, maintaining strong
performance in zero-shot segmentation on both mouse visual cortex (94.1\% Dice)
and human frontal lobe (81.8\% Dice) subsets. Moreover, we test the
generalizability of our method on the IntrA dataset, where it achieves 77.08\%
Dice (5.29\% higher than prior art) on intracranial aneurysm segmentation.
These findings demonstrate the potential of our approach for accurately
analyzing complex curvilinear structures across diverse medical imaging fields.
comment: 10 pages, 4 figures
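A brief NumPy sketch of discrete Frenet--Serret frames (tangent, normal, binormal) along a sampled 3D centerline, the kind of local frame the decomposition above builds on. The paper fits a globally C^2, arc-length-parameterized curve; this finite-difference version is only an approximation for illustration.

    import numpy as np

    def frenet_frames(curve: np.ndarray, eps: float = 1e-8):
        """curve: (N, 3) points along a centerline. Returns unit T, N, B, each (N, 3)."""
        d1 = np.gradient(curve, axis=0)                # first derivative (tangent direction)
        T = d1 / (np.linalg.norm(d1, axis=1, keepdims=True) + eps)
        dT = np.gradient(T, axis=0)                    # derivative of the unit tangent
        N = dT / (np.linalg.norm(dT, axis=1, keepdims=True) + eps)
        B = np.cross(T, N)                             # binormal completes the frame
        return T, N, B

    # Example: a helix sampled at 200 points.
    t = np.linspace(0, 4 * np.pi, 200)
    helix = np.stack([np.cos(t), np.sin(t), 0.2 * t], axis=1)
    T, N, B = frenet_frames(helix)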
♻ ☆ A General Framework for Inference-time Scaling and Steering of Diffusion Models
Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, Rajesh Ranganath
Diffusion models produce impressive results in modalities ranging from images
and video to protein design and text. However, generating samples with
user-specified properties remains a challenge. Recent research proposes
fine-tuning models to maximize rewards that capture desired properties, but
these methods require expensive training and are prone to mode collapse. In
this work, we present Feynman-Kac (FK) steering, an inference-time framework
for steering diffusion models with reward functions. FK steering works by
sampling a system of multiple interacting diffusion processes, called
particles, and resampling particles at intermediate steps based on scores
computed using functions called potentials. Potentials are defined using
rewards for intermediate states and are selected such that a high value
indicates that the particle will yield a high-reward sample. We explore various
choices of potentials, intermediate rewards, and samplers. We evaluate FK
steering on text-to-image and text diffusion models. For steering text-to-image
models with a human preference reward, we find that FK steering a 0.8B
parameter model outperforms a 2.6B parameter fine-tuned model on prompt
fidelity, with faster sampling and no training. For steering text diffusion
models with rewards for text quality and specific text attributes, we find that
FK steering generates lower-perplexity, more linguistically acceptable outputs
and enables gradient-free control of attributes such as toxicity. Our results
demonstrate that inference-time scaling and steering of diffusion models - even
with off-the-shelf rewards - can provide significant sample quality gains and
controllability benefits. Code is available at
https://github.com/zacharyhorvitz/Fk-Diffusion-Steering.
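A compact sketch of the particle-resampling loop behind FK steering, using one simple choice of potential (an exponentiated intermediate reward). Here denoise_step and intermediate_reward are hypothetical stand-ins for a reverse-diffusion sampler step and a reward on intermediate states, and returning the highest-reward final particle is an assumption.

    import numpy as np

    def fk_steering(denoise_step, intermediate_reward, x_T, num_steps, lam=1.0, seed=0):
        """x_T: (num_particles, ...) array of initial noise; returns one final sample."""
        rng = np.random.default_rng(seed)
        particles = np.asarray(x_T).copy()
        for t in reversed(range(num_steps)):
            # Advance every particle one reverse-diffusion step.
            particles = np.stack([denoise_step(x, t) for x in particles])
            # Potentials: exponentiated intermediate rewards (shifted for stability).
            rewards = np.array([intermediate_reward(x, t) for x in particles])
            weights = np.exp(lam * (rewards - rewards.max()))
            weights /= weights.sum()
            # Resample: high-potential particles are duplicated, low ones dropped.
            idx = rng.choice(len(particles), size=len(particles), p=weights)
            particles = particles[idx]
        # Return the final particle with the highest reward.
        final_rewards = np.array([intermediate_reward(x, 0) for x in particles])
        return particles[final_rewards.argmax()]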
♻ ☆ Concept Steerers: Leveraging K-Sparse Autoencoders for Test-Time Controllable Generations
Despite the remarkable progress in text-to-image generative models, they are
prone to adversarial attacks and inadvertently generate unsafe, unethical
content. Existing approaches often rely on fine-tuning models to remove
specific concepts, which is computationally expensive, lacks scalability,
and/or compromises generation quality. In this work, we propose a novel
framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and
interpretable concept manipulation in diffusion models. Specifically, we first
identify interpretable monosemantic concepts in the latent space of text
embeddings and leverage them to precisely steer the generation away or towards
a given concept (e.g., nudity) or to introduce a new concept (e.g.,
photographic style) -- all during test time. Through extensive experiments, we
demonstrate that our approach is very simple, requires neither retraining of
the base model nor LoRA adapters, does not compromise generation quality, and
is robust to adversarial prompt manipulations. Our method yields an improvement
of $\mathbf{20.01\%}$ in unsafe concept removal, is effective in style
manipulation, and is $\mathbf{\sim5}$x faster than the current
state-of-the-art. Code is available at: https://github.com/kim-dahye/steerers
comment: 23 pages, 18 figures
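A minimal sketch of k-sparse-autoencoder concept steering at test time: encode a text embedding with a top-k SAE, shift a single latent, and decode back. The random weights, dictionary size, and concept index below are placeholders; in the described method the SAE is trained and the monosemantic concept latent is identified beforehand.

    import torch

    def topk_sae_encode(x, W_enc, b_enc, k=32):
        """Keep only the k largest pre-activations (the k-sparse constraint)."""
        pre = x @ W_enc + b_enc
        vals, idx = pre.topk(k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, idx, vals.clamp(min=0))

    def steer(text_emb, W_enc, b_enc, W_dec, b_dec, concept_idx, alpha=-3.0, k=32):
        z = topk_sae_encode(text_emb, W_enc, b_enc, k)
        z[..., concept_idx] += alpha        # push away from (alpha<0) or toward a concept
        return z @ W_dec + b_dec            # decoded, steered text embedding

    d, m = 768, 8192                        # embedding dim, dictionary size (assumed)
    W_enc, b_enc = torch.randn(d, m) * 0.02, torch.zeros(m)
    W_dec, b_dec = torch.randn(m, d) * 0.02, torch.zeros(d)
    steered = steer(torch.randn(1, 77, d), W_enc, b_enc, W_dec, b_dec, concept_idx=123)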