Computer Vision and Pattern Recognition 126
☆ FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li
The advancement of open-source text-to-image (T2I) models has been hindered
by the absence of large-scale, reasoning-focused datasets and comprehensive
evaluation benchmarks, resulting in a performance gap compared to leading
closed-source systems. To address this challenge, we introduce FLUX-Reason-6M
and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark).
FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality
FLUX-generated images and 20 million bilingual (English and Chinese)
descriptions specifically designed to teach complex reasoning. The images are
organized according to six key characteristics: Imagination, Entity, Text
rendering, Style, Affection, and Composition, and include explicit Generation
Chain-of-Thought (GCoT) descriptions that provide detailed breakdowns of image
generation steps. The entire data curation process took 15,000 A100 GPU days,
providing the
community with a resource previously unattainable outside of large industrial
labs. PRISM-Bench offers a novel evaluation standard with seven distinct
tracks, including a formidable Long Text challenge using GCoT. Through
carefully designed prompts, it utilizes advanced vision-language models for
nuanced human-aligned assessment of prompt-image alignment and image
aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench
reveals critical performance gaps and highlights specific areas requiring
improvement. Our dataset, benchmark, and evaluation code are released to
catalyze the next wave of reasoning-oriented T2I generation. Project page:
https://flux-reason-6m.github.io/ .
comment: Project page: https://flux-reason-6m.github.io/
☆ SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiaoxiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, Yao Yao
Significant progress has been made in spatial intelligence, spanning both
spatial reconstruction and world exploration. However, the scalability and
real-world fidelity of current models remain severely constrained by the
scarcity of large-scale, high-quality training data. While several datasets
provide camera pose information, they are typically limited in scale,
diversity, and annotation richness, particularly for real-world dynamic scenes
with ground-truth camera motion. To this end, we collect \textbf{SpatialVID}, a
dataset consisting of a large corpus of in-the-wild videos with diverse scenes,
camera movements and dense 3D annotations such as per-frame camera poses,
depth, and motion instructions. Specifically, we collect more than 21,000 hours
of raw video, and process them into 2.7 million clips through a hierarchical
filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent
annotation pipeline enriches these clips with detailed spatial and semantic
information, including camera poses, depth maps, dynamic masks, structured
captions, and serialized motion instructions. Analysis of SpatialVID's data
statistics reveals a richness and diversity that directly foster improved model
generalization and performance, establishing it as a key asset for the video
and 3D vision research community.
comment: Project page: https://nju-3dv.github.io/projects/SpatialVID/
☆ Locality in Image Diffusion Models Emerges from Data Statistics
Among generative models, diffusion models are uniquely intriguing due to the
existence of a closed-form optimal minimizer of their training objective, often
referred to as the optimal denoiser. However, diffusion using this optimal
denoiser merely reproduces images in the training set and hence fails to
capture the behavior of deep diffusion models. Recent work has attempted to
characterize this gap between the optimal denoiser and deep diffusion models,
proposing analytical, training-free models that can generate images that
resemble those generated by a trained UNet. The best-performing method
hypothesizes that shift equivariance and locality inductive biases of
convolutional neural networks are the cause of the performance gap, hence
incorporating these assumptions into its analytical model. In this work, we
present evidence that the locality in deep diffusion models emerges as a
statistical property of the image dataset, not due to the inductive bias of
convolutional neural networks. Specifically, we demonstrate that an optimal
parametric linear denoiser exhibits similar locality properties to the deep
neural denoisers. We further show, both theoretically and experimentally, that
this locality arises directly from the pixel correlations present in natural
image datasets. Finally, we use these insights to craft an analytical denoiser
that better matches scores predicted by a deep diffusion model than the prior
expert-crafted alternative.
comment: 30 pages, 18 figures, 6 tables
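For readers unfamiliar with the closed-form optimal denoiser mentioned above: for a finite training set {x_i}_{i=1}^N and forward process x_t = \alpha_t x_0 + \sigma_t \epsilon, the minimizer of the denoising objective is the posterior mean, a softmax-weighted average of the training images (a standard result; the notation here is generic and not taken from the paper):

\[
\hat{x}_0^{*}(x_t,t) \;=\; \mathbb{E}[x_0 \mid x_t]
\;=\; \frac{\sum_{i=1}^{N} x_i \,\exp\!\big(-\lVert x_t - \alpha_t x_i\rVert^2 / 2\sigma_t^2\big)}
           {\sum_{j=1}^{N} \exp\!\big(-\lVert x_t - \alpha_t x_j\rVert^2 / 2\sigma_t^2\big)}.
\]

Because this weighting collapses onto the nearest training images at low noise levels, sampling with $\hat{x}_0^{*}$ merely reproduces the training set, which is exactly the gap between the optimal denoiser and trained deep denoisers that the paper analyzes.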
☆ Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration
Hand-object motion-capture (MoCap) repositories offer large-scale,
contact-rich demonstrations and hold promise for scaling dexterous robotic
manipulation. Yet demonstration inaccuracies and embodiment gaps between human
and robot hands limit the straightforward use of these data. Existing methods
adopt a three-stage workflow, including retargeting, tracking, and residual
correction, which often leaves demonstrations underused and compounds errors
across stages. We introduce Dexplore, a unified single-loop optimization that
jointly performs retargeting and tracking to learn robot control policies
directly from MoCap at scale. Rather than treating demonstrations as ground
truth, we use them as soft guidance. From raw trajectories, we derive adaptive
spatial scopes, and train with reinforcement learning to keep the policy
in-scope while minimizing control effort and accomplishing the task. This
unified formulation preserves demonstration intent, enables robot-specific
strategies to emerge, improves robustness to noise, and scales to large
demonstration corpora. We distill the scaled tracking policy into a
vision-based, skill-conditioned generative controller that encodes diverse
manipulation skills in a rich latent representation, supporting generalization
across objects and real-world deployment. Taken together, these contributions
position Dexplore as a principled bridge that transforms imperfect
demonstrations into effective training signals for dexterous manipulation.
comment: CoRL 2025
☆ Geometric Neural Distance Fields for Learning Human Motion Priors
We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative
human motion prior that enables robust, temporally consistent, and physically
plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods,
our higher-order motion prior explicitly models the human motion in the zero
level set of a collection of neural distance fields (NDFs) corresponding to
pose, transition (velocity), and acceleration dynamics. Our framework is
rigorous in the sense that our NDFs are constructed on the product space of
joint rotations, their angular velocities, and angular accelerations,
respecting the geometry of the underlying articulations. We further introduce:
(i) a novel adaptive-step hybrid algorithm for projecting onto the set of
plausible motions, and (ii) a novel geometric integrator to "roll out"
realistic motion trajectories during test-time-optimization and generation. Our
experiments show significant and consistent gains: trained on the AMASS
dataset, NRMF remarkably generalizes across multiple input modalities and to
diverse tasks ranging from denoising to motion in-betweening and fitting to
partial 2D / 3D observations.
comment: 8 pages
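As a generic illustration of projecting onto the zero level set of a learned distance field $f_\theta$ (a simple Newton-style step, not the paper's adaptive-step hybrid algorithm, and ignoring the Riemannian product-space structure):

\[
q \;\leftarrow\; q \;-\; f_\theta(q)\,\frac{\nabla_q f_\theta(q)}{\lVert \nabla_q f_\theta(q) \rVert^2},
\]

iterated until $|f_\theta(q)|$ falls below a tolerance; per the abstract, the paper's projection instead operates jointly over the pose, transition, and acceleration fields with adaptive step sizes.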
☆ Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
In this paper, we introduce an insightful paradigm through the Auto-Encoder
lens: understanding as the encoder (I2T) that compresses images into text, and
generation as the decoder (T2I) that reconstructs images from that text. Using
reconstruction fidelity as the unified training objective, we enforce the
coherent bidirectional information flow between the understanding and
generation processes, bringing mutual gains. To implement this, we propose UAE,
a novel framework for unified multimodal learning. We begin by pre-training the
decoder with large-scale long-context image captions to capture fine-grained
semantic and complex spatial relationships. We then propose Unified-GRPO via
reinforcement learning (RL), which covers three stages: (1) A cold-start phase
to gently initialize both encoder and decoder with a semantic reconstruction
loss; (2) Generation for Understanding, where the encoder is trained to
generate informative captions that maximize the decoder's reconstruction
quality, enhancing its visual understanding; (3) Understanding for Generation,
where the decoder is refined to reconstruct from these captions, forcing it to
leverage every detail and improving its long-context instruction following and
generation fidelity. For evaluation, we introduce Unified-Bench, the first
benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A
surprising "aha moment" arises within the multimodal learning domain: as RL
progresses, the encoder autonomously produces more descriptive captions, while
the decoder simultaneously demonstrates a profound ability to understand these
intricate descriptions, resulting in reconstructions of striking fidelity.
☆ Measuring Epistemic Humility in Multimodal Large Language Models
Hallucinations in multimodal large language models (MLLMs) -- where the model
generates content inconsistent with the input image -- pose significant risks
in real-world applications, from misinformation in visual question answering to
unsafe errors in decision-making. Existing benchmarks primarily test
recognition accuracy, i.e., evaluating whether models can select the correct
answer among distractors. This overlooks an equally critical capability for
trustworthy AI: recognizing when none of the provided options are correct, a
behavior reflecting epistemic humility. We present HumbleBench, a new
hallucination benchmark designed to evaluate MLLMs' ability to reject plausible
but incorrect answers across three hallucination types: object, relation, and
attribute. Built from a panoptic scene graph dataset, we leverage fine-grained
scene graph annotations to extract ground-truth entities and relations, and
prompt GPT-4-Turbo to generate multiple-choice questions, followed by a
rigorous manual filtering process. Each question includes a "None of the above"
option, requiring models not only to recognize correct visual information but
also to identify when no provided answer is valid. We evaluate a variety of
state-of-the-art MLLMs -- including both general-purpose and specialized
reasoning models -- on HumbleBench and share valuable findings and insights
with the community. By incorporating explicit false-option rejection,
HumbleBench fills a key gap in current evaluation suites, providing a more
realistic measure of MLLM reliability in safety-critical settings. Our code and
dataset are released publicly and can be accessed at
https://github.com/maifoundations/HumbleBench.
☆ DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech
Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that
mimics the voice of an unseen speaker using only a short reference sample,
requiring not only speaker adaptation but also accurate modeling of prosodic
attributes. Recent approaches based on language models, diffusion, and flow
matching have shown promising results in zero-shot TTS, but still suffer from
slow inference and repetition artifacts. Discrete codec representations have
been widely adopted for speech synthesis, and recent works have begun to
explore diffusion models in purely discrete settings, suggesting the potential
of discrete generative modeling for speech synthesis. However, existing
flow-matching methods typically embed these discrete tokens into a continuous
space and apply continuous flow matching, which may not fully leverage the
advantages of discrete representations. To address these challenges, we
introduce DiFlow-TTS, which, to the best of our knowledge, is the first model
to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS
explicitly models factorized speech attributes within a compact and unified
architecture. It leverages in-context learning by conditioning on textual
content, along with prosodic and acoustic attributes extracted from a reference
speech, enabling effective attribute cloning in a zero-shot setting. In
addition, the model employs a factorized flow prediction mechanism with
distinct heads for prosody and acoustic details, allowing it to learn
aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS
achieves promising performance in several key metrics, including naturalness,
prosody, preservation of speaker style, and energy control. It also maintains a
compact model size and achieves low-latency inference, generating speech up to
25.8 times faster than the latest existing baselines.
☆ Mechanistic Learning with Guided Diffusion Models to Predict Spatio-Temporal Brain Tumor Growth
Daria Laslo, Efthymios Georgiou, Marius George Linguraru, Andreas Rauschecker, Sabine Muller, Catherine R. Jutzeler, Sarah Bruningk
Predicting the spatio-temporal progression of brain tumors is essential for
guiding clinical decisions in neuro-oncology. We propose a hybrid mechanistic
learning framework that combines a mathematical tumor growth model with a
guided denoising diffusion implicit model (DDIM) to synthesize anatomically
feasible future MRIs from preceding scans. The mechanistic model, formulated as
a system of ordinary differential equations, captures temporal tumor dynamics
including radiotherapy effects and estimates future tumor burden. These
estimates condition a gradient-guided DDIM, enabling image synthesis that
aligns with both predicted growth and patient anatomy. We train our model on
the BraTS adult and pediatric glioma datasets and evaluate on 60 axial slices
of in-house longitudinal pediatric diffuse midline glioma (DMG) cases. Our
framework generates realistic follow-up scans, as measured by spatial similarity
metrics. It also introduces tumor growth probability maps, which capture both
the clinically relevant extent and the directionality of tumor growth, as
quantified by the 95th percentile Hausdorff Distance. The method enables
biologically informed image generation in data-limited scenarios, offering
generative spatio-temporal predictions that account for mechanistic priors.
comment: 13 pages, 4 figures
☆ Graph Alignment via Dual-Pass Spectral Encoding and Latent Space Communication
Graph alignment, the problem of identifying corresponding nodes across
multiple graphs, is fundamental to numerous applications. Most existing
unsupervised methods embed node features into latent representations to enable
cross-graph comparison without ground-truth correspondences. However, these
methods suffer from two critical limitations: the degradation of node
distinctiveness due to oversmoothing in GNN-based embeddings, and the
misalignment of latent spaces across graphs caused by structural noise, feature
heterogeneity, and training instability, ultimately leading to unreliable node
correspondences. We propose a novel graph alignment framework that
simultaneously enhances node distinctiveness and enforces geometric consistency
across latent spaces. Our approach introduces a dual-pass encoder that combines
low-pass and high-pass spectral filters to generate embeddings that are both
structure-aware and highly discriminative. To address latent space
misalignment, we incorporate a geometry-aware functional map module that learns
bijective and isometric transformations between graph embeddings, ensuring
consistent geometric relationships across different representations. Extensive
experiments on graph benchmarks demonstrate that our method consistently
outperforms existing unsupervised alignment baselines, exhibiting superior
robustness to structural inconsistencies and challenging alignment scenarios.
Additionally, comprehensive evaluation on vision-language benchmarks using
diverse pretrained models shows that our framework effectively generalizes
beyond graph domains, enabling unsupervised alignment of vision and language
representations.
comment: 23 pages
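A minimal sketch of the dual-pass idea described above, low-pass smoothing versus high-pass sharpening of node features, written with a generic normalized adjacency; the actual encoder architecture, filters, and training objective in the paper will differ:

import numpy as np

def normalized_adjacency(A):
    # Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def dual_pass_embeddings(A, X, k=2):
    """Illustrative low-pass / high-pass spectral filtering (hypothetical sketch)."""
    S = normalized_adjacency(A)          # low-pass operator (smooths features)
    L = np.eye(A.shape[0]) - S           # high-pass operator (normalized Laplacian)
    H_low, H_high = X.copy(), X.copy()
    for _ in range(k):
        H_low = S @ H_low                # emphasizes shared, structure-aware patterns
        H_high = L @ H_high              # emphasizes node-distinctive, high-frequency detail
    return np.concatenate([H_low, H_high], axis=1)   # dual-pass embedding

Concatenating the two passes keeps embeddings structure-aware (low-pass) while preserving node distinctiveness (high-pass), the property the paper targets to counter oversmoothing.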
☆ Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan
Recent advances in audio-driven avatar video generation have significantly
enhanced audio-visual realism. However, existing methods treat instruction
conditioning merely as low-level tracking driven by acoustic or visual cues,
without modeling the communicative purpose conveyed by the instructions. This
limitation compromises their narrative coherence and character expressiveness.
To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that
unifies multimodal instruction understanding with photorealistic portrait
generation. Our approach adopts a two-stage pipeline. In the first stage, we
design a multimodal large language model (MLLM) director that produces a
blueprint video conditioned on diverse instruction signals, thereby governing
high-level semantics such as character motion and emotions. In the second
stage, guided by blueprint keyframes, we generate multiple sub-clips in
parallel using a first-last frame strategy. This global-to-local framework
preserves fine-grained details while faithfully encoding the high-level intent
behind multimodal instructions. Our parallel architecture also enables fast and
stable generation of long-duration videos, making it suitable for real-world
applications such as digital human livestreaming and vlogging. To
comprehensively evaluate our method, we construct a benchmark of 375 curated
samples covering diverse instructions and challenging scenarios. Extensive
experiments demonstrate that Kling-Avatar is capable of generating vivid,
fluent, long-duration videos at up to 1080p and 48 fps, achieving superior
performance in lip synchronization accuracy, emotion and dynamic
expressiveness, instruction controllability, identity preservation, and
cross-domain generalization. These results establish Kling-Avatar as a new
benchmark for semantically grounded, high-fidelity audio-driven avatar
synthesis.
comment: Technical Report. Project Page: https://klingavatar.github.io/
☆ ObjectReact: Learning Object-Relative Control for Visual Navigation
Sourav Garg, Dustin Craggs, Vineeth Bhat, Lachlan Mares, Stefan Podgorski, Madhava Krishna, Feras Dayoub, Ian Reid
Visual navigation using only a single camera and a topological map has
recently become an appealing alternative to methods that require additional
sensors and 3D maps. This is typically achieved through an "image-relative"
approach to estimating control from a given pair of current observation and
subgoal image. However, image-level representations of the world have
limitations because images are strictly tied to the agent's pose and
embodiment. In contrast, objects, being a property of the map, offer an
embodiment- and trajectory-invariant world representation. In this work, we
present a new paradigm of learning "object-relative" control that exhibits
several desirable characteristics: a) new routes can be traversed without
strictly requiring to imitate prior experience, b) the control prediction
problem can be decoupled from solving the image matching problem, and c) high
invariance can be achieved in cross-embodiment deployment for variations across
both training-testing and mapping-execution settings. We propose a topometric
map representation in the form of a "relative" 3D scene graph, which is used to
obtain more informative object-level global path planning costs. We train a
local controller, dubbed "ObjectReact", conditioned directly on a high-level
"WayObject Costmap" representation that eliminates the need for an explicit RGB
input. We demonstrate the advantages of learning object-relative control over
its image-relative counterpart across sensor height variations and multiple
navigation tasks that challenge the underlying spatial understanding
capability, e.g., navigating a map trajectory in the reverse direction. We
further show that our sim-only policy is able to generalize well to real-world
indoor environments. Code and supplementary material are accessible via project
page: https://object-react.github.io/
comment: CoRL 2025; 23 pages including appendix
☆ Visual Grounding from Event Cameras ICCV 2025
Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Tsang Ooi, Benoit R. Cottereau
Event cameras capture changes in brightness with microsecond precision and
remain reliable under motion blur and challenging illumination, offering clear
advantages for modeling highly dynamic scenes. Yet, their integration with
natural language understanding has received little attention, leaving a gap in
multimodal perception. To address this, we introduce Talk2Event, the first
large-scale benchmark for language-driven object grounding using event data.
Built on real-world driving scenarios, Talk2Event comprises 5,567 scenes,
13,458 annotated objects, and more than 30,000 carefully validated referring
expressions. Each expression is enriched with four structured attributes --
appearance, status, relation to the viewer, and relation to surrounding objects
-- that explicitly capture spatial, temporal, and relational cues. This
attribute-centric design supports interpretable and compositional grounding,
enabling analysis that moves beyond simple object recognition to contextual
reasoning in dynamic environments. We envision Talk2Event as a foundation for
advancing multimodal and temporally-aware perception, with applications
spanning robotics, human-AI interaction, and beyond.
comment: Abstract Paper (Non-Archival) @ ICCV 2025 NeVi Workshop
☆ PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection
To tackle the prevalence of pseudo changes, the scarcity of labeled samples,
and the difficulty of cross-domain generalization in multi-temporal and
multi-source remote sensing imagery, we propose PeftCD, a change detection
framework built upon Vision Foundation Models (VFMs) with Parameter-Efficient
Fine-Tuning (PEFT). At its core, PeftCD employs a weight-sharing Siamese
encoder derived from a VFM, into which LoRA and Adapter modules are seamlessly
integrated. This design enables highly efficient task adaptation by training
only a minimal set of additional parameters. To fully unlock the potential of
VFMs, we investigate two leading backbones: the Segment Anything Model v2
(SAM2), renowned for its strong segmentation priors, and DINOv3, a
state-of-the-art self-supervised representation learner. The framework is
complemented by a deliberately lightweight decoder, ensuring the focus remains
on the powerful feature representations from the backbones. Extensive
experiments demonstrate that PeftCD achieves state-of-the-art performance
across multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD
(92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%) and
LEVIR-CD (85.62%), with notably precise boundary delineation and strong
suppression of pseudo-changes. In summary, PeftCD presents an optimal balance
of accuracy, efficiency, and generalization. It offers a powerful and scalable
paradigm for adapting large-scale VFMs to real-world remote sensing change
detection applications. The code and pretrained models will be released at
https://github.com/dyzy41/PeftCD.
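For context, a minimal sketch of the LoRA mechanism referenced above, which adapts a frozen backbone layer through a trainable low-rank update; class names, rank, and scaling are illustrative and not taken from the PeftCD code:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: the frozen base weight is adapted by a trainable
    low-rank update, so only r * (in + out) extra parameters are trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # keep the VFM weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

Only A and B (plus the lightweight decoder) would be trained, which is what keeps the additional parameter count minimal.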
☆ Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer's Disease Classification MICCAI 2025
Magnetic resonance imaging (MRI) is the gold standard for brain imaging. Deep
learning (DL) algorithms have been proposed to aid in the diagnosis of diseases
such as Alzheimer's disease (AD) from MRI scans. However, DL algorithms can
suffer from shortcut learning, in which spurious features, not directly related
to the output label, are used for prediction. When these features are related
to protected attributes, they can lead to performance bias against
underrepresented protected groups, such as those defined by race and sex. In
this work, we explore the potential for shortcut learning and demographic bias
in DL based AD diagnosis from MRI. We first investigate if DL algorithms can
identify race or sex from 3D brain MRI scans, to establish whether race- and
sex-based distributional shifts are present. Next, we investigate
whether training set imbalance by race or sex can cause a drop in model
performance, indicating shortcut learning and bias. Finally, we conduct a
quantitative and qualitative analysis of feature attributions in different
brain regions for both the protected attribute and AD classification tasks.
Through these experiments, and using multiple datasets and DL models (ResNet
and SwinTransformer), we demonstrate the existence of both race and sex based
shortcut learning and bias in DL based AD classification. Our work lays the
foundation for fairer DL diagnostic tools in brain MRI. The code is provided at
https://github.com/acharaakshit/ShortMR
comment: FAIMI @ MICCAI 2025
☆ InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation CVPR 2025
Sirui Xu, Dongting Li, Yucheng Zhang, Xiyan Xu, Qi Long, Ziyin Wang, Yunzhi Lu, Shuchang Dong, Hezi Jiang, Akshat Gupta, Yu-Xiong Wang, Liang-Yan Gui
While large-scale human motion capture datasets have advanced human motion
generation, modeling and generating dynamic 3D human-object interactions (HOIs)
remain challenging due to dataset limitations. Existing datasets often lack
extensive, high-quality motion and annotation and exhibit artifacts such as
contact penetration, floating, and incorrect hand motions. To address these
issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset
and methodological advancements. First, we consolidate and standardize 21.81
hours of HOI data from diverse sources, enriching it with detailed textual
annotations. Second, we propose a unified optimization framework to enhance
data quality by reducing artifacts and correcting hand motions. Leveraging the
principle of contact invariance, we maintain human-object relationships while
introducing motion variations, expanding the dataset to 30.70 hours. Third, we
define six benchmarking tasks and develop a unified HOI generative modeling
perspective, achieving state-of-the-art performance. Extensive experiments
validate the utility of our dataset as a foundational resource for advancing 3D
human-object interaction generation. To support continued research in this
area, the dataset is publicly available at
https://github.com/wzyabcas/InterAct, and will be actively maintained.
comment: CVPR 2025
☆ Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders
Video diffusion models have advanced rapidly in recent years as a result
of a series of architectural innovations (e.g., diffusion transformers) and the use
of novel training objectives (e.g., flow matching). In contrast, less attention
has been paid to improving the feature representation power of such models. In
this work, we show that training video diffusion models can benefit from
aligning the intermediate features of the video generator with feature
representations of pre-trained vision encoders. We propose a new metric and
conduct an in-depth analysis of various vision encoders to evaluate their
discriminability and temporal consistency, thereby assessing their suitability
for video feature alignment. Based on the analysis, we present Align4Gen which
provides a novel multi-feature fusion and alignment method integrated into
video diffusion model training. We evaluate Align4Gen both for unconditional
and class-conditional video generation tasks and show that it results in
improved video generation as quantified by various metrics. Full video results
are available on our project page: https://align4gen.github.io/align4gen/
comment: 17 pages, 14 figures
☆ DualTrack: Sensorless 3D Ultrasound needs Local and Global Context
Paul F. R. Wilson, Matteo Ronchetti, Rüdiger Göbl, Viktoria Markova, Sebastian Rosenzweig, Raphael Prevost, Parvin Mousavi, Oliver Zettinig
Three-dimensional ultrasound (US) offers many clinical advantages over
conventional 2D imaging, yet its widespread adoption is limited by the cost and
complexity of traditional 3D systems. Sensorless 3D US, which uses deep
learning to estimate a 3D probe trajectory from a sequence of 2D US images, is
a promising alternative. Local features, such as speckle patterns, can help
predict frame-to-frame motion, while global features, such as coarse shapes and
anatomical structures, can situate the scan relative to anatomy and help
predict its general shape. In prior approaches, global features are either
ignored or tightly coupled with local feature extraction, restricting the
ability to robustly model these two complementary aspects. We propose
DualTrack, a novel dual-encoder architecture that leverages decoupled local and
global encoders specialized for their respective scales of feature extraction.
The local encoder uses dense spatiotemporal convolutions to capture
fine-grained features, while the global encoder utilizes an image backbone
(e.g., a 2D CNN or foundation model) and temporal attention layers to embed
high-level anatomical features and long-range dependencies. A lightweight
fusion module then combines these features to estimate the trajectory.
Experimental results on a large public benchmark show that DualTrack achieves
state-of-the-art accuracy and globally consistent 3D reconstructions,
outperforming previous methods and yielding an average reconstruction error
below 5 mm.
☆ Generative Diffusion Contrastive Network for Multi-View Clustering ICASSP2026
Jian Zhu, Xin Zou, Xi Wang, Ning Zhang, Bian Wu, Yao Yang, Ying Zhou, Lingfang Zeng, Chang Tang, Cheng Luo
In recent years, Multi-View Clustering (MVC) has been significantly advanced
under the influence of deep learning. By integrating heterogeneous data from
multiple views, MVC enhances clustering analysis, making multi-view fusion
critical to clustering performance. However, multi-view fusion suffers from
low-quality data, which primarily arises for two reasons: 1) certain views are
contaminated by noisy data, and 2) some views suffer from missing data. This
paper proposes a novel Stochastic Generative Diffusion Fusion (SGDF)
method to address this problem. SGDF leverages a multiple generative mechanism
for the multi-view feature of each sample. It is robust to low-quality data.
Building on SGDF, we further present the Generative Diffusion Contrastive
Network (GDCN). Extensive experiments show that GDCN achieves
state-of-the-art results in deep MVC tasks. The source code is publicly
available at https://github.com/HackerHyper/GDCN.
comment: This paper is submitted to International Conference on Acoustics,
Speech, and Signal Processing (ICASSP2026)
☆ Explainable AI for Accelerated Microstructure Imaging: A SHAP-Guided Protocol on the Connectome 2.0 scanner IEEE
Quentin Uhl, Tommaso Pavan, Julianna Gerold, Kwok-Shing Chan, Yohan Jun, Shohei Fujita, Aneri Bhatt, Yixin Ma, Qiaochu Wang, Hong-Hsi Lee, Susie Y. Huang, Berkin Bilgic, Ileana Jelescu
The diffusion MRI Neurite Exchange Imaging model offers a promising framework
for probing gray matter microstructure by estimating parameters such as
compartment sizes, diffusivities, and inter-compartmental water exchange time.
However, existing protocols require long scan times. This study proposes a
reduced acquisition scheme for the Connectome 2.0 scanner that preserves model
accuracy while substantially shortening scan duration. We developed a
data-driven framework using explainable artificial intelligence with a guided
recursive feature elimination strategy to identify an optimal 8-feature subset
from a 15-feature protocol. The performance of this optimized protocol was
validated in vivo and benchmarked against the full acquisition and alternative
reduction strategies. Parameter accuracy, preservation of anatomical contrast,
and test-retest reproducibility were assessed. The reduced protocol yielded
parameter estimates and cortical maps comparable to the full protocol, with low
estimation errors in synthetic data and minimal impact on test-retest
variability. Compared to theory-driven and heuristic reduction schemes, the
optimized protocol demonstrated superior robustness, reducing the deviation in
water exchange time estimates by over two-fold. In conclusion, this hybrid
optimization framework enables viable imaging of neurite exchange in 14 minutes
without loss of parameter fidelity. This approach supports the broader
application of exchange-sensitive diffusion magnetic resonance imaging in
neuroscience and clinical research, and offers a generalizable method for
designing efficient acquisition protocols in biophysical parameter mapping.
comment: Submitted to IEEE Transactions on Medical Imaging (TMI). This
all-in-one version includes supplementary materials. 18 pages, 14 figures, 2
tables
☆ Region-Wise Correspondence Prediction between Manga Line Art Images
Understanding region-wise correspondence between manga line art images is a
fundamental task in manga processing, enabling downstream applications such as
automatic line art colorization and in-between frame generation. However, this
task remains largely unexplored, especially in realistic scenarios without
pre-existing segmentation or annotations. In this paper, we introduce a novel
and practical task: predicting region-wise correspondence between raw manga
line art images without any pre-existing labels or masks. To tackle this
problem, we divide each line art image into a set of patches and propose a
Transformer-based framework that learns patch-level similarities within and
across images. We then apply edge-aware clustering and a region matching
algorithm to convert patch-level predictions into coherent region-level
correspondences. To support training and evaluation, we develop an automatic
annotation pipeline and manually refine a subset of the data to construct
benchmark datasets. Experiments on multiple datasets demonstrate that our
method achieves high patch-level accuracy (e.g., 96.34%) and generates
consistent region-level correspondences, highlighting its potential for
real-world manga applications.
☆ Improving Human Motion Plausibility with Body Momentum BMVC 2025
Many studies decompose human motion into local motion in a frame attached to
the root joint and global motion of the root joint in the world frame, treating
them separately. However, these two components are not independent. Global
movement arises from interactions with the environment, which are, in turn,
driven by changes in the body configuration. Motion models often fail to
precisely capture this physical coupling between local and global dynamics,
while deriving global trajectories from joint torques and external forces is
computationally expensive and complex. To address these challenges, we propose
using whole-body linear and angular momentum as a constraint to link local
motion with global movement. Since momentum reflects the aggregate effect of
joint-level dynamics on the body's movement through space, it provides a
physically grounded way to relate local joint behavior to global displacement.
Building on this insight, we introduce a new loss term that enforces
consistency between the generated momentum profiles and those observed in
ground-truth data. Incorporating our loss reduces foot sliding and jitter,
improves balance, and preserves the accuracy of the recovered motion. Code and
data are available at the project page https://hlinhn.github.io/momentum_bmvc.
comment: Accepted at BMVC 2025
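A simplified sketch of a momentum-consistency loss in the spirit described above, using a per-joint point-mass approximation; the paper's exact momentum computation (e.g., with segment inertias) and loss weighting may differ:

import torch

def body_momentum(pos, mass, dt):
    """Point-mass approximation of whole-body momentum (a simplification).
    pos: (T, J, 3) joint positions, mass: (J,) per-joint masses."""
    vel = (pos[1:] - pos[:-1]) / dt                         # (T-1, J, 3) joint velocities
    com = (mass[None, :, None] * pos).sum(1) / mass.sum()   # (T, 3) center of mass
    lin = (mass[None, :, None] * vel).sum(1)                # (T-1, 3) linear momentum
    r = pos[:-1] - com[:-1, None, :]                        # joint positions relative to the CoM
    ang = (mass[None, :, None] * torch.cross(r, vel, dim=-1)).sum(1)  # (T-1, 3) angular momentum
    return lin, ang

def momentum_loss(pred_pos, gt_pos, mass, dt):
    # Penalize the gap between generated and ground-truth momentum profiles.
    lp, ap = body_momentum(pred_pos, mass, dt)
    lg, ag = body_momentum(gt_pos, mass, dt)
    return torch.mean((lp - lg) ** 2) + torch.mean((ap - ag) ** 2)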
☆ OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection
Victor Livernoche, Akshatha Arodi, Andreea Musulan, Zachary Yang, Adam Salvail, Gaétan Marceau Caron, Jean-François Godbout, Reihaneh Rabbany
Deepfakes, synthetic media created using advanced AI techniques, have
intensified the spread of misinformation, particularly in politically sensitive
contexts. Existing deepfake detection datasets are often limited, relying on
outdated generation methods, low realism, or single-face imagery, restricting
their effectiveness for general synthetic image detection. By analyzing social
media posts, we identify multiple modalities through which deepfakes propagate
misinformation. Furthermore, our human perception study demonstrates that
recently developed proprietary models produce synthetic images increasingly
indistinguishable from real ones, complicating accurate identification by the
general public. Consequently, we present a comprehensive, politically-focused
dataset specifically crafted for benchmarking detection against modern
generative models. This dataset contains three million real images paired with
descriptive captions, which are used for generating 963k corresponding
high-quality synthetic images from a mix of proprietary and open-source models.
Recognizing the continual evolution of generative techniques, we introduce an
innovative crowdsourced adversarial platform, where participants are
incentivized to generate and submit challenging synthetic images. This ongoing
community-driven initiative ensures that deepfake detection methods remain
robust and adaptive, proactively safeguarding public discourse from
sophisticated misinformation threats.
comment: 25 pages, 12 figures
☆ In-Loop Filtering Using Learned Look-Up Tables for Video Coding
In-loop filtering (ILF) is a key technology in video coding standards to
reduce artifacts and enhance visual quality. Recently, neural network-based ILF
schemes have achieved remarkable coding gains, emerging as a powerful candidate
for next-generation video coding standards. However, the use of deep neural
networks (DNN) brings significant computational and time complexity or high
demands for dedicated hardware, making it challenging for general use. To
address this limitation, we study a practical ILF solution by adopting look-up
tables (LUTs). After training a DNN with a restricted reference range for ILF,
all possible inputs are traversed, and the output values of the DNN are cached
into LUTs. During the coding process, the filtering process is performed by
simply retrieving the filtered pixel through locating the input pixels and
interpolating between the cached values, instead of relying on heavy inference
computations. In this paper, we propose a universal LUT-based ILF framework,
termed LUT-ILF++. First, we introduce the cooperation of multiple kinds of
filtering LUTs and propose a series of customized indexing mechanisms to enable
better filtering reference perception with limited storage consumption. Second,
we propose the cross-component indexing mechanism to enable the filtering of
different color components jointly. Third, in order to make our solution
practical for coding uses, we propose the LUT compaction scheme to enable the
LUT pruning, achieving a lower storage cost of the entire solution. The
proposed framework is implemented in the VVC reference software. Experimental
results show that the proposed framework achieves on average 0.82%/2.97%/1.63%
and 0.85%/4.11%/2.06% bitrate reduction for common test sequences, under the AI
and RA configurations, respectively. Compared to DNN-based solutions, our
proposed solution has much lower time complexity and storage cost.
comment: 25 pages
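An illustrative sketch of the cache-and-interpolate idea (not the LUT-ILF++ indexing design): a trained filter f over a coarsely quantized pair of inputs is tabulated once, then applied at coding time by lookup and bilinear interpolation instead of network inference:

import numpy as np

def build_lut(f, bits=4, bitdepth=8):
    # Tabulate f(center, neighbor_avg) over coarsely quantized pixel values.
    step = 1 << (bitdepth - bits)                 # quantization step of the LUT index
    levels = np.arange(0, 1 << bitdepth, step)
    lut = np.zeros((len(levels), len(levels)), dtype=np.float32)
    for i, c in enumerate(levels):
        for j, n in enumerate(levels):
            lut[i, j] = f(c, n)                   # cache the trained filter's output
    return lut, step

def apply_lut(lut, step, center, neighbor_avg):
    # Locate the surrounding cached entries and bilinearly interpolate.
    i, j = center // step, neighbor_avg // step
    fi, fj = (center % step) / step, (neighbor_avg % step) / step
    i2, j2 = min(i + 1, lut.shape[0] - 1), min(j + 1, lut.shape[1] - 1)
    return ((1 - fi) * (1 - fj) * lut[i, j] + fi * (1 - fj) * lut[i2, j]
            + (1 - fi) * fj * lut[i, j2] + fi * fj * lut[i2, j2])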
☆ Resource-Efficient Glioma Segmentation on Sub-Saharan MRI
Freedmore Sidume, Oumayma Soula, Joseph Muthui Wacira, YunFei Zhu, Abbas Rabiu Muhammad, Abderrazek Zeraii, Oluwaseun Kalejaye, Hajer Ibrahim, Olfa Gaddour, Brain Halubanza, Dong Zhang, Udunna C Anazodo, Confidence Raymond
Gliomas are the most prevalent type of primary brain tumors, and their
accurate segmentation from MRI is critical for diagnosis, treatment planning,
and longitudinal monitoring. However, the scarcity of high-quality annotated
imaging data in Sub-Saharan Africa (SSA) poses a significant challenge for
deploying advanced segmentation models in clinical workflows. This study
introduces a robust and computationally efficient deep learning framework
tailored for resource-constrained settings. We leveraged a 3D Attention UNet
architecture augmented with residual blocks and enhanced through transfer
learning from pre-trained weights on the BraTS 2021 dataset. Our model was
evaluated on 95 MRI cases from the BraTS-Africa dataset, a benchmark for glioma
segmentation in SSA MRI data. Despite the limited data quality and quantity,
our approach achieved Dice scores of 0.76 for the Enhancing Tumor (ET), 0.80
for Necrotic and Non-Enhancing Tumor Core (NETC), and 0.85 for Surrounding
Non-Functional Hemisphere (SNFH). These results demonstrate the
generalizability of the proposed model and its potential to support clinical
decision making in low-resource settings. The compact architecture,
approximately 90 MB, and sub-minute per-volume inference time on consumer-grade
hardware further underscore its practicality for deployment in SSA health
systems. This work contributes toward closing the gap in equitable AI for
global health by empowering underserved regions with high-performing and
accessible medical imaging solutions.
comment: 11 pages, 7 figures
☆ FlexiD-Fuse: Flexible number of inputs multi-modal medical image fusion based on diffusion model
Different modalities of medical images provide unique physiological and
anatomical information for diseases. Multi-modal medical image fusion
integrates useful information from different complementary medical images with
different modalities, producing a fused image that comprehensively and
objectively reflects lesion characteristics to assist doctors in clinical
diagnosis. However, existing fusion methods can only handle a fixed number of
modality inputs, such as accepting only two-modal or tri-modal inputs, and
cannot directly process varying input quantities, which hinders their
application in clinical settings. To tackle this issue, we introduce
FlexiD-Fuse, a diffusion-based image fusion network designed to accommodate
flexible quantities of input modalities. It can process two-modal and
tri-modal medical image fusion end-to-end with the same set of weights. FlexiD-Fuse
transforms the diffusion fusion problem, which supports only fixed-condition
inputs, into a maximum likelihood estimation problem based on the diffusion
process and hierarchical Bayesian modeling. By incorporating the
Expectation-Maximization algorithm into the diffusion sampling iteration
process, FlexiD-Fuse can generate high-quality fused images with cross-modal
information from source images, independently of the number of input images. We
compared the latest two-modal and tri-modal medical image fusion methods, tested them
on Harvard datasets, and evaluated them using nine popular metrics. The
experimental results show that our method achieves the best performance in
medical image fusion with varying inputs. Meanwhile, we conducted extensive
extension experiments on infrared-visible, multi-exposure, and multi-focus
image fusion tasks with arbitrary numbers of inputs, and compared them with the
respective SOTA methods. The results of the extension experiments consistently
demonstrate the effectiveness and superiority of our method.
☆ Semantic Concentration for Self-Supervised Dense Representations Learning
Recent advances in image-level self-supervised learning (SSL) have made
significant progress, yet learning dense representations for patches remains
challenging. Mainstream methods encounter an over-dispersion phenomenon that
patches from the same instance/category scatter, harming downstream performance
on dense tasks. This work reveals that image-level SSL avoids over-dispersion
by involving implicit semantic concentration. Specifically, the non-strict
spatial alignment ensures intra-instance consistency, while shared patterns,
i.e., similar parts of within-class instances in the input space, ensure
inter-image consistency. Unfortunately, these approaches are infeasible for
dense SSL due to their spatial sensitivity and complicated scene-centric data.
These observations motivate us to explore explicit semantic concentration for
dense SSL. First, to break the strict spatial alignment, we propose to distill
the patch correspondences. Facing noisy and imbalanced pseudo labels, we
propose a noise-tolerant ranking loss. The core idea is extending the Average
Precision (AP) loss to continuous targets, such that its decision-agnostic and
adaptive focusing properties prevent the student model from being misled.
Second, to discriminate the shared patterns from complicated scenes, we propose
the object-aware filter to map the output space to an object-based space.
Specifically, patches are represented by learnable prototypes of objects via
cross-attention. Last but not least, empirical studies across various tasks
soundly support the effectiveness of our method. Code is available in
https://github.com/KID-7391/CoTAP.
☆ FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution
As an influential information fusion and low-level vision technique, image
fusion integrates complementary information from source images to yield an
informative fused image. A few attempts have been made in recent years to
jointly realize image fusion and super-resolution. However, in real-world
applications such as military reconnaissance and long-range detection missions,
the target and background structures in multimodal images are easily corrupted,
with low resolution and weak semantic information, which leads to suboptimal
results in current fusion techniques. In response, we propose FS-Diff, a
semantic guidance and clarity-aware joint image fusion and super-resolution
method. FS-Diff unifies image fusion and super-resolution as a conditional
generation problem. It leverages semantic guidance from the proposed clarity
sensing mechanism for adaptive low-resolution perception and cross-modal
feature extraction. Specifically, we initialize the desired fused result as
pure Gaussian noise and introduce the bidirectional feature Mamba to extract
the global features of the multimodal images. Moreover, utilizing the source
images and semantics as conditions, we implement a random iterative denoising
process via a modified U-Net network. This network is trained for denoising at
multiple noise levels to produce high-resolution fusion results with
cross-modal features and abundant semantic information. We also construct a
powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images.
Extensive joint image fusion and super-resolution experiments on six public and
our AVMS datasets demonstrated that FS-Diff outperforms the state-of-the-art
methods at multiple magnifications and can recover richer details and semantics
in the fused images. The code is available at
https://github.com/XylonXu01/FS-Diff.
☆ Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift
Medical vision-language models (VLMs) offer promise for clinical decision
support, yet their reliability under distribution shifts remains a major
concern for safe deployment. These models often learn task-agnostic
correlations due to variability in imaging protocols and free-text reports,
limiting their generalizability and increasing the risk of failure in
real-world settings. We propose DRiFt, a structured feature decoupling
framework that explicitly separates clinically relevant signals from
task-agnostic noise using parameter-efficient tuning (LoRA) and learnable
prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we
curate high-quality, clinically grounded image-text pairs by generating
captions for a diverse medical dataset. Our approach improves in-distribution
performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based
methods, while maintaining strong robustness across unseen datasets. Ablation
studies reveal that disentangling task-relevant features and careful alignment
significantly enhance model generalization and reduce unpredictable behavior
under domain shift. These insights contribute toward building safer, more
trustworthy VLMs for clinical use. The code is available at
https://github.com/rumaima/DRiFt.
☆ Unsupervised Integrated-Circuit Defect Segmentation via Image-Intrinsic Normality
Modern Integrated-Circuit (IC) manufacturing introduces diverse, fine-grained
defects that reduce yield and reliability. Most industrial defect segmentation
methods compare a test image against an external normal set, a strategy that is
brittle for IC imagery where layouts vary across products and accurate
alignment is difficult. We observe that defects are predominantly local, while
each image still contains rich, repeatable normal patterns. We therefore
propose an unsupervised IC defect segmentation framework that requires no
external normal support. A learnable normal-information extractor aggregates
representative normal features from the test image, and a coherence loss
enforces their association with normal regions. Guided by these features, a
decoder reconstructs only normal content; the reconstruction residual then
segments defects. Pseudo-anomaly augmentation further stabilizes training.
Experiments on datasets from three IC process stages show consistent
improvements over existing approaches and strong robustness to product
variability.
☆ A Fully Automatic Framework for Intracranial Pressure Grading: Integrating Keyframe Identification, ONSD Measurement and Clinical Data
Intracranial pressure (ICP) elevation poses severe threats to cerebral
function, thus necessitating monitoring for timely intervention. While lumbar
puncture is the gold standard for ICP measurement, its invasiveness and
associated risks drive the need for non-invasive alternatives. Optic nerve
sheath diameter (ONSD) has emerged as a promising biomarker, as elevated ICP
directly correlates with increased ONSD. However, current clinical practices
for ONSD measurement suffer from inconsistency in manual operation,
subjectivity in optimal view selection, and variability in thresholding,
limiting their reliability. To address these challenges, we introduce a fully
automatic two-stage framework for ICP grading, integrating keyframe
identification, ONSD measurement and clinical data. Specifically, the fundus
ultrasound video processing stage performs frame-level anatomical segmentation,
rule-based keyframe identification guided by an international consensus
statement, and precise ONSD measurement. The intracranial pressure grading
stage then fuses ONSD metrics with clinical features to enable the prediction
of ICP grades, thereby demonstrating an innovative blend of interpretable
ultrasound analysis and multi-source data integration for objective clinical
evaluation. Experimental results demonstrate that our method achieves a
validation accuracy of $0.845 \pm 0.071$ (with standard deviation from
five-fold cross-validation) and an independent test accuracy of 0.786,
significantly outperforming conventional threshold-based method ($0.637 \pm
0.111$ validation accuracy, $0.429$ test accuracy). Through effectively
reducing operator variability and integrating multi-source information, our
framework establishes a reliable non-invasive approach for clinical ICP
evaluation, holding promise for improving patient management in acute
neurological conditions.
☆ Plug-and-play Diffusion Models for Image Compressive Sensing with Data Consistency Projection
We explore the connection between Plug-and-Play (PnP) methods and Denoising
Diffusion Implicit Models (DDIM) for solving ill-posed inverse problems, with a
focus on single-pixel imaging. We begin by identifying key distinctions between
PnP and diffusion models, particularly in their denoising mechanisms and
sampling procedures. By decoupling the diffusion process into three
interpretable stages: denoising, data consistency enforcement, and sampling, we
provide a unified framework that integrates learned priors with physical
forward models in a principled manner. Building upon this insight, we propose a
hybrid data-consistency module that linearly combines multiple PnP-style
fidelity terms. This hybrid correction is applied directly to the denoised
estimate, improving measurement consistency without disrupting the diffusion
sampling trajectory. Experimental results on single-pixel imaging tasks
demonstrate that our method achieves better reconstruction quality.
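A rough sketch of the decoupled step structure described above: one DDIM update with a hybrid data-consistency correction applied to the denoised estimate for measurements y = A x. The combination weights, step sizes, and denoiser interface are assumptions rather than the paper's exact algorithm:

import numpy as np

def ddim_step_with_dc(x_t, t, t_prev, eps_model, A, y, alphas_bar, lam=(0.5, 0.5)):
    a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
    eps = eps_model(x_t, t)                                    # predicted noise
    x0_hat = (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)     # denoised estimate

    # Two PnP-style fidelity corrections on the denoised estimate.
    grad_corr = x0_hat - A.T @ (A @ x0_hat - y)                # gradient step on ||Ax - y||^2 (unit step size assumed)
    proj_corr = x0_hat + np.linalg.pinv(A) @ (y - A @ x0_hat)  # least-squares projection onto {x : Ax = y}
    x0_dc = lam[0] * grad_corr + lam[1] * proj_corr            # hybrid linear combination

    # Deterministic DDIM update using the corrected estimate.
    return np.sqrt(a_prev) * x0_dc + np.sqrt(1 - a_prev) * eps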
☆ Texture-aware Intrinsic Image Decomposition with Model- and Learning-based Priors
This paper aims to recover the intrinsic reflectance layer and shading layer
given a single image. Though this intrinsic image decomposition problem has
been studied for decades, it remains a significant challenge in cases of
complex scenes, i.e., spatially-varying lighting effects and rich textures. In
this paper, we propose a novel method for handling severe lighting and rich
textures in intrinsic image decomposition, which enables the production of
high-quality intrinsic images for real-world images. Specifically, we observe
that previous learning-based methods tend to produce texture-less and
over-smoothed intrinsic images, which can be used to infer the lighting and
texture information given an RGB image. Based on this observation, we design a texture-guided
regularization term and formulate the decomposition problem into an
optimization framework, to separate the material textures and lighting effect.
We demonstrate that combining the novel texture-aware prior can produce
superior results to existing approaches.
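One common way to write such a texture-guided decomposition objective in the log domain; the specific regularizers and weights here are illustrative, and the paper's formulation may differ:

\[
\min_{R,\,S}\; \big\|\log I - \log R - \log S\big\|_2^2
\;+\; \lambda_R \sum_{p}\sum_{q \in \mathcal{N}(p)} w_{pq}\,\big(\log R_p - \log R_q\big)^2
\;+\; \lambda_S \,\big\|\nabla \log S\big\|_2^2,
\]

where $I = R \odot S$ and the pairwise weights $w_{pq}$ are derived from the texture information inferred by the learning-based prior, so that reflectance smoothing is relaxed across genuine texture edges while shading remains spatially smooth.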
☆ Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles
Road traffic accidents remain a significant global concern, with human error,
particularly distracted and impaired driving, among the leading causes. This
study introduces a novel driver behavior classification system that uses
external observation techniques to detect indicators of distraction and
impairment. The proposed framework employs advanced computer vision
methodologies, including real-time object tracking, lateral displacement
analysis, and lane position monitoring. The system identifies unsafe driving
behaviors such as excessive lateral movement and erratic trajectory patterns by
implementing the YOLO object detection model and custom lane estimation
algorithms. Unlike systems reliant on inter-vehicular communication, this
vision-based approach enables behavioral analysis of non-connected vehicles.
Experimental evaluations on diverse video datasets demonstrate the framework's
reliability and adaptability across varying road and environmental conditions.
☆ OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning
Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yuzheng Zhuang, Bowen Yang, He Zhu, Lingfeng Zhang, Pengwei Xie, David Gamaliel Arcos Bravo, Yingxue Zhang, Jianye Hao, Xingyue Quan
Recent advances in multimodal large language models (MLLMs) have opened new
opportunities for embodied intelligence, enabling multimodal understanding,
reasoning, and interaction, as well as continuous spatial decision-making.
Nevertheless, current MLLM-based embodied systems face two critical
limitations. First, Geometric Adaptability Gap: models trained solely on 2D
inputs or with hard-coded 3D geometry injection suffer from either insufficient
spatial information or restricted 2D generalization, leading to poor
adaptability across tasks with diverse spatial demands. Second, Embodiment
Constraint Gap: prior work often neglects the physical constraints and
capacities of real robots, resulting in task plans that are theoretically valid
but practically infeasible. To address these gaps, we introduce OmniEVA -- an
embodied versatile planner that enables advanced embodied reasoning and task
planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding
mechanism, which introduces a gated router to perform explicit selective
regulation of 3D fusion based on contextual requirements, enabling
context-aware 3D grounding for diverse embodied tasks. (2) an Embodiment-Aware
Reasoning framework that jointly incorporates task goals and embodiment
constraints into the reasoning loop, resulting in planning decisions that are
both goal-directed and executable. Extensive experimental results demonstrate
that OmniEVA not only achieves state-of-the-art general embodied reasoning
performance, but also generalizes strongly across a wide range of downstream
scenarios. Evaluations on a suite of proposed embodied benchmarks,
including both primitive and composite tasks, confirm its robust and versatile
planning capabilities. Project page: https://omnieva.github.io
☆ Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment MICCAI 2025
Dimitrios Anastasiou, Razvan Caramalau, Nazir Sirajudeen, Matthew Boal, Philip Edwards, Justin Collins, John Kelly, Ashwin Sridhar, Maxine Tran, Faiz Mumtaz, Nevil Pavithran, Nader Francis, Danail Stoyanov, Evangelos B. Mazomenos
Automated surgical skill assessment (SSA) is a central task in surgical
computer vision. Developing robust SSA models is challenging due to the
scarcity of skill annotations, which are time-consuming to produce and require
expert consensus. Few-shot learning (FSL) offers a scalable alternative
enabling model development with minimal supervision, though its success
critically depends on effective pre-training. While widely studied for several
surgical downstream tasks, pre-training has remained largely unexplored in SSA.
In this work, we formulate SSA as a few-shot task and investigate how
self-supervised pre-training strategies affect downstream few-shot SSA
performance. We annotate a publicly available robotic surgery dataset with
Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate
various pre-training sources across three few-shot settings. We quantify domain
similarity and analyze how domain gap and the inclusion of procedure-specific
data into pre-training influence transferability. Our results show that small
but domain-relevant datasets can outperform large-scale, less aligned ones,
achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot
settings, respectively. Moreover, incorporating procedure-specific data into
pre-training with a domain-relevant external dataset significantly boosts
downstream performance, with an average gain of +1.22% in accuracy and +2.28%
in F1-score; however, applying the same strategy with less similar but
large-scale sources can instead lead to performance degradation. Code and
models are available at https://github.com/anastadimi/ssa-fsl.
comment: Accepted at MICCAI 2025 DEMI Workshop
☆ Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM
Generative AI is reshaping complex industrial workflows, with large multimodal
models (LMMs) increasingly empowering fashion design in the garment industry.
Current generative AI models can readily turn brainstorming into polished
designs, but fine-grained customization still suffers from textual ambiguity
when end-users lack professional domain knowledge. We therefore propose the
Better Understanding Generation (BUG) workflow, which uses an LMM to
automatically create and fine-grain customize clothing designs from chat via an
image-into-prompt mechanism. Our framework unleashes users' creative potential
beyond words and lowers the barrier to clothing design and editing without
further human involvement. To demonstrate the effectiveness of our model, we
propose a new FashionEdit dataset that simulates the real-world clothing design
workflow, evaluated in terms of generation similarity, user satisfaction, and
quality. The code and dataset are available at
https://github.com/detectiveli/FashionEdit.
☆ Image Recognition with Vision and Language Embeddings of VLMs
Vision-language models (VLMs) have enabled strong zero-shot classification
through image-text alignment. Yet, their purely visual inference capabilities
remain under-explored. In this work, we conduct a comprehensive evaluation of
both language-guided and vision-only image classification with a diverse set of
dual-encoder VLMs, including both well-established and recent models such as
SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the
ImageNet-1k validation set and its label-corrected variant. The key factors
affecting accuracy are analysed, including prompt design, class diversity, the
number of neighbours in k-NN, and reference set size. We show that language and
vision offer complementary strengths, with some classes favouring textual
prompts and others better handled by visual similarity. To exploit this
complementarity, we introduce a simple, learning-free fusion method based on
per-class precision that improves classification performance. The code is
available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
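As a concrete illustration of the learning-free fusion idea described above, the sketch below weights language-guided and vision-only (k-NN) score vectors by per-class precision estimated on a held-out split. The variable names, the weighting rule, and the synthetic scores are illustrative assumptions rather than the authors' exact procedure.

```python
# Minimal sketch: fusing language-guided and vision-only (k-NN) classifier scores
# with per-class precision weights estimated on a held-out split.
# All names and the weighting rule are illustrative assumptions.
import numpy as np

def per_class_precision(pred, labels, num_classes):
    """Precision of each predicted class on a validation split."""
    prec = np.zeros(num_classes)
    for c in range(num_classes):
        total = np.sum(pred == c)
        prec[c] = np.sum((pred == c) & (labels == c)) / total if total > 0 else 0.0
    return prec

def fuse(text_scores, knn_scores, text_prec, knn_prec):
    """Weight each modality's score vector by the precision of its argmax class."""
    w_t = text_prec[text_scores.argmax(axis=1)][:, None]  # confidence of the text prediction
    w_v = knn_prec[knn_scores.argmax(axis=1)][:, None]    # confidence of the vision prediction
    return w_t * text_scores + w_v * knn_scores

# Example with random scores for 5 samples and 10 classes
rng = np.random.default_rng(0)
text_scores = rng.random((5, 10))
knn_scores = rng.random((5, 10))
val_labels = rng.integers(0, 10, size=200)
text_prec = per_class_precision(rng.integers(0, 10, size=200), val_labels, 10)
knn_prec = per_class_precision(rng.integers(0, 10, size=200), val_labels, 10)
print(fuse(text_scores, knn_scores, text_prec, knn_prec).argmax(axis=1))
```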
☆ You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
Collaborative perception enables vehicles to overcome individual perception
limitations by sharing information, allowing them to see further and through
occlusions. In real-world scenarios, models on different vehicles are often
heterogeneous due to manufacturer variations. Existing methods for
heterogeneous collaborative perception address this challenge by fine-tuning
adapters or the entire network to bridge the domain gap. However, these methods
are impractical in real-world applications, as each new collaborator must
undergo joint training with the ego vehicle on a dataset before inference, or
the ego vehicle stores models for all potential collaborators in advance.
Therefore, we pose a new question: Can we tackle this challenge directly during
inference, eliminating the need for joint training? To answer this, we
introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel
framework that formulates the problem as few-shot unsupervised domain
adaptation. Unlike previous work, PHCP dynamically aligns features by
self-training an adapter during inference, eliminating the need for labeled
data and joint training. Extensive experiments on the OPV2V dataset demonstrate
that PHCP achieves strong performance across diverse heterogeneous scenarios.
Notably, PHCP achieves performance comparable to SOTA methods trained on the
entire dataset while using only a small amount of unlabeled data.
☆ Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
Zhengzhao Lai, Youbin Zheng, Zhenyang Cai, Haonan Lyu, Jinpu Yang, Hongqing Liang, Yan Hu, Benyou Wang
Materials characterization is fundamental to acquiring materials information,
revealing the processing-microstructure-property relationships that guide
material design and optimization. While multimodal large language models
(MLLMs) have recently shown promise in generative and predictive tasks within
materials science, their capacity to understand real-world characterization
imaging data remains underexplored. To bridge this gap, we present MatCha, the
first benchmark for materials characterization image understanding, comprising
1,500 questions that demand expert-level domain knowledge. MatCha encompasses
four key stages of materials research comprising 21 distinct tasks, each
designed to reflect authentic challenges faced by materials scientists. Our
evaluation of state-of-the-art MLLMs on MatCha reveals a significant
performance gap compared to human experts. These models exhibit degradation
when addressing questions requiring higher-level expertise and sophisticated
visual perception. Simple few-shot and chain-of-thought prompting struggle to
alleviate these limitations. These findings highlight that existing MLLMs still
exhibit limited adaptability to real-world materials characterization
scenarios. We hope MatCha will facilitate future research in areas such as new
material discovery and autonomous scientific agents. MatCha is available at
https://github.com/FreedomIntelligence/MatCha.
☆ Learning Object-Centric Representations in SAR Images with Multi-Level Feature Fusion
Synthetic aperture radar (SAR) images contain not only targets of interest
but also complex background clutter, including terrain reflections and speckle
noise. In many cases, such clutter exhibits intensity and patterns that
resemble targets, leading models to extract entangled or spurious features.
Such behavior undermines the ability to form clear target representations,
regardless of the classifier. To address this challenge, we propose a novel
object-centric learning (OCL) framework, named SlotSAR, that disentangles
target representations from background clutter in SAR images without mask
annotations. SlotSAR first extracts high-level semantic features from SARATR-X
and low-level scattering features from the wavelet scattering network in order
to obtain complementary multi-level representations for robust target
characterization. We further present a multi-level slot attention module that
integrates these low- and high-level features to enhance slot-wise
representation distinctiveness, enabling effective OCL. Experimental results
demonstrate that SlotSAR achieves state-of-the-art performance in SAR imagery
by preserving structural details compared to existing OCL methods.
comment: 12 pages, 5 figures
☆ Model-Agnostic Open-Set Air-to-Air Visual Object Detection for Reliable UAV Perception
Open-set detection is crucial for robust UAV autonomy in air-to-air object
detection under real-world conditions. Traditional closed-set detectors degrade
significantly under domain shifts and flight data corruption, posing risks to
safety-critical applications. We propose a novel, model-agnostic open-set
detection framework designed specifically for embedding-based detectors. The
method explicitly handles unknown object rejection while maintaining robustness
against corrupted flight data. It estimates semantic uncertainty via entropy
modeling in the embedding space and incorporates spectral normalization and
temperature scaling to enhance open-set discrimination. We validate our
approach on the challenging AOT aerial benchmark and through extensive
real-world flight tests. Comprehensive ablation studies demonstrate consistent
improvements over baseline methods, achieving up to a 10\% relative AUROC gain
compared to standard YOLO-based detectors. Additionally, we show that
background rejection further strengthens robustness without compromising
detection accuracy, making our solution particularly well-suited for reliable
UAV perception in dynamic air-to-air environments.
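A minimal sketch of the entropy-based unknown rejection with temperature scaling described above; the temperature, threshold, and toy logits are assumptions for illustration, not the paper's calibrated values.

```python
# Minimal sketch of entropy-based unknown rejection with temperature scaling,
# in the spirit of the described open-set framework. Values are assumptions.
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reject_unknowns(logits, temperature=2.0, entropy_threshold=1.0):
    """Flag detections whose class distribution is too uncertain (high entropy)."""
    p = softmax(logits, temperature)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return entropy > entropy_threshold  # True -> treat as unknown / reject

logits = np.array([[8.0, 0.5, 0.2],    # confident known detection
                   [1.1, 1.0, 0.9]])   # ambiguous detection -> likely unknown
print(reject_unknowns(logits))
```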
☆ Modality-Agnostic Input Channels Enable Segmentation of Brain lesions in Multimodal MRI with Sequences Unavailable During Training MICCAI 2025
Segmentation models are important tools for the detection and analysis of
lesions in brain MRI. Depending on the type of brain pathology that is imaged,
MRI scanners can acquire multiple, different image modalities (contrasts). Most
segmentation models for multimodal brain MRI are restricted to fixed modalities
and cannot effectively process new ones at inference. Some models generalize to
unseen modalities but may lose discriminative modality-specific information.
This work aims to develop a model that can perform inference on data that
contain image modalities unseen during training, previously seen modalities,
and heterogeneous combinations of both, thus allowing a user to utilize any
available imaging modalities. We demonstrate this is possible with a simple,
thus practical alteration to the U-net architecture, by integrating a
modality-agnostic input channel or pathway, alongside modality-specific input
channels. To train this modality-agnostic component, we develop an image
augmentation scheme that synthesizes artificial MRI modalities. Augmentations
differentially alter the appearance of pathological and healthy brain tissue to
create artificial contrasts between them while maintaining realistic anatomical
integrity. We evaluate the method using 8 MRI databases that include 5 types of
pathologies (stroke, tumours, traumatic brain injury, multiple sclerosis and
white matter hyperintensities) and 8 modalities (T1, T1+contrast, T2, PD, SWI,
DWI, ADC and FLAIR). The results demonstrate that the approach preserves the
ability to effectively process MRI modalities encountered during training,
while being able to process new, unseen modalities to improve its segmentation.
Project code: https://github.com/Anthony-P-Addison/AGN-MOD-SEG
comment: Accepted to MICCAI 2025, for the following workshop: ML-CDS 2025:
Multimodal Learning and Fusion Across Scales for Clinical Decision Support
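A minimal sketch of the general idea of pairing modality-specific input channels with a shared modality-agnostic channel in a 3D U-Net-style input stem. The routing rule, channel counts, and modality names are assumptions and do not reproduce the authors' exact architecture.

```python
# Minimal sketch: known modalities go to dedicated input channels, while any
# unseen modality is routed through a shared modality-agnostic channel.
# Channel counts and routing are illustrative assumptions.
import torch
import torch.nn as nn

class FlexibleInputStem(nn.Module):
    def __init__(self, known_modalities=("T1", "T2", "FLAIR"), features=32):
        super().__init__()
        self.known = list(known_modalities)
        # one fixed slot per known modality + one shared modality-agnostic slot
        self.conv = nn.Conv3d(len(self.known) + 1, features, kernel_size=3, padding=1)

    def forward(self, volumes):
        """volumes: dict mapping modality name -> (B, 1, D, H, W) tensor."""
        ref = next(iter(volumes.values()))
        slots = [torch.zeros_like(ref) for _ in range(len(self.known) + 1)]
        for name, vol in volumes.items():
            if name in self.known:
                slots[self.known.index(name)] = vol          # modality-specific channel
            else:
                slots[-1] = slots[-1] + vol                  # modality-agnostic channel
        return self.conv(torch.cat(slots, dim=1))

stem = FlexibleInputStem()
x = {"T1": torch.randn(1, 1, 16, 32, 32), "SWI": torch.randn(1, 1, 16, 32, 32)}
print(stem(x).shape)  # torch.Size([1, 32, 16, 32, 32])
```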
☆ Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
Chart understanding presents a critical test of the reasoning capabilities of
Vision-Language Models (VLMs). Prior approaches face critical limitations: some
rely on external tools, making them brittle and constrained by a predefined
toolkit, while others fine-tune specialist models that often adopt a single
reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate
steps of text-based reasoning are difficult to verify, which complicates the
use of reinforcement-learning signals that reward factual accuracy. To address
this, we propose a Code-as-Thought (CaT) approach to represent the visual
information of a chart in a verifiable, symbolic format. Our key insight is
that this strategy must be adaptive: a fixed, code-only implementation
consistently fails on complex charts where symbolic representation is
unsuitable. This finding leads us to introduce Visual Programmability: a
learnable property that determines if a chart-question pair is better solved
with code or direct visual analysis. We implement this concept in an adaptive
framework where a VLM learns to choose between the CaT pathway and a direct
visual reasoning pathway. The selection policy of the model is trained with
reinforcement learning using a novel dual-reward system. This system combines a
data-accuracy reward to ground the model in facts and prevent numerical
hallucination, with a decision reward that teaches the model when to use each
strategy, preventing it from defaulting to a single reasoning mode. Experiments
demonstrate strong and robust performance across diverse chart-understanding
benchmarks. Our work shows that VLMs can be taught not only to reason but also
how to reason, dynamically selecting the optimal reasoning pathway for each
task.
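A minimal sketch of how the described dual-reward signal could be composed: a data-accuracy term that checks extracted chart values against ground truth, plus a decision term that rewards the pathway choice (code versus direct visual reasoning) leading to a correct answer. The weights and tolerance are illustrative assumptions, not the paper's reward design.

```python
# Minimal sketch of a dual reward: data accuracy + pathway decision.
# Weights and tolerance are illustrative assumptions.
def data_accuracy_reward(extracted, ground_truth, tol=1e-2):
    """Fraction of chart values reproduced within a relative tolerance."""
    if not ground_truth:
        return 0.0
    hits = sum(abs(e - g) <= tol * max(abs(g), 1.0)
               for e, g in zip(extracted, ground_truth))
    return hits / len(ground_truth)

def decision_reward(chose_code, code_correct, visual_correct):
    """Reward picking the pathway that actually yields a correct final answer."""
    chosen_correct = code_correct if chose_code else visual_correct
    return 1.0 if chosen_correct else -1.0

def total_reward(extracted, ground_truth, chose_code, code_correct,
                 visual_correct, w_data=0.5, w_decision=0.5):
    return (w_data * data_accuracy_reward(extracted, ground_truth)
            + w_decision * decision_reward(chose_code, code_correct, visual_correct))

print(total_reward([3.1, 7.0], [3.0, 7.0], chose_code=True,
                   code_correct=True, visual_correct=False))
```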
☆ Unified Start, Personalized End: Progressive Pruning for Efficient 3D Medical Image Segmentation
3D medical image segmentation often faces heavy resource and time
consumption, limiting its scalability and rapid deployment in clinical
environments. Existing efficient segmentation models are typically static and
manually designed prior to training, which restricts their adaptability across
diverse tasks and makes it difficult to balance performance with resource
efficiency. In this paper, we propose PSP-Seg, a progressive pruning framework
that enables dynamic and efficient 3D segmentation. PSP-Seg begins with a
redundant model and iteratively prunes redundant modules through a combination
of block-wise pruning and a functional decoupling loss. We evaluate PSP-Seg on
five public datasets, benchmarking it against seven state-of-the-art models and
six efficient segmentation models. Results demonstrate that the lightweight
variant, PSP-Seg-S, achieves performance on par with nnU-Net while reducing GPU
memory usage by 42-45%, training time by 29-48%, and parameter count by 83-87%
across all datasets. These findings underscore PSP-Seg's potential as a
cost-effective yet high-performing alternative for widespread clinical
application.
comment: 15 pages, 8 figures
☆ DATE: Dynamic Absolute Time Enhancement for Long Video Understanding
Long video understanding remains a fundamental challenge for multimodal large
language models (MLLMs), particularly in tasks requiring precise temporal
reasoning and event localization. Existing approaches typically adopt uniform
frame sampling and rely on implicit position encodings to model temporal order.
However, these methods struggle with long-range dependencies, leading to
critical information loss and degraded temporal comprehension. In this paper,
we propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal
awareness in MLLMs through the Timestamp Injection Mechanism (TIM) and a
semantically guided Temporal-Aware Similarity Sampling (TASS) strategy.
Specifically, we interleave video frame embeddings with textual timestamp
tokens to construct a continuous temporal reference system. We further
reformulate the video sampling problem as a vision-language retrieval task and
introduce a two-stage algorithm to ensure both semantic relevance and temporal
coverage: enriching each query into a descriptive caption to better align with
the vision features, and sampling key events with a similarity-driven,
temporally regularized greedy strategy. Our method achieves remarkable improvements in
absolute time understanding and key event localization, resulting in
state-of-the-art performance among 7B and 72B models on hour-long video
benchmarks. Particularly, our 7B model even exceeds many 72B models on some
benchmarks.
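A minimal sketch of the timestamp-interleaving idea behind the Timestamp Injection Mechanism: frame embeddings are interleaved with embedded timestamp strings to form a continuous temporal reference. The tokenizer stand-in, timestamp format, and dimensions are assumptions.

```python
# Minimal sketch of interleaving frame embeddings with textual timestamp tokens.
# The embedder, timestamp format, and dimensions are illustrative assumptions.
import torch

def interleave_with_timestamps(frame_embeds, timestamps_sec, embed_fn):
    """frame_embeds: (N, D) tensor; timestamps_sec: list of N floats.
    embed_fn maps a text string to a (T, D) tensor of token embeddings."""
    pieces = []
    for emb, t in zip(frame_embeds, timestamps_sec):
        ts_text = f"<time {int(t // 60):02d}:{int(t % 60):02d}>"
        pieces.append(embed_fn(ts_text))       # timestamp tokens first
        pieces.append(emb.unsqueeze(0))        # then the frame embedding
    return torch.cat(pieces, dim=0)            # continuous temporal reference

# Toy usage with a dummy text embedder
D = 16
dummy_embed = lambda text: torch.zeros(3, D)   # stands in for the LLM tokenizer + embedding
frames = torch.randn(4, D)
seq = interleave_with_timestamps(frames, [0.0, 30.0, 65.0, 125.5], dummy_embed)
print(seq.shape)  # (4 * (3 + 1), 16)
```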
☆ Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin, Jinrong Yang, Qi Yong H. Ai, Lun M. Wong, Hao Tang, Kuo Feng Hung
Recent advances in large vision-language models (LVLMs) have demonstrated
strong performance on general-purpose medical tasks. However, their
effectiveness in specialized domains such as dentistry remains underexplored.
In particular, panoramic X-rays, a widely used imaging modality in oral
radiology, pose interpretative challenges due to dense anatomical structures
and subtle pathological cues, which are not captured by existing medical
benchmarks or instruction datasets. To this end, we introduce MMOral, the first
large-scale multimodal instruction dataset and benchmark tailored for panoramic
X-ray interpretation. MMOral consists of 20,563 annotated images paired with
1.3 million instruction-following instances across diverse task types,
including attribute extraction, report generation, visual question answering,
and image-grounded dialogue. In addition, we present MMOral-Bench, a
comprehensive evaluation suite covering five key diagnostic dimensions in
dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the
best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing
significant limitations of current models in this domain. To promote the
progress of this specific domain, we also propose OralGPT, which conducts
supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated
MMOral instruction dataset. Remarkably, a single epoch of SFT yields
substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a
24.73% improvement. Both MMOral and OralGPT hold significant potential as a
critical foundation for intelligent dentistry and enable more clinically
impactful multimodal AI systems in the dental field. The dataset, model,
benchmark, and evaluation suite are available at
https://github.com/isbrycee/OralGPT.
comment: 40 pages, 26 figures, 9 tables
☆ CoAtNeXt: An Attention-Enhanced ConvNeXtV2-Transformer Hybrid Model for Gastric Tissue Classification
Background and objective: Early diagnosis of gastric diseases is crucial to
prevent fatal outcomes. Although histopathologic examination remains the
diagnostic gold standard, it is performed entirely manually, making evaluations
labor-intensive and prone to variability among pathologists. Critical findings
may be missed, and lack of standard procedures reduces consistency. These
limitations highlight the need for automated, reliable, and efficient methods
for gastric tissue analysis. Methods: In this study, a novel hybrid model named
CoAtNeXt was proposed for the classification of gastric tissue images. The
model is built upon the CoAtNet architecture by replacing its MBConv layers
with enhanced ConvNeXtV2 blocks. Additionally, the Convolutional Block
Attention Module (CBAM) is integrated to improve local feature extraction
through channel and spatial attention mechanisms. The architecture was scaled
to achieve a balance between computational efficiency and classification
performance. CoAtNeXt was evaluated on two publicly available datasets,
HMU-GC-HE-30K for eight-class classification and GasHisSDB for binary
classification, and was compared against ten Convolutional Neural Network (CNN)
and ten Vision Transformer (ViT) models. Results: CoAtNeXt achieved
96.47% accuracy, 96.60% precision, 96.47% recall, 96.45% F1 score, and 99.89%
AUC on HMU-GC-HE-30K. On GasHisSDB, it reached 98.29% accuracy, 98.07%
precision, 98.41% recall, 98.23% F1 score, and 99.90% AUC. It outperformed all
CNN and ViT models tested and surpassed previous studies in the literature.
Conclusion: Experimental results show that CoAtNeXt is a robust architecture
for the histopathological classification of gastric tissue images, delivering
strong performance on both binary and multiclass tasks. This highlights its
potential to assist pathologists by enhancing diagnostic accuracy and reducing workload.
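For readers unfamiliar with CBAM, the sketch below shows a standard CBAM-style block (channel attention followed by spatial attention) in PyTorch. The reduction ratio and 7x7 spatial kernel follow common CBAM defaults and are assumptions here, not necessarily the exact CoAtNeXt configuration.

```python
# Minimal PyTorch sketch of a CBAM-style block (channel + spatial attention).
# Reduction ratio and kernel size follow common CBAM defaults (assumptions).
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention: squeeze spatial dims with avg- and max-pooling
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: squeeze channels with mean and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

feat = torch.randn(2, 64, 56, 56)
print(CBAM(64)(feat).shape)  # torch.Size([2, 64, 56, 56])
```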
☆ Virtual staining for 3D X-ray histology of bone implants
Sarah C. Irvine, Christian Lucas, Diana Krüger, Bianca Guedert, Julian Moosmann, Berit Zeller-Plumhoff
Three-dimensional X-ray histology techniques offer a non-invasive alternative
to conventional 2D histology, enabling volumetric imaging of biological tissues
without the need for physical sectioning or chemical staining. However, the
inherent greyscale image contrast of X-ray tomography limits its biochemical
specificity compared to traditional histological stains. Within digital
pathology, deep learning-based virtual staining has demonstrated utility in
simulating stained appearances from label-free optical images. In this study,
we extend virtual staining to the X-ray domain by applying cross-modality image
translation to generate artificially stained slices from
synchrotron-radiation-based micro-CT scans. Using over 50 co-registered image
pairs of micro-CT and toluidine blue-stained histology from bone-implant
samples, we trained a modified CycleGAN network tailored for limited paired
data. Whole slide histology images were downsampled to match the voxel size of
the CT data, with on-the-fly data augmentation for patch-based training. The
model incorporates pixelwise supervision and greyscale consistency terms,
producing histologically realistic colour outputs while preserving
high-resolution structural detail. Our method outperformed Pix2Pix and standard
CycleGAN baselines across SSIM, PSNR, and LPIPS metrics. Once trained, the
model can be applied to full CT volumes to generate virtually stained 3D
datasets, enhancing interpretability without additional sample preparation.
While features such as new bone formation could be reproduced, some
variability in the depiction of implant degradation layers highlights the need
for further training data and refinement. This work introduces virtual staining
to 3D X-ray imaging and offers a scalable route for chemically informative,
label-free tissue characterisation in biomedical research.
☆ Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
In-context learning (ICL) offers a promising paradigm for universal medical
image analysis, enabling models to perform diverse image processing tasks
without retraining. However, current ICL models for medical imaging remain
limited in two critical aspects: they cannot simultaneously achieve
high-fidelity predictions and global anatomical understanding, and there is no
unified model trained across diverse medical imaging tasks (e.g., segmentation
and enhancement) and anatomical regions. As a result, the full potential of ICL
in medical imaging remains underexplored. Thus, we present \textbf{Medverse}, a
universal ICL model for 3D medical imaging, trained on 22 datasets covering
diverse tasks in universal image segmentation, transformation, and enhancement
across multiple organs, imaging modalities, and clinical centers. Medverse
employs a next-scale autoregressive in-context learning framework that
progressively refines predictions from coarse to fine, generating consistent,
full-resolution volumetric outputs and enabling multi-scale anatomical
awareness. We further propose a blockwise cross-attention module that
facilitates long-range interactions between context and target inputs while
preserving computational efficiency through spatial sparsity. Medverse is
extensively evaluated on a broad collection of held-out datasets covering
previously unseen clinical centers, organs, species, and imaging modalities.
Results demonstrate that Medverse substantially outperforms existing ICL
baselines and establishes a novel paradigm for in-context learning. Code and
model weights are publicly available at https://github.com/jiesihu/Medverse.
☆ Dynamic Structural Recovery Parameters Enhance Prediction of Visual Outcomes After Macular Hole Surgery
Yinzheng Zhao, Zhihao Zhao, Rundong Jiang, Louisa Sackewitz, Quanmin Liang, Mathias Maier, Daniel Zapp, Peter Charbel Issa, Mohammad Ali Nasseri
Purpose: To introduce novel dynamic structural parameters and evaluate their
integration within a multimodal deep learning (DL) framework for predicting
postoperative visual recovery in idiopathic full-thickness macular hole (iFTMH)
patients. Methods: We utilized a publicly available longitudinal OCT dataset at
five stages (preoperative, 2 weeks, 3 months, 6 months, and 12 months). A
stage-specific segmentation model delineated the relevant structures, and an automated
pipeline extracted quantitative, composite, qualitative, and dynamic features.
Binary logistic regression models, constructed with and without dynamic
parameters, assessed their incremental predictive value for best-corrected
visual acuity (BCVA). A multimodal DL model combining clinical variables,
OCT-derived features, and raw OCT images was developed and benchmarked against
regression models. Results: The segmentation model achieved high accuracy
across all timepoints (mean Dice > 0.89). Univariate and multivariate analyses
identified base diameter, ellipsoid zone integrity, and macular hole area as
significant BCVA predictors (P < 0.05). Incorporating dynamic recovery rates
consistently improved logistic regression AUC, especially at the 3-month
follow-up. The multimodal DL model outperformed logistic regression, yielding
higher AUCs and overall accuracy at each stage, with differences as high as
0.12, demonstrating the complementary value of the raw image volumes and dynamic
parameters. Conclusions: Integrating dynamic parameters into the multimodal DL
model significantly enhances the accuracy of predictions. This fully automated
process therefore represents a promising clinical decision support tool for
personalized postoperative management in macular hole surgery.
comment: TVST
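A minimal sketch of the kind of regression comparison described above: binary logistic regression for visual outcome with and without dynamic recovery-rate features, compared by AUC. The feature names and synthetic data are purely illustrative.

```python
# Minimal sketch: compare logistic regression AUC with and without dynamic features.
# Feature names and synthetic data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
static = rng.normal(size=(n, 3))          # e.g. base diameter, EZ integrity, hole area
dynamic = rng.normal(size=(n, 2))         # e.g. closure rate, EZ recovery rate
y = (0.8 * static[:, 0] + 1.2 * dynamic[:, 0] + rng.normal(size=n) > 0).astype(int)

for name, X in [("static only", static),
                ("static + dynamic", np.hstack([static, dynamic]))]:
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    auc = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```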
☆ MGTraj: Multi-Granularity Goal-Guided Human Trajectory Prediction with Recursive Refinement Network
Accurate human trajectory prediction is crucial for robotics navigation and
autonomous driving. Recent research has demonstrated that incorporating goal
guidance significantly enhances prediction accuracy by reducing uncertainty and
leveraging prior knowledge. Most goal-guided approaches decouple the prediction
task into two stages: goal prediction and subsequent trajectory completion
based on the predicted goal, which operate at extreme granularities:
coarse-grained goal prediction forecasts the overall intention, while
fine-grained trajectory completion needs to generate the positions for all
future timesteps. The potential utility of intermediate temporal granularity
remains largely unexplored, which motivates multi-granularity trajectory
modeling. While prior work has shown that multi-granularity representations
capture diverse scales of human dynamics and motion patterns, effectively
integrating this concept into goal-guided frameworks remains challenging. In
this paper, we propose MGTraj, a novel Multi-Granularity goal-guided model for
human Trajectory prediction. MGTraj recursively encodes trajectory proposals
from coarse to fine granularity levels. At each level, a transformer-based
recursive refinement network (RRN) captures features and predicts progressive
refinements. Features across different granularities are integrated using a
weight-sharing strategy, and velocity prediction is employed as an auxiliary
task to further enhance performance. Comprehensive experimental results on the
ETH/UCY and Stanford Drone datasets indicate that MGTraj outperforms baseline
methods and achieves state-of-the-art performance among goal-guided methods.
☆ Breaking the Statistical Similarity Trap in Extreme Convection Detection
Current evaluation metrics for deep learning weather models create a
"Statistical Similarity Trap", rewarding blurry predictions while missing rare,
high-impact events. We provide quantitative evidence of this trap, showing
sophisticated baselines achieve 97.9% correlation yet 0.00 CSI for dangerous
convection detection. We introduce DART (Dual Architecture for Regression
Tasks), a framework addressing the challenge of transforming coarse atmospheric
forecasts into high-resolution satellite brightness temperature fields
optimized for extreme convection detection (below 220 K). DART employs
dual-decoder architecture with explicit background/extreme decomposition,
physically motivated oversampling, and task-specific loss functions. We present
four key findings: (1) empirical validation of the Statistical Similarity Trap
across multiple sophisticated baselines; (2) the "IVT Paradox", removing
Integrated Water Vapor Transport, widely regarded as essential for atmospheric
river analysis, improves extreme convection detection by 270%; (3)
architectural necessity demonstrated through operational flexibility (DART
achieves CSI = 0.273 with bias = 2.52 vs. 6.72 for baselines at equivalent
CSI), and (4) real-world validation with the August 2023 Chittagong flooding
disaster as a case study. To our knowledge, this is the first work to
systematically address this hybrid conversion-segmentation-downscaling task,
with no direct prior benchmarks identified in existing literature. Our
validation against diverse statistical and deep learning baselines sufficiently
demonstrates DART's specialized design. The framework enables precise
operational calibration through beta-tuning, trains in under 10 minutes on
standard hardware, and integrates seamlessly with existing meteorological
workflows, demonstrating a pathway toward trustworthy AI for extreme weather
preparedness.
comment: 43 pages, 7 figures
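For reference, the sketch below computes the Critical Success Index (CSI) and frequency bias used above for extreme convection detection by thresholding brightness temperature below 220 K; the threshold and synthetic fields are assumptions used only to show the metric.

```python
# Minimal sketch of CSI and frequency bias for extreme convection detection.
# Threshold and example fields are illustrative assumptions.
import numpy as np

def csi_and_bias(pred_tb, true_tb, threshold=220.0):
    """CSI = hits / (hits + misses + false alarms); bias = (hits + FA) / (hits + misses)."""
    pred_event = pred_tb < threshold
    true_event = true_tb < threshold
    hits = np.sum(pred_event & true_event)
    misses = np.sum(~pred_event & true_event)
    false_alarms = np.sum(pred_event & ~true_event)
    csi = hits / (hits + misses + false_alarms + 1e-9)
    bias = (hits + false_alarms) / (hits + misses + 1e-9)
    return csi, bias

rng = np.random.default_rng(1)
truth = rng.uniform(190, 300, size=(256, 256))      # synthetic brightness temperatures (K)
pred = truth + rng.normal(0, 8, size=truth.shape)   # a noisy "forecast"
print(csi_and_bias(pred, truth))
```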
☆ VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results ICCV
Hanwei Zhu, Haoning Wu, Zicheng Zhang, Lingyu Zhu, Yixuan Li, Peilin Chen, Shiqi Wang, Chris Wei Zhou, Linhan Cao, Wei Sun, Xiangyang Zhu, Weixia Zhang, Yucheng Zhu, Jing Liu, Dandan Zhu, Guangtao Zhai, Xiongkuo Min, Zhichao Zhang, Xinyue Li, Shubo Xu, Anh Dao, Yifan Li, Hongyuan Yu, Jiaojiao Yi, Yiding Tian, Yupeng Wu, Feiran Sun, Lijuan Liao, Song Jiang
This paper presents a summary of the VQualA 2025 Challenge on Visual Quality
Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025
Workshop on Visual Quality Assessment. The challenge aims to evaluate and
enhance the ability of state-of-the-art LMMs to perform open-ended and detailed
reasoning about visual quality differences across multiple images. To this end,
the competition introduces a novel benchmark comprising thousands of
coarse-to-fine grained visual quality comparison tasks, spanning single images,
pairs, and multi-image groups. Each task requires models to provide accurate
quality judgments. The competition emphasizes holistic evaluation protocols,
including 2AFC-based binary preference and multi-choice questions (MCQs).
Around 100 participants submitted entries, with five models demonstrating the
emerging capabilities of instruction-tuned LMMs on quality assessment. This
challenge marks a significant step toward open-domain visual quality reasoning
and comparison and serves as a catalyst for future research on interpretable
and human-aligned quality evaluation systems.
comment: ICCV VQualA Workshop 2025
☆ Dark-ISP: Enhancing RAW Image Processing for Low-Light Object Detection
Low-light object detection is crucial for many real-world applications but
remains challenging due to degraded image quality. While recent studies have
shown that RAW images offer superior potential over RGB images, existing
approaches either use RAW-RGB images with information loss or employ complex
frameworks. To address these, we propose a lightweight and self-adaptive Image
Signal Processing (ISP) plugin, Dark-ISP, which directly processes Bayer RAW
images in dark environments, enabling seamless end-to-end training for object
detection. Our key innovations are: (1) We deconstruct conventional ISP
pipelines into sequential linear (sensor calibration) and nonlinear (tone
mapping) sub-modules, recasting them as differentiable components optimized
through task-driven losses. Each module is equipped with content-aware
adaptability and physics-informed priors, enabling automatic RAW-to-RGB
conversion aligned with detection objectives. (2) By exploiting the ISP
pipeline's intrinsic cascade structure, we devise a Self-Boost mechanism that
facilitates cooperation between sub-modules. Through extensive experiments on
three RAW image datasets, we demonstrate that our method outperforms
state-of-the-art RGB- and RAW-based detection approaches, achieving superior
results with minimal parameters in challenging low-light environments.
comment: 11 pages, 6 figures, conference
☆ Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios ICCV2025
With the rapid advancement of generative models, highly realistic image
synthesis has posed new challenges to digital security and media credibility.
Although AI-generated image detection methods have partially addressed these
concerns, a substantial research gap remains in evaluating their performance
under complex real-world conditions. This paper introduces the Real-World
Robustness Dataset (RRDataset) for comprehensive evaluation of detection models
across three dimensions: 1) Scenario Generalization: RRDataset encompasses
high-quality images from seven major scenarios (War and Conflict, Disasters and
Accidents, Political and Social Events, Medical and Public Health, Culture and
Religion, Labor and Production, and Everyday Life), addressing existing dataset
gaps from a content perspective. 2) Internet Transmission Robustness: examining
detector performance on images that have undergone multiple rounds of sharing
across various social media platforms. 3) Re-digitization Robustness: assessing
model effectiveness on images altered through four distinct re-digitization
methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on
RRDataset and conducted a large-scale human study involving 192 participants to
investigate human few-shot learning capabilities in detecting AI-generated
images. The benchmarking results reveal the limitations of current AI detection
methods under real-world conditions and underscore the importance of drawing on
human adaptability to develop more robust detection algorithms.
comment: ICCV2025
☆ Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication IEEE
Large-scale transformer models have emerged as a powerful tool for semantic
communication systems, enabling edge devices to extract rich representations
for robust inference across noisy wireless channels. However, their substantial
computational demands remain a major barrier to practical deployment in
resource-constrained 6G networks. In this paper, we present a training-free
framework for adaptive token merging in pretrained vision transformers to
jointly reduce inference time and transmission resource usage. We formulate the
selection of per-layer merging proportions as a multi-objective optimization
problem to balance accuracy and computational cost. We employ Gaussian
process-based Bayesian optimization to construct a Pareto frontier of optimal
configurations, enabling flexible runtime adaptation to dynamic application
requirements and channel conditions. Extensive experiments demonstrate that our
method consistently outperforms other baselines and achieves significant
reductions in floating-point operations while maintaining competitive accuracy
across a wide range of signal-to-noise ratio (SNR) conditions. Additional
results highlight the effectiveness of adaptive policies that adjust merging
aggressiveness in response to channel quality, providing a practical mechanism
to trade off latency and semantic fidelity on demand. These findings establish
a scalable and efficient approach for deploying transformer-based semantic
communication in future edge intelligence systems.
comment: To appear in IEEE Globecom 2025
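A minimal sketch of constructing a Pareto frontier over per-layer token-merging proportions. The paper employs Gaussian-process Bayesian optimization; here plain random search stands in for it, and evaluate() is a placeholder for measuring accuracy and compute cost.

```python
# Minimal sketch: sample per-layer merge ratios, score them, keep the Pareto front.
# evaluate() is a stand-in; random search substitutes for the paper's GP-based BO.
import numpy as np

def evaluate(ratios):
    """Placeholder: returns (accuracy, gflops) for a given per-layer merge schedule."""
    acc = 0.80 - 0.15 * np.mean(ratios) ** 2 + 0.01 * np.random.randn()
    gflops = 17.6 * (1.0 - 0.6 * np.mean(ratios))
    return acc, gflops

def pareto_front(points):
    """Keep configurations not dominated in (higher accuracy, lower cost)."""
    front = []
    for i, (acc_i, cost_i) in enumerate(points):
        dominated = any(acc_j >= acc_i and cost_j <= cost_i and j != i
                        for j, (acc_j, cost_j) in enumerate(points))
        if not dominated:
            front.append(i)
    return front

num_layers = 12
candidates = [np.random.uniform(0.0, 0.5, size=num_layers) for _ in range(64)]
scores = [evaluate(r) for r in candidates]
for idx in pareto_front(scores):
    acc, gflops = scores[idx]
    print(f"config {idx}: accuracy={acc:.3f}, GFLOPs={gflops:.1f}")
```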
☆ CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution
Hyperspectral remote sensing technology has significant application value in
fields such as forestry ecology and precision agriculture, while also putting
forward higher requirements for fine ground object classification. However,
although hyperspectral images are rich in spectral information and can improve
recognition accuracy, they tend to cause prominent feature redundancy due to
their numerous bands, high dimensionality, and spectral mixing characteristics.
To address this, this study used hyperspectral images from the ZY1F satellite
as a data source and selected Yugan County, Shangrao City, Jiangxi Province as
the research area to perform ground object classification research. A
classification framework named CWSSNet was proposed, which integrates 3D
spectral-spatial features and wavelet convolution. This framework integrates
multimodal information using a multiscale convolutional attention module and
breaks through the classification performance bottleneck of traditional methods
by introducing multi-band decomposition and convolution operations in the
wavelet domain. The experiments showed that CWSSNet achieved 74.50\%, 82.73\%,
and 84.94\% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and
mean F1-score (mF1) respectively in Yugan County. It also obtained the highest
Intersection over Union (IoU) in the classification of water bodies,
vegetation, and bare land, demonstrating good robustness. Additionally, when
the training set proportion was 70\%, the increase in training time was
limited, and the classification effect was close to the optimal level,
indicating that the model maintains reliable performance under small-sample
training conditions.
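A minimal sketch of the wavelet-domain convolution idea: each spectral band is decomposed with a 2D DWT and the stacked sub-bands are fed to a convolution layer. The Haar wavelet, PyWavelets decomposition, and tiny conv head are assumptions, not the CWSSNet architecture itself.

```python
# Minimal sketch: 2D DWT per spectral band, then convolution over stacked sub-bands.
# Wavelet choice and the conv head are illustrative assumptions.
import numpy as np
import pywt
import torch
import torch.nn as nn

def wavelet_subbands(band, wavelet="haar"):
    """Return the LL, LH, HL, HH sub-bands of one band as a (4, H/2, W/2) array."""
    cA, (cH, cV, cD) = pywt.dwt2(band, wavelet)
    return np.stack([cA, cH, cV, cD], axis=0)

hsi = np.random.rand(32, 64, 64).astype(np.float32)       # (bands, H, W) toy hyperspectral patch
subbands = np.concatenate([wavelet_subbands(b) for b in hsi], axis=0)  # (bands*4, H/2, W/2)

x = torch.from_numpy(subbands).unsqueeze(0)                # (1, bands*4, H/2, W/2)
conv = nn.Conv2d(in_channels=subbands.shape[0], out_channels=64, kernel_size=3, padding=1)
print(conv(x).shape)  # torch.Size([1, 64, 32, 32])
```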
☆ A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering IEEE
Knowledge-based visual question answering (KB-VQA) requires a model to
understand images and utilize external knowledge to provide accurate answers.
Existing approaches often directly augment models with retrieved information
from knowledge sources while ignoring substantial knowledge redundancy, which
introduces noise into the answering process. To address this, we propose a
training-free framework with knowledge focusing for KB-VQA, that mitigates the
impact of noise by enhancing knowledge relevance and reducing redundancy.
First, for knowledge retrieval, our framework distills the essential parts of
the image-question pairs, creating low-noise queries that enhance the retrieval
of highly relevant knowledge. Considering that redundancy still persists in the
retrieved knowledge, we then prompt large models to identify and extract
answer-beneficial segments from knowledge. In addition, we introduce a
selective knowledge integration strategy, allowing the model to incorporate
knowledge only when it lacks confidence in answering the question, thereby
mitigating the influence of redundant information. Our framework enables the
acquisition of accurate and critical knowledge, and extensive experiments
demonstrate that it outperforms state-of-the-art methods.
comment: Accepted by the IEEE International Conference on Multimedia and Expo
(ICME 2025) for oral presentation. \copyright\ 2025 IEEE. Personal use of
this material is permitted. Permission from IEEE must be obtained for all
other uses
☆ RT-DETR++ for UAV Object Detection
Object detection in unmanned aerial vehicle (UAV) imagery presents
significant challenges. Issues such as densely packed small objects, scale
variations, and occlusion are commonplace. This paper introduces RT-DETR++,
which enhances the encoder component of the RT-DETR model. Our improvements
focus on two key aspects. First, we introduce a channel-gated attention-based
upsampling/downsampling (AU/AD) mechanism. This dual-path system minimizes
errors and preserves details during feature layer propagation. Second, we
incorporate CSP-PAC during feature fusion. This technique employs parallel
dilated (atrous) convolutions to process local and contextual information
within the same layer, facilitating the integration of multi-scale features. Evaluation
demonstrates that our novel neck design achieves superior performance in
detecting small and densely packed objects. The model maintains sufficient
speed for real-time detection without increasing computational complexity. This
study provides an effective approach for feature encoding design in real-time
detection systems.
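A minimal sketch of a block with parallel dilated (atrous) convolutions fused with a shortcut path, in the spirit of the CSP-PAC module described above; the dilation rates and fusion layer are assumptions rather than the exact RT-DETR++ design.

```python
# Minimal sketch: parallel dilated convolutions fused with the input path.
# Dilation rates and fusion are illustrative assumptions.
import torch
import torch.nn as nn

class ParallelDilatedBlock(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(channels * (len(dilations) + 1), channels, 1, bias=False)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        # Run dilated branches in parallel and fuse with the untouched (shortcut) path
        feats = [x] + [branch(x) for branch in self.branches]
        return self.act(self.fuse(torch.cat(feats, dim=1)))

x = torch.randn(1, 128, 40, 40)
print(ParallelDilatedBlock(128)(x).shape)  # torch.Size([1, 128, 40, 40])
```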
☆ Mind Meets Space: Rethinking Agentic Spatial Intelligence from a Neuroscience-inspired Perspective
Bui Duc Manh, Soumyaratna Debnath, Zetong Zhang, Shriram Damodaran, Arvind Kumar, Yueyi Zhang, Lu Mi, Erik Cambria, Lin Wang
Recent advances in agentic AI have led to systems capable of autonomous task
execution and language-based reasoning, yet their spatial reasoning abilities
remain limited and underexplored, largely constrained to symbolic and
sequential processing. In contrast, human spatial intelligence, rooted in
integrated multisensory perception, spatial memory, and cognitive maps, enables
flexible, context-aware decision-making in unstructured environments.
Therefore, bridging this gap is critical for advancing Agentic Spatial
Intelligence toward better interaction with the physical 3D world. To this end,
we first start from scrutinizing the spatial neural models as studied in
computational neuroscience, and accordingly introduce a novel computational
framework grounded in neuroscience principles. This framework maps core
biological functions to six essential computation modules: bio-inspired
multimodal sensing, multi-sensory integration, egocentric-allocentric
conversion, an artificial cognitive map, spatial memory, and spatial reasoning.
Together, these modules form a perspective landscape for agentic spatial
reasoning capability across both virtual and physical environments. On top, we
conduct a framework-guided analysis of recent methods, evaluating their
relevance to each module and identifying critical gaps that hinder the
development of more neuroscience-grounded spatial reasoning modules. We further
examine emerging benchmarks and datasets and explore potential application
domains ranging from virtual to embodied systems, such as robotics. Finally, we
outline potential research directions, emphasizing the promising roadmap that
can generalize spatial reasoning across dynamic or unstructured environments.
We hope this work will benefit the research community with a
neuroscience-grounded perspective and a structured pathway. Our project page
can be found on GitHub.
comment: 54 pages, journal
☆ OCELOT 2023: Cell Detection from Cell-Tissue Interaction Challenge
JaeWoong Shin, Jeongun Ryu, Aaron Valero Puche, Jinhee Lee, Biagio Brattoli, Wonkyung Jung, Soo Ick Cho, Kyunghyun Paeng, Chan-Young Ock, Donggeun Yoo, Zhaoyang Li, Wangkai Li, Huayu Mai, Joshua Millward, Zhen He, Aiden Nibali, Lydia Anette Schoenpflug, Viktor Hendrik Koelzer, Xu Shuoyu, Ji Zheng, Hu Bin, Yu-Wen Lo, Ching-Hui Yang, Sérgio Pereira
Pathologists routinely alternate between different magnifications when
examining Whole-Slide Images, allowing them to evaluate both broad tissue
morphology and intricate cellular details to form comprehensive diagnoses.
However, existing deep learning-based cell detection models struggle to
replicate these behaviors and learn the interdependent semantics between
structures at different magnifications. A key barrier in the field is the lack
of datasets with multi-scale overlapping cell and tissue annotations. The
OCELOT 2023 challenge was initiated to gather insights from the community to
validate the hypothesis that understanding cell and tissue (cell-tissue)
interactions is crucial for achieving human-level performance, and to
accelerate the research in this field. The challenge dataset includes
overlapping cell detection and tissue segmentation annotations from six organs,
comprising 673 pairs sourced from 306 The Cancer Genome Atlas (TCGA)
Whole-Slide Images with hematoxylin and eosin staining, divided into training,
validation, and test subsets. Participants presented models that significantly
enhanced the understanding of cell-tissue relationships. Top entries achieved
up to a 7.99 increase in F1-score on the test set compared to the baseline
cell-only model that did not incorporate cell-tissue relationships. This is a
substantial improvement in performance over traditional cell-only detection
methods, demonstrating the need for incorporating multi-scale semantics into
the models. This paper provides a comparative analysis of the methods used by
participants, highlighting innovative strategies implemented in the OCELOT 2023
challenge.
comment: This is the accepted manuscript of an article published in Medical
Image Analysis (Elsevier). The final version is available at:
https://doi.org/10.1016/j.media.2025.103751
☆ Video Understanding by Design: How Datasets Shape Architectures and Insights
Video understanding has advanced rapidly, fueled by increasingly complex
datasets and powerful architectures. Yet existing surveys largely classify
models by task or family, overlooking the structural pressures through which
datasets guide architectural evolution. This survey is the first to adopt a
dataset-driven perspective, showing how motion complexity, temporal span,
hierarchical composition, and multimodal richness impose inductive biases that
models should encode. We reinterpret milestones, from two-stream and 3D CNNs to
sequential, transformer, and multimodal foundation models, as concrete
responses to these dataset-driven pressures. Building on this synthesis, we
offer practical guidance for aligning model design with dataset invariances
while balancing scalability and task demands. By unifying datasets, inductive
biases, and architectures into a coherent framework, this survey provides both
a comprehensive retrospective and a prescriptive roadmap for advancing
general-purpose video understanding.
comment: Research report
☆ Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation ICCV 2025
This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric
for 3D scenes that explicitly focuses on "objects," which are fundamental units
of human visual perception. Existing metrics assess overall image quality,
leading to discrepancies with human perception. Inspired by neuropsychological
insights, we hypothesize that human recognition of 3D scenes fundamentally
involves attention to individual objects. OSIM enables object-centric
evaluations by leveraging an object detection model and its feature
representations to quantify the "objectness" of each object in the scene. Our
user study demonstrates that OSIM aligns more closely with human perception
compared to existing metrics. We also analyze the characteristics of OSIM using
various approaches. Moreover, we re-evaluate recent 3D reconstruction and
generation models under a standardized experimental setup to clarify
advancements in this field. The code is available at
https://github.com/Objectness-Similarity/OSIM.
comment: Accepted by the ICCV 2025 UniLight Workshop
☆ Noise-Robust Topology Estimation of 2D Image Data via Neural Networks and Persistent Homology
Persistent Homology (PH) and Artificial Neural Networks (ANNs) offer
contrasting approaches to inferring topological structure from data. In this
study, we examine the noise robustness of a supervised neural network trained
to predict Betti numbers in 2D binary images. We compare an ANN approach
against a PH pipeline based on cubical complexes and the Signed Euclidean
Distance Transform (SEDT), which is a widely adopted strategy for noise-robust
topological analysis. Using one synthetic and two real-world datasets, we show
that ANNs can outperform this PH approach under noise, likely due to their
capacity to learn contextual and geometric priors from training data. Though
still emerging, the use of ANNs for topology estimation offers a compelling
alternative to PH under structural noise.
comment: 12 pages
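For context, the Signed Euclidean Distance Transform used in the persistent-homology pipeline can be sketched as follows: positive distances outside the foreground and negative distances inside. The toy image is an illustrative assumption.

```python
# Minimal sketch of the Signed Euclidean Distance Transform (SEDT).
# The toy binary image is an illustrative assumption.
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_edt(binary_image):
    """binary_image: 2D array of 0/1; returns a signed distance field."""
    fg = binary_image.astype(bool)
    outside = distance_transform_edt(~fg)   # distance to the foreground
    inside = distance_transform_edt(fg)     # distance to the background
    return outside - inside

img = np.zeros((64, 64), dtype=np.uint8)
img[20:44, 20:44] = 1                        # a filled square
sedt = signed_edt(img)
print(sedt.min(), sedt.max())                # negative inside, positive outside
```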
☆ ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain
Building large-scale foundation model for PET imaging is hindered by limited
access to labeled data and insufficient computational resources. To overcome
data scarcity and efficiency limitations, we propose ALL-PET, a low-resource,
low-shot PET foundation model operating directly in the projection domain.
ALL-PET leverages a latent diffusion model (LDM) with three key innovations.
First, we design a Radon mask augmentation strategy (RMAS) that generates over
200,000 structurally diverse training samples by projecting randomized
image-domain masks into sinogram space, significantly improving generalization
with minimal data. This is extended by a dynamic multi-mask (DMM) mechanism
that varies mask quantity and distribution, enhancing data diversity without
added model complexity. Second, we implement positive/negative mask constraints
to embed strict geometric consistency, reducing parameter burden while
preserving generation quality. Third, we introduce transparent medical
attention (TMA), a parameter-free, geometry-driven mechanism that enhances
lesion-related regions in raw projection data. Lesion-focused attention maps
are derived from coarse segmentation, covering both hypermetabolic and
hypometabolic areas, and projected into sinogram space for physically
consistent guidance. The system supports clinician-defined ROI adjustments,
ensuring flexible, interpretable, and task-adaptive emphasis aligned with PET
acquisition physics. Experimental results show ALL-PET achieves high-quality
sinogram generation using only 500 samples, with performance comparable to
models trained on larger datasets. ALL-PET generalizes across tasks including
low-dose reconstruction, attenuation correction, delayed-frame prediction, and
tracer separation, operating efficiently with memory use under 24GB.
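A minimal sketch of the Radon mask augmentation idea: a randomized image-domain mask is projected into sinogram space so it can modulate raw projection data. The mask shape, projection angles, and blending rule are assumptions, not the exact RMAS procedure.

```python
# Minimal sketch: project a random image-domain mask into sinogram space.
# Mask shape, angles, and blending are illustrative assumptions.
import numpy as np
from skimage.transform import radon

def random_disk_mask(size, rng):
    yy, xx = np.mgrid[:size, :size]
    cy, cx = rng.integers(size // 4, 3 * size // 4, size=2)
    r = rng.integers(size // 10, size // 4)
    return ((yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2).astype(float)

rng = np.random.default_rng(0)
size = 128
angles = np.linspace(0.0, 180.0, 180, endpoint=False)

mask = random_disk_mask(size, rng)                          # randomized image-domain mask
sino_mask = radon(mask, theta=angles, circle=False)         # its projection into sinogram space
sinogram = radon(rng.random((size, size)), theta=angles, circle=False)  # stand-in sinogram
augmented = sinogram * (1.0 - 0.5 * sino_mask / (sino_mask.max() + 1e-9))
print(augmented.shape)                                      # (projection bins, num angles)
```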
☆ Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval EMNLP2025
Although Contrastive Language-Image Pre-training (CLIP) exhibits strong
performance across diverse vision tasks, its application to person
representation learning faces two critical challenges: (i) the scarcity of
large-scale annotated vision-language data focused on person-centric images,
and (ii) the inherent limitations of global contrastive learning, which
struggles to maintain discriminative local features crucial for fine-grained
matching while remaining vulnerable to noisy text tokens. This work advances
CLIP for person representation learning through synergistic improvements in
data curation and model architecture. First, we develop a noise-resistant data
construction pipeline that leverages the in-context learning capabilities of
MLLMs to automatically filter and caption web-sourced images. This yields
WebPerson, a large-scale dataset of 5M high-quality person-centric image-text
pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking
Synergetic) framework, which improves cross-modal alignment by adaptively
masking noisy textual tokens based on the gradient-attention similarity score.
Additionally, we incorporate masked token prediction objectives that compel the
model to predict informative text tokens, enhancing fine-grained semantic
representation learning. Extensive experiments show that GA-DMS achieves
state-of-the-art performance across multiple benchmarks.
comment: Accepted by EMNLP2025 Main
☆ Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention WACV 2026
Junhao Xing, Ryohei Miyakawa, Yang Yang, Xinpeng Liu, Risa Shinoda, Hiroaki Santo, Yosuke Toda, Fumio Okura
Foundation segmentation models achieve reasonable leaf instance extraction
from top-view crop images without training (i.e., zero-shot). However,
segmenting entire plant individuals with each consisting of multiple
overlapping leaves remains challenging. This problem is referred to as a
hierarchical segmentation task, typically requiring annotated training
datasets, which are often species-specific and require notable human labor. To
address this, we introduce ZeroPlantSeg, a zero-shot segmentation for
rosette-shaped plant individuals from top-view images. We integrate a
foundation segmentation model, extracting leaf instances, and a vision-language
model, reasoning about plants' structures to extract plant individuals without
additional training. Evaluations on datasets with multiple plant species,
growth stages, and shooting environments demonstrate that our method surpasses
existing zero-shot methods and achieves better cross-domain performance than
supervised methods. Implementations are available at
https://github.com/JunhaoXing/ZeroPlantSeg.
comment: WACV 2026 accepted
☆ FPI-Det: a face--phone Interaction Dataset for phone-use detection and understanding
The widespread use of mobile devices has created new challenges for vision
systems in safety monitoring, workplace productivity assessment, and attention
management. Detecting whether a person is using a phone requires not only
object recognition but also an understanding of behavioral context, which
involves reasoning about the relationship between faces, hands, and devices
under diverse conditions. Existing generic benchmarks do not fully capture such
fine-grained human--device interactions. To address this gap, we introduce the
FPI-Det, containing 22,879 images with synchronized annotations for faces and
phones across workplace, education, transportation, and public scenarios. The
dataset features extreme scale variation, frequent occlusions, and varied
capture conditions. We evaluate representative YOLO and DETR detectors,
providing baseline results and an analysis of performance across object sizes,
occlusion levels, and environments. The source code and dataset are available at
https://github.com/KvCgRv/FPI-Det.
☆ S-BEVLoc: BEV-based Self-supervised Framework for Large-scale LiDAR Global Localization
Chenghao Zhang, Lun Luo, Si-Yuan Cao, Xiaokai Bai, Yuncheng Jin, Zhu Yu, Beinan Yu, Yisen Wang, Hui-Liang Shen
LiDAR-based global localization is an essential component of simultaneous
localization and mapping (SLAM), which helps loop closure and re-localization.
Current approaches rely on ground-truth poses obtained from GPS or SLAM
odometry to supervise network training. Despite the great success of these
supervised approaches, substantial cost and effort are required for
high-precision ground-truth pose acquisition. In this work, we propose
S-BEVLoc, a novel self-supervised framework based on bird's-eye view (BEV) for
LiDAR global localization, which eliminates the need for ground-truth poses and
is highly scalable. We construct training triplets from single BEV images by
leveraging the known geographic distances between keypoint-centered BEV
patches. Convolutional neural network (CNN) is used to extract local features,
and NetVLAD is employed to aggregate global descriptors. Moreover, we introduce
SoftCos loss to enhance learning from the generated triplets. Experimental
results on the large-scale KITTI and NCLT datasets show that S-BEVLoc achieves
state-of-the-art performance in place recognition, loop closure, and global
localization tasks, while offering scalability that supervised approaches could
only match with substantial extra labeling effort.
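To make the triplet construction concrete, the snippet below is a minimal sketch of distance-based triplet mining from keypoint-centered BEV patches; the radii, sampling rule, and function name are illustrative assumptions rather than the paper's actual procedure.

```python
# Hypothetical sketch: mine (anchor, positive, negative) patch triplets using only
# the geographic distances between patch centers, so no ground-truth poses are needed.
import numpy as np

def mine_triplets(centers, pos_radius=5.0, neg_radius=25.0, rng=None):
    """centers: (N, 2) array of keypoint-centered patch positions in meters."""
    if rng is None:
        rng = np.random.default_rng(0)
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    triplets = []
    for a in range(len(centers)):
        positives = np.where((dists[a] > 0) & (dists[a] < pos_radius))[0]
        negatives = np.where(dists[a] > neg_radius)[0]
        if len(positives) and len(negatives):
            triplets.append((a, int(rng.choice(positives)), int(rng.choice(negatives))))
    return triplets
```

Descriptors for the three patches would then be produced by the CNN-plus-NetVLAD branch and compared under the triplet-style (SoftCos) objective.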
☆ SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
Vision-Language-Action (VLA) models exhibit unprecedented capabilities for
embodied intelligence. However, their extensive computational and memory costs
hinder their practical deployment. Existing VLA compression and acceleration
approaches conduct quantization or token pruning in an ad-hoc manner but fail
to enable both for a holistic efficiency improvement due to an observed
incompatibility. This work introduces SQAP-VLA, the first structured,
training-free VLA inference acceleration framework that simultaneously enables
state-of-the-art quantization and token pruning. We overcome the
incompatibility by co-designing the quantization and token pruning pipeline,
where we propose new quantization-aware token pruning criteria that work on an
aggressively quantized model while improving the quantizer design to enhance
pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields
significant gains in computational efficiency and inference speed while
successfully preserving core model performance, achieving a 1.93$\times$
speedup and up to a 4.5\% average success-rate improvement over the
original model.
comment: 12 pages, 9 figures
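As a rough illustration of what a quantization-aware token-keeping rule can look like, the sketch below scores visual tokens on dequantized activations before retaining a fixed fraction; the scoring rule, keep ratio, and tensor shapes are assumptions, not the criteria proposed in the paper.

```python
# Minimal sketch of a quantization-aware token-keeping rule: score visual tokens on
# the dequantized activations so the criterion stays stable under aggressive quantization.
import torch

def keep_tokens(attn, tokens, scale, zero_point, keep_ratio=0.25):
    """attn: (heads, N) attention paid to each visual token by the action query.
    tokens: (N, D) int8-quantized token activations; scale/zero_point: quant params."""
    deq = (tokens.float() - zero_point) * scale          # dequantize before scoring
    score = attn.mean(0) * deq.norm(dim=-1)              # attention x feature magnitude
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep_idx = score.topk(k).indices.sort().values       # preserve original token order
    return tokens[keep_idx], keep_idx
```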
☆ IRDFusion: Iterative Relation-Map Difference guided Feature Fusion for Multispectral Object Detection
Current multispectral object detection methods often retain extraneous
background or noise during feature fusion, limiting perceptual performance. To
address this, we propose an innovative feature fusion framework based on a
cross-modal feature contrast and screening strategy, diverging from
conventional approaches. The proposed method adaptively enhances salient
structures by fusing object-aware complementary cross-modal features while
suppressing shared background interference. Our solution centers on two novel,
specially designed modules: the Mutual Feature Refinement Module (MFRM) and the
Differential Feature Feedback Module (DFFM). The MFRM enhances intra- and
inter-modal feature representations by modeling their relationships, thereby
improving cross-modal alignment and discriminative power. Inspired by feedback
differential amplifiers, the DFFM dynamically computes inter-modal differential
features as guidance signals and feeds them back to the MFRM, enabling adaptive
fusion of complementary information while suppressing common-mode noise across
modalities. To enable robust feature learning, the MFRM and DFFM are integrated
into a unified framework, which is formally formulated as an Iterative
Relation-Map Differential Guided Feature Fusion mechanism, termed IRDFusion.
IRDFusion enables high-quality cross-modal fusion by progressively amplifying
salient relational signals through iterative feedback, while suppressing
feature noise, leading to significant performance gains. In extensive
experiments on FLIR, LLVIP and M$^3$FD datasets, IRDFusion achieves
state-of-the-art performance and consistently outperforms existing methods
across diverse challenging scenarios, demonstrating its robustness and
effectiveness. Code will be available at
https://github.com/61s61min/IRDFusion.git.
comment: 31 pages, 6 figures, submitted on 3 Sep 2025
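The feedback-amplifier analogy can be illustrated with a toy iterative fusion loop like the one below; the module internals (1x1 convolutions, sigmoid gate, iteration count) are placeholders, not the actual MFRM/DFFM design.

```python
# Toy sketch of iterative differential-feedback fusion between RGB and IR features:
# the inter-modal difference is fed back as guidance, amplifying complementary cues
# while suppressing common-mode (shared background) components.
import torch
import torch.nn as nn

class DiffFeedbackFusion(nn.Module):
    def __init__(self, channels, iters=3):
        super().__init__()
        self.refine = nn.Conv2d(channels, channels, 1)   # stand-in for the refinement module
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.iters = iters

    def forward(self, f_rgb, f_ir):
        for _ in range(self.iters):
            diff = f_rgb - f_ir                          # inter-modal differential signal
            g = self.gate(diff)                          # feed the difference back as guidance
            f_rgb = self.refine(f_rgb) + g * diff        # amplify complementary information
            f_ir = self.refine(f_ir) - g * diff          # suppress common-mode noise
        return f_rgb + f_ir                              # fused feature map
```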
☆ Improvement of Human-Object Interaction Action Recognition Using Scene Information and Multi-Task Learning Approach
Recent graph convolutional neural networks (GCNs) have shown high performance
in the field of human action recognition by using human skeleton poses.
However, these methods fail to detect human-object interaction cases reliably
due to the lack of an effective representation of scene information and of
appropriate learning architectures. In this context, we propose a methodology
to improve human action recognition performance by considering fixed object information in
the environment and following a multi-task learning approach. In order to
evaluate the proposed method, we collected real data from public environments
and prepared our dataset, which includes interaction classes of hands-on fixed
objects (e.g., ATM ticketing machines, check-in/out machines, etc.) and
non-interaction classes of walking and standing. The multi-task learning
approach, along with interaction area information, succeeds in recognizing the
studied interaction and non-interaction actions with an accuracy of 99.25%,
outperforming the accuracy of the base model using only human skeleton poses by
2.75%.
☆ Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models IEEE
Understanding 3D medical image volumes is critical in the medical field, yet
existing 3D medical convolution and transformer-based self-supervised learning
(SSL) methods often lack deep semantic comprehension. Recent advancements in
multimodal large language models (MLLMs) provide a promising approach to
enhance image understanding through text descriptions. To leverage these 2D
MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a
novel pretraining framework that integrates 3D image encoders with 2D MLLMs via
a specially designed plane-slice-aware transformer module. Additionally, our
model employs a partial optimal transport based alignment, demonstrating
greater tolerance to the noise potentially present in MLLM-generated content.
Med3DInsight introduces a new paradigm for scalable multimodal 3D
medical representation learning without requiring human annotations. Extensive
experiments demonstrate our state-of-the-art performance on two downstream
tasks, i.e., segmentation and classification, across various public datasets
with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can
be seamlessly integrated into existing 3D medical image understanding networks,
potentially enhancing their performance. Our source code, generated datasets,
and pre-trained models will be available at
https://github.com/Qybc/Med3DInsight.
comment: Accepted by IEEE Journal of Biomedical and Health Informatics (JBHI)
♻ ☆ MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering
Continual Visual Question Answering (CVQA) based on pre-trained models (PTMs)
has achieved promising progress by leveraging prompt tuning to enable continual
multi-modal learning. However, most existing methods adopt cross-modal prompt
isolation, constructing visual and textual prompts separately, which
exacerbates modality imbalance and leads to degraded performance over time. To
tackle this issue, we propose MM-Prompt, a novel framework incorporating
cross-modal prompt query and cross-modal prompt recovery. The former enables
balanced prompt selection by incorporating cross-modal signals during query
formation, while the latter promotes joint prompt reconstruction through
iterative cross-modal interactions, guided by an alignment loss to prevent
representational drift. Extensive experiments show that MM-Prompt surpasses
prior approaches in accuracy and knowledge retention, while maintaining
balanced modality engagement throughout continual learning.
♻ ☆ Deep Learning Framework for Early Detection of Pancreatic Cancer Using Multi-Modal Medical Imaging Analysis
Pancreatic ductal adenocarcinoma (PDAC) remains one of the most lethal forms
of cancer, with a five-year survival rate below 10% primarily due to late
detection. This research develops and validates a deep learning framework for
early PDAC detection through analysis of dual-modality imaging:
autofluorescence and second harmonic generation (SHG). We analyzed 40 unique
patient samples to create a specialized neural network capable of
distinguishing between normal, fibrotic, and cancerous tissue. Our methodology
evaluated six distinct deep learning architectures, comparing traditional
Convolutional Neural Networks (CNNs) with modern Vision Transformers (ViTs).
Through systematic experimentation, we identified and overcame significant
challenges in medical image analysis, including limited dataset size and class
imbalance. The final optimized framework, based on a modified ResNet
architecture with frozen pre-trained layers and class-weighted training,
achieved over 90% accuracy in cancer detection. This represents a significant
improvement over current manual analysis methods and demonstrates potential for
clinical deployment. This work establishes a robust pipeline for automated PDAC
detection that can augment pathologists' capabilities while providing a
foundation for future expansion to other cancer types. The developed
methodology also offers valuable insights for applying deep learning to
limited-size medical imaging datasets, a common challenge in clinical
applications.
comment: 21 pages, 17 figures
♻ ☆ VRAE: Vertical Residual Autoencoder for License Plate Denoising and Deblurring
In real-world traffic surveillance, vehicle images captured under adverse
weather, poor lighting, or high-speed motion often suffer from severe noise and
blur. Such degradations significantly reduce the accuracy of license plate
recognition systems, especially when the plate occupies only a small region
within the full vehicle image. Restoring these degraded images in a fast,
real-time manner is thus a crucial pre-processing step to enhance recognition
performance. In this work, we propose a Vertical Residual Autoencoder (VRAE)
architecture designed for the image enhancement task in traffic surveillance.
The method incorporates an enhancement strategy that employs an auxiliary
block, which injects input-aware features at each encoding stage to guide the
representation learning process, enabling better general information
preservation throughout the network compared to conventional autoencoders.
Experiments on a vehicle image dataset with visible license plates demonstrate
that our method consistently outperforms Autoencoder (AE), Generative
Adversarial Network (GAN), and Flow-Based (FB) approaches. Compared with AE at
the same depth, it improves PSNR by about 20%, reduces NMSE by around 50%, and
enhances SSIM by 1%, while requiring only a marginal increase of roughly 1% in
parameters.
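A minimal sketch of the input-aware injection idea is given below: the raw input is resized and added at every encoding stage. Layer widths and the injection operator (addition after a 1x1 convolution) are assumptions rather than the VRAE architecture.

```python
# Sketch of an encoder whose auxiliary branch injects input-aware features at each
# encoding stage, helping preserve general image information through the network.
import torch
import torch.nn as nn

class InputAwareEncoder(nn.Module):
    def __init__(self, ch=(3, 32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Conv2d(ch[i], ch[i + 1], 3, stride=2, padding=1) for i in range(len(ch) - 1))
        self.aux = nn.ModuleList(                        # auxiliary input-aware branch
            nn.Conv2d(ch[0], ch[i + 1], 1) for i in range(len(ch) - 1))

    def forward(self, x):
        feat, inp = x, x
        for stage, aux in zip(self.stages, self.aux):
            feat = torch.relu(stage(feat))
            # resize the raw input to the current resolution and inject it
            inp_s = nn.functional.interpolate(inp, size=feat.shape[-2:], mode="bilinear",
                                              align_corners=False)
            feat = feat + aux(inp_s)
        return feat
```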
♻ ☆ Improved GUI Grounding via Iterative Narrowing
Graphical User Interface (GUI) grounding plays a crucial role in enhancing
the capabilities of Vision-Language Model (VLM) agents. While general VLMs,
such as GPT-4V, demonstrate strong performance across various tasks, their
proficiency in GUI grounding remains suboptimal. Recent studies have focused on
fine-tuning these models specifically for zero-shot GUI grounding, yielding
significant improvements over baseline performance. We introduce a visual
prompting framework that employs an iterative narrowing mechanism to further
improve the performance of both general and fine-tuned models in GUI grounding.
For evaluation, we tested our method on a comprehensive benchmark comprising
various UI platforms and provided the code to reproduce our results.
comment: Code available at
https://github.com/ant-8/GUI-Grounding-via-Iterative-Narrowing
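Conceptually, iterative narrowing can be sketched as repeated cropping around the model's current prediction, as below; `predict_point` stands in for a VLM grounding call and, like the zoom factor and step count, is a hypothetical placeholder rather than the paper's API.

```python
# Conceptual sketch of iterative narrowing: crop around the current (x, y) prediction
# and re-query the grounding model on the zoomed view, shrinking the window each step.
from PIL import Image

def iterative_narrowing(image: Image.Image, query: str, predict_point, steps=3, zoom=0.5):
    left, top, w, h = 0.0, 0.0, float(image.width), float(image.height)
    cx, cy = w / 2, h / 2
    for _ in range(steps):
        crop = image.crop((int(left), int(top), int(left + w), int(top + h)))
        nx, ny = predict_point(crop, query)              # normalized (0-1) prediction on the crop
        cx, cy = left + nx * w, top + ny * h             # map back to full-image coordinates
        w, h = w * zoom, h * zoom                        # shrink the search window
        left = min(max(cx - w / 2, 0), image.width - w)
        top = min(max(cy - h / 2, 0), image.height - h)
    return cx, cy
```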
♻ ☆ Preprocessing Algorithm Leveraging Geometric Modeling for Scale Correction in Hyperspectral Images for Improved Unmixing Performance
Praveen Sumanasekara, Athulya Ratnayake, Buddhi Wijenayake, Keshawa Ratnayake, Roshan Godaliyadda, Parakrama Ekanayake, Vijitha Herath
Spectral variability significantly impacts the accuracy and convergence of
hyperspectral unmixing algorithms. Many methods address complex spectral
variability; yet large-scale distortions to the scale of the observed pixel
signatures due to topography, illumination, and shadowing remain a major
challenge. These variations often degrade unmixing performance and complicate
model fitting. Because of this, correcting these variations can offer
significant advantages in real-world GIS applications. In this paper, we
propose a novel preprocessing algorithm that corrects scale-induced spectral
variability prior to unmixing. By estimating and correcting these distortions
to the scale of the pixel signatures, the algorithm produces pixel signatures
with minimal distortions in scale. Because these scale distortions, which
hinder many unmixing methods, are greatly reduced in the corrected output, the
abundance estimates produced by subsequent unmixing algorithms improve
significantly. We present a rigorous
mathematical framework to describe and correct for scale variability and
provide extensive experimental validation of the proposed algorithm.
Furthermore, the algorithm's impact is evaluated across a wide range of
state-of-the-art unmixing methods on two synthetic and two real hyperspectral
datasets. The proposed preprocessing step consistently improves the performance
of these algorithms, achieving error reductions of around 50%, even for
algorithms specifically designed to handle spectral variability. This
demonstrates that scale correction acts as a complementary step, facilitating
more accurate unmixing with existing methods. The algorithm's generality,
consistent impact, and significant influence highlight its potential as a key
component in practical hyperspectral unmixing pipelines. The implementation
code will be made publicly available upon publication.
comment: 20 pages, 14 figures
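As an illustration of scale correction as a preprocessing step, the sketch below rescales each pixel signature toward a reference magnitude; the simple norm-based scale estimate is only a placeholder for the paper's geometric modeling of the scale factor.

```python
# Illustrative sketch: estimate a per-pixel multiplicative scale distortion and divide
# it out, so downstream unmixing sees signatures with (approximately) consistent scale.
import numpy as np

def correct_scale(cube):
    """cube: (H, W, B) hyperspectral image. Returns the rescaled cube and the scale map."""
    norms = np.linalg.norm(cube, axis=-1, keepdims=True) + 1e-12
    reference = np.median(norms)                         # global reference magnitude
    scale = norms / reference                            # per-pixel scale distortion estimate
    return cube / scale, scale.squeeze(-1)
```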
♻ ☆ Expert-Guided Explainable Few-Shot Learning for Medical Image Diagnosis MICCAI
Medical image analysis often faces significant challenges due to limited
expert-annotated data, hindering both model generalization and clinical
adoption. We propose an expert-guided explainable few-shot learning framework
that integrates radiologist-provided regions of interest (ROIs) into model
training to simultaneously enhance classification performance and
interpretability. Leveraging Grad-CAM for spatial attention supervision, we
introduce an explanation loss based on Dice similarity to align model attention
with diagnostically relevant regions during training. This explanation loss is
jointly optimized with a standard prototypical network objective, encouraging
the model to focus on clinically meaningful features even under limited data
conditions. We evaluate our framework on two distinct datasets: BraTS (MRI) and
VinDr-CXR (Chest X-ray), achieving significant accuracy improvements from
77.09% to 83.61% on BraTS and from 54.33% to 73.29% on VinDr-CXR compared to
non-guided models. Grad-CAM visualizations further confirm that expert-guided
training consistently aligns attention with diagnostic regions, improving both
predictive reliability and clinical trustworthiness. Our findings demonstrate
the effectiveness of incorporating expert-guided attention supervision to
bridge the gap between performance and interpretability in few-shot medical
image diagnosis.
comment: Accepted for publication in the proceedings of MICCAI Workshop on
Data Engineering in Medical Imaging 2025
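A minimal sketch of a Dice-based explanation loss between a normalized Grad-CAM map and a binary expert ROI mask is shown below; how it is weighted against the prototypical-network objective is an assumption about the usual way such terms are combined.

```python
# Dice-style explanation loss: encourages the model's Grad-CAM attention to overlap
# with radiologist-provided regions of interest during few-shot training.
import torch

def dice_explanation_loss(cam, roi_mask, eps=1e-6):
    """cam: (B, H, W) attention map scaled to [0, 1]; roi_mask: (B, H, W) binary ROI."""
    inter = (cam * roi_mask).sum(dim=(1, 2))
    union = cam.sum(dim=(1, 2)) + roi_mask.sum(dim=(1, 2))
    dice = (2 * inter + eps) / (union + eps)
    return 1 - dice.mean()                               # minimize to align attention with the ROI

# total_loss = proto_loss + lam * dice_explanation_loss(cam, roi)   # lam is a tunable weight
```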
♻ ☆ GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving
End-to-end autonomous driving requires adaptive and robust handling of
complex and diverse traffic environments. However, prevalent single-mode
planning methods attempt to learn an overall policy while struggling to acquire
diversified driving skills to handle diverse scenarios. Therefore, this paper
proposes GEMINUS, a Mixture-of-Experts end-to-end autonomous driving framework
featuring a Global Expert and a Scene-Adaptive Experts Group, equipped with a
Dual-aware Router. Specifically, the Global Expert is trained on the overall
dataset, possessing robust performance. The Scene-Adaptive Experts are trained
on corresponding scene subsets, achieving adaptive performance. The Dual-aware
Router simultaneously considers scenario-level features and routing uncertainty
to dynamically activate expert modules. Through the effective coupling of the
Global Expert and the Scene-Adaptive Experts Group via the Dual-aware Router,
GEMINUS achieves both adaptability and robustness across diverse scenarios.
GEMINUS outperforms existing methods in the Bench2Drive closed-loop benchmark
and achieves state-of-the-art performance in Driving Score and Success Rate,
even with only monocular vision input. The code is available at
https://github.com/newbrains1/GEMINUS.
♻ ☆ Deep Learning-based Cross-modal Reconstruction of Vehicle Target from Sparse 3D SAR Image IEEE
Three-dimensional synthetic aperture radar (3D SAR) is an advanced active
microwave imaging technology widely utilized in remote sensing area. To achieve
high-resolution 3D imaging, 3D SAR requires observations from multiple aspects
and altitude baselines surrounding the target. However, constrained flight
trajectories often lead to sparse observations, which degrade imaging quality,
particularly for anisotropic man-made small targets, such as vehicles and
aircraft. In the past, compressive sensing (CS) was the mainstream approach for
sparse 3D SAR image reconstruction. More recently, deep learning (DL) has
emerged as a powerful alternative, markedly boosting reconstruction quality and
efficiency. However, existing DL-based methods typically rely solely on
high-quality 3D SAR images as supervisory signals to train deep neural networks
(DNNs). This unimodal learning paradigm prevents the integration of
complementary information from other data modalities, which limits
reconstruction performance and reduces target discriminability due to the
inherent constraints of electromagnetic scattering. In this paper, we introduce
cross-modal learning and propose a Cross-Modal 3D-SAR Reconstruction Network
(CMAR-Net) for enhancing sparse 3D SAR images of vehicle targets by fusing
optical information. Leveraging cross-modal supervision from 2D optical images
and error propagation guaranteed by differentiable rendering, CMAR-Net achieves
efficient training and reconstructs sparse 3D SAR images, which are derived
from highly sparse-aspect observations, into visually structured 3D vehicle
images. Trained exclusively on simulated data, CMAR-Net exhibits robust
generalization to real-world data, outperforming state-of-the-art CS and DL
methods in structural accuracy within a large-scale parking lot experiment
involving numerous civilian vehicles, thereby demonstrating its strong
practical applicability.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ SV-DRR: High-Fidelity Novel View X-Ray Synthesis Using Diffusion Model MICCAI2025
X-ray imaging is a rapid and cost-effective tool for visualizing internal
human anatomy. While multi-view X-ray imaging provides complementary
information that enhances diagnosis, intervention, and education, acquiring
images from multiple angles increases radiation exposure and complicates
clinical workflows. To address these challenges, we propose a novel
view-conditioned diffusion model for synthesizing multi-view X-ray images from
a single view. Unlike prior methods, which are limited in angular range,
resolution, and image quality, our approach leverages the Diffusion Transformer
to preserve fine details and employs a weak-to-strong training strategy for
stable high-resolution image generation. Experimental results demonstrate that
our method generates higher-resolution outputs with improved control over
viewing angles. This capability has significant implications not only for
clinical applications but also for medical education and data extension,
enabling the creation of diverse, high-quality datasets for training and
analysis. Our code is available at GitHub.
comment: Accepted by MICCAI2025
♻ ☆ 3D and 4D World Modeling: A Survey
Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, Ziwei Liu
World modeling has become a cornerstone in AI research, enabling agents to
understand, represent, and predict the dynamic environments they inhabit. While
prior work largely emphasizes generative methods for 2D image and video data,
they overlook the rapidly growing body of work that leverages native 3D and 4D
representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds
for large-scale scene modeling. At the same time, the absence of a standardized
definition and taxonomy for ``world models'' has led to fragmented and
sometimes inconsistent claims in the literature. This survey addresses these
gaps by presenting the first comprehensive review explicitly dedicated to 3D
and 4D world modeling and generation. We establish precise definitions,
introduce a structured taxonomy spanning video-based (VideoGen),
occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and
systematically summarize datasets and evaluation metrics tailored to 3D/4D
settings. We further discuss practical applications, identify open challenges,
and highlight promising research directions, aiming to provide a coherent and
foundational reference for advancing the field. A systematic summary of
existing literature is available at https://github.com/worldbench/survey
comment: Survey; 34 pages, 10 figures, 14 tables; GitHub Repo at
https://github.com/worldbench/survey
♻ ☆ Scaling Artificial Intelligence for Prostate Cancer Detection on MRI towards Organized Screening and Primary Diagnosis in a Global, Multiethnic Population (Study Protocol)
Anindo Saha, Joeran S. Bosma, Jasper J. Twilt, Alexander B. C. D. Ng, Aqua Asif, Kirti Magudia, Peder Larson, Qinglin Xie, Xiaodong Zhang, Chi Pham Minh, Samuel N. Gitau, Ivo G. Schoots, Martijn F. Boomsma, Renato Cuocolo, Nikolaos Papanikolaou, Daniele Regge, Derya Yakar, Mattijs Elschot, Jeroen Veltman, Baris Turkbey, Nancy A. Obuchowski, Jurgen J. Fütterer, Anwar R. Padhani, Hashim U. Ahmed, Tobias Nordström, Martin Eklund, Veeru Kasivisvanathan, Maarten de Rooij, Henkjan Huisman
In this intercontinental, confirmatory study, we include a retrospective
cohort of 22,481 MRI examinations (21,288 patients; 46 cities in 22 countries)
to train and externally validate the PI-CAI-2B model, i.e., an efficient,
next-generation iteration of the state-of-the-art AI system that was developed
for detecting Gleason grade group $\geq$2 prostate cancer on MRI during the
PI-CAI study. Of these examinations, 20,471 cases (19,278 patients; 26 cities
in 14 countries) from two EU Horizon projects (ProCAncer-I, COMFORT) and 12
independent centers based in Europe, North America, Asia and Africa, are used
for training and internal testing. Additionally, 2010 cases (2010 patients; 20
external cities in 12 countries) from population-based screening (STHLM3-MRI,
IP1-PROSTAGRAM trials) and primary diagnostic settings (PRIME trial) based in
Europe, North and South Americas, Asia and Australia, are used for external
testing. Primary endpoint is the proportion of AI-based assessments in
agreement with the standard of care diagnoses (i.e., clinical assessments made
by expert uropathologists on histopathology, if available, or at least two
expert urogenital radiologists in consensus; with access to patient history and
peer consultation) in the detection of Gleason grade group $\geq$2 prostate
cancer within the external testing cohorts. Our statistical analysis plan is
prespecified with a hypothesis of diagnostic interchangeability to the standard
of care at the PI-RADS $\geq$3 (primary diagnosis) or $\geq$4 (screening)
cut-off, considering an absolute margin of 0.05 and reader estimates derived
from the PI-CAI observer study (62 radiologists reading 400 cases). Secondary
measures comprise the area under the receiver operating characteristic curve
(AUROC) of the AI system stratified by imaging quality, patient age and patient
ethnicity to identify underlying biases (if any).
♻ ☆ Automatic infant 2D pose estimation from videos: comparing seven deep neural network methods
Automatic markerless estimation of infant posture and motion from ordinary
videos carries great potential for movement studies "in the wild", facilitating
understanding of motor development and massively increasing the chances of
early diagnosis of disorders. There is rapid development of human pose
estimation methods in computer vision thanks to advances in deep learning and
machine learning. However, these methods are trained on datasets that feature
adults in different contexts. This work tests and compares seven popular
methods (AlphaPose, DeepLabCut/DeeperCut, Detectron2, HRNet,
MediaPipe/BlazePose, OpenPose, and ViTPose) on videos of infants in supine
position and in more complex settings. Surprisingly, all methods except
DeepLabCut and MediaPipe have competitive performance without additional
finetuning, with ViTPose performing best. Next to standard performance metrics
(average precision and recall), we introduce errors expressed in the
neck-mid-hip (torso length) ratio and additionally study missed and redundant
detections, and the reliability of the internal confidence ratings of the
different methods, which are relevant for downstream tasks. Among the networks
with competitive performance, only AlphaPose could run close to real time (27
fps) on our machine. We provide documented Docker containers or instructions
for all the methods we used, our analysis scripts, and the processed data at
https://hub.docker.com/u/humanoidsctu and https://osf.io/x465b/.
comment: 38 pages, 8 figures, 22 tables
♻ ☆ Towards Scalable Training for Handwritten Mathematical Expression Recognition
Large foundation models have achieved significant performance gains through
scalable training on massive datasets. However, the field of
\textbf{H}andwritten \textbf{M}athematical \textbf{E}xpression
\textbf{R}ecognition (HMER) has been impeded by the scarcity of data, primarily
due to the arduous and costly process of manual annotation. To bridge this gap,
we propose a novel method integrating limited handwritten formulas with
large-scale LaTeX-rendered formulas by developing a scalable data engine to
generate complex and consistent LaTeX sequences. With this engine, we built the
largest formula dataset to date, termed \texttt{Tex80M}, comprising over 80
million high-quality training instances. Then we propose \texttt{TexTeller},
the first HMER model trained at scale, by mix-training \texttt{Tex80M} with a
relatively small HME dataset. The expansive training dataset and our refined
pipeline have equipped \texttt{TexTeller} with state-of-the-art (SOTA)
performance across nearly all benchmarks. To advance the field, we will openly
release our complete model, entire dataset, and full codebase, enabling further
research building upon our contributions.
♻ ☆ Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning
Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng
Vision-Language Models (VLMs) have demonstrated remarkable success across
diverse visual tasks, yet their performance degrades in complex visual
environments. While existing enhancement approaches require additional
training, rely on external segmentation tools, or operate at coarse-grained
levels, they overlook the innate ability within VLMs. To bridge this gap, we
investigate VLMs' attention patterns and discover that: (1) visual complexity
strongly correlates with attention entropy, negatively impacting reasoning
performance; (2) attention progressively refines from global scanning in
shallow layers to focused convergence in deeper layers, with convergence degree
determined by visual complexity; and (3) theoretically, we prove that the contrast
of attention maps between general queries and task-specific queries enables the
decomposition of visual signal into semantic signals and visual noise
components. Building on these insights, we propose Contrastive Attention
Refinement for Visual Enhancement (CARVE), a training-free method that extracts
task-relevant visual signals through attention contrasting at the pixel level.
Extensive experiments demonstrate that CARVE consistently enhances performance,
achieving up to 75% improvement on open-source models. Our work provides
critical insights into the interplay between visual complexity and attention
mechanisms, offering an efficient pathway for improving visual reasoning with
contrasting attention.
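The attention-contrast idea can be sketched as below, subtracting a general-query attention map from a task-query map to isolate task-relevant pixels; the normalization and thresholding choices are illustrative, not CARVE's exact recipe.

```python
# Pixel-level attention contrasting: the general-query map captures visual noise and
# layout bias, so subtracting it from the task-query map leaves task-relevant signal.
import torch

def contrast_attention(attn_task, attn_general, keep_quantile=0.5):
    """attn_*: (H, W) attention maps over image patches for the two queries."""
    norm = lambda a: (a - a.min()) / (a.max() - a.min() + 1e-6)
    contrast = (norm(attn_task) - norm(attn_general)).clamp(min=0)  # task-specific residual
    thresh = torch.quantile(contrast.flatten(), keep_quantile)
    return (contrast >= thresh).float()                  # mask highlighting task-relevant regions
```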
♻ ☆ Bridging Simplicity and Sophistication using GLinear: A Novel Architecture for Enhanced Time Series Prediction
Time Series Forecasting (TSF) is an important application across many fields.
There is a debate about whether Transformers, despite being good at
understanding long sequences, struggle with preserving temporal relationships
in time series data. Recent research suggests that simpler linear models might
outperform or at least provide competitive performance compared to complex
Transformer-based models for TSF tasks. In this paper, we propose a novel
data-efficient architecture, \textit{Gaussian-activated Linear model
(GLinear)}, for multivariate TSF that exploits periodic patterns to provide
better accuracy. It achieves higher prediction accuracy while requiring less
historical data than other state-of-the-art linear predictors. Four different
datasets (ETTh1, Electricity, Traffic, and Weather) are used to evaluate the
performance of the proposed predictor. A performance comparison with
state-of-the-art linear architectures (such as NLinear, DLinear, and RLinear)
and transformer-based time series predictors (Autoformer) shows that the
GLinear, despite being data efficient, outperforms the existing architectures
in most cases of multivariate TSF while being competitive in others. We hope
that the proposed GLinear model opens new fronts of research and development of
simpler and more sophisticated architectures for data and computationally
efficient time-series analysis. The source code is publicly available on
GitHub.
comment: Submitted to Digital Signal Processing Journal
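A hypothetical sketch of a Gaussian-activated linear predictor is given below for intuition; the exact GLinear formulation may differ from this two-layer, channel-independent form.

```python
# Sketch of a Gaussian-activated linear forecaster: a linear map over the lookback
# window, a Gaussian nonlinearity exp(-h^2) to capture periodic structure, then a
# linear projection to the forecast horizon.
import torch
import torch.nn as nn

class GLinearSketch(nn.Module):
    def __init__(self, lookback, horizon):
        super().__init__()
        self.fc_in = nn.Linear(lookback, lookback)
        self.fc_out = nn.Linear(lookback, horizon)

    def forward(self, x):                                # x: (batch, channels, lookback)
        h = self.fc_in(x)
        h = torch.exp(-h.pow(2))                         # Gaussian activation
        return self.fc_out(h)                            # (batch, channels, horizon)
```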
♻ ☆ A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism
Vision Transformer (ViT) has prevailed in computer vision tasks due to its
strong long-range dependency modelling ability. However, its
large model size and weak local feature modeling ability hinder its application
in real scenarios. To balance computation efficiency and performance in
downstream vision tasks, we propose an efficient ViT model with sparse
attention (dubbed SAEViT) and convolution blocks. Specifically, a Sparsely
Aggregated Attention (SAA) module has been proposed to perform adaptive sparse
sampling and recover the feature map via a deconvolution operation, which
significantly reduces the computational complexity of attention operations. In
addition, a Channel-Interactive Feed-Forward Network (CIFFN) layer is developed
to enhance inter-channel information exchange through feature decomposition and
redistribution, which mitigates the redundancy in traditional feed-forward
networks (FFN). Finally, a hierarchical pyramid structure with embedded
depth-wise separable convolutional blocks (DWSConv) is devised to further
strengthen convolutional features. Extensive experiments on mainstream datasets
show that SAEViT achieves Top-1 accuracies of 76.3\% and 79.6\% on the
ImageNet-1K classification task with only 0.8 GFLOPs and 1.3 GFLOPs,
respectively, demonstrating a lightweight solution for fundamental vision
tasks.
♻ ☆ Enhancing Automatic Modulation Recognition With a Reconstruction-Driven Vision Transformer Under Limited Labels
Hossein Ahmadi, Banafsheh Saffari, Sajjad Emdadi Mahdimahalleh, Mohammad Esmaeil Safari, Aria Ahmadi
Automatic modulation recognition (AMR) is critical for cognitive radio,
spectrum monitoring, and secure wireless communication. However, existing
solutions often rely on large labeled datasets or multi-stage training
pipelines, which limit scalability and generalization in practice. We propose a
unified Vision Transformer (ViT) framework that integrates supervised,
self-supervised, and reconstruction objectives. The model combines a ViT
encoder, a lightweight convolutional decoder, and a linear classifier; the
reconstruction branch maps augmented signals back to their originals, anchoring
the encoder to fine-grained I/Q structure. This strategy promotes robust,
discriminative feature learning during pretraining, while partial label
supervision in fine-tuning enables effective classification with limited
labels. On the RML2018.01A dataset, our approach outperforms supervised CNN and
ViT baselines in low-label regimes, approaches ResNet-level accuracy with only
15-20% labeled data, and maintains strong performance across varying SNR
levels. Overall, the framework provides a simple, generalizable, and
label-efficient solution for AMR.
♻ ☆ TinyDef-DETR: A DETR-based Framework for Defect Detection in Transmission Lines from UAV Imagery
Automated defect detection from UAV imagery of transmission lines is a
challenging task due to the small size, ambiguity, and complex backgrounds of
defects. This paper proposes TinyDef-DETR, a DETR-based framework designed to
achieve accurate and efficient detection of transmission line defects from
UAV-acquired images. The model integrates four major components: an
edge-enhanced ResNet backbone to strengthen boundary-sensitive representations,
a stride-free space-to-depth module to enable detail-preserving downsampling, a
cross-stage dual-domain multi-scale attention mechanism to jointly model global
context and local cues, and a Focaler-Wise-SIoU regression loss to improve the
localization of small and difficult targets. Together, these designs
effectively mitigate the limitations of conventional detectors. Extensive
experiments on both public and real-world datasets demonstrate that
TinyDef-DETR achieves superior detection performance and strong generalization
capability, while maintaining modest computational overhead. The accuracy and
efficiency of TinyDef-DETR make it a suitable method for UAV-based transmission
line defect detection, particularly in scenarios involving small and ambiguous
targets.
♻ ☆ Sigma Flows for Image and Data Labeling and Learning Structured Prediction
This paper introduces the sigma flow model for the prediction of structured
labelings of data observed on Riemannian manifolds, including Euclidean image
domains as special case. The approach combines the Laplace-Beltrami framework
for image denoising and enhancement, introduced by Sochen, Kimmel and Malladi
about 25 years ago, and the assignment flow approach introduced and studied by
the authors.
The sigma flow arises as Riemannian gradient flow of generalized harmonic
energies and thus is governed by a nonlinear geometric PDE which determines a
harmonic map from a closed Riemannian domain manifold to a statistical
manifold, equipped with the Fisher-Rao metric from information geometry. A
specific ingredient of the sigma flow is the mutual dependency of the
Riemannian metric of the domain manifold on the evolving state. This makes the
approach amenable to machine learning in a specific way, by realizing this
dependency through a mapping with compact time-variant parametrization that can
be learned from data. Proof of concept experiments demonstrate the expressivity
of the sigma flow model and prediction performance.
Structural similarities to transformer network architectures and networks
generated by the geometric integration of sigma flows are pointed out, which
highlights the connection to deep learning and, conversely, may stimulate the
use of geometric design principles for structured prediction in other areas of
scientific machine learning.
comment: 51 pages, revised experimental section
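For orientation, a generic harmonic-map gradient flow has the schematic form below (notation assumed); the sigma flow additionally lets the domain metric depend on the evolving state, which is not shown here.

```latex
% Generalized harmonic energy and its Riemannian gradient flow (tension-field form);
% schematic only, with the sigma flow's state-dependent domain metric omitted.
E(f) \;=\; \tfrac{1}{2}\int_{\mathcal{M}} \|\mathrm{d}f\|_{g,h}^{2}\,\mathrm{dvol}_{g},
\qquad
\partial_t f \;=\; -\operatorname{grad} E(f) \;=\; \tau_{g,h}(f).
```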
♻ ☆ ABS-Mamba: SAM2-Driven Bidirectional Spiral Mamba Network for Medical Image Translation MICCAI 2025
Accurate multi-modal medical image translation requires harmonizing global
anatomical semantics and local structural fidelity, a challenge complicated by
intermodality information loss and structural distortion. We propose ABS-Mamba,
a novel architecture integrating the Segment Anything Model 2 (SAM2) for
organ-aware semantic representation, specialized convolutional neural networks
(CNNs) for preserving modality-specific edge and texture details, and Mamba's
selective state-space modeling for efficient long- and short-range feature
dependencies. Structurally, our dual-resolution framework leverages SAM2's
image encoder to capture organ-scale semantics from high-resolution inputs,
while a parallel CNNs branch extracts fine-grained local features. The Robust
Feature Fusion Network (RFFN) integrates these representations, and the
Bidirectional Mamba Residual Network (BMRN) models spatial dependencies using
spiral scanning and bidirectional state-space dynamics. A three-stage skip
fusion decoder enhances edge and texture fidelity. We employ Efficient Low-Rank
Adaptation (LoRA+) fine-tuning to enable precise domain specialization while
maintaining the foundational capabilities of the pre-trained components.
Extensive experimental validation on the SynthRAD2023 and BraTS2019 datasets
demonstrates that ABS-Mamba outperforms state-of-the-art methods, delivering
high-fidelity cross-modal synthesis that preserves anatomical semantics and
structural details to enhance diagnostic accuracy in clinical applications. The
code is available at https://github.com/gatina-yone/ABS-Mamba
comment: MICCAI 2025 (under review)
♻ ☆ Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization
Jiahao Cui, Yan Chen, Mingwang Xu, Hanlin Shang, Yuxuan Chen, Yun Zhan, Zilong Dong, Yao Yao, Jingdong Wang, Siyu Zhu
Generating highly dynamic and photorealistic portrait animations driven by
audio and skeletal motion remains challenging due to the need for precise lip
synchronization, natural facial expressions, and high-fidelity body motion
dynamics. We propose a human-preference-aligned diffusion framework that
addresses these challenges through two key innovations. First, we introduce
direct preference optimization tailored for human-centric animation, leveraging
a curated dataset of human preferences to align generated outputs with
perceptual metrics for portrait motion-video alignment and naturalness of
expression. Second, the proposed temporal motion modulation resolves
spatiotemporal resolution mismatches by reshaping motion conditions into
dimensionally aligned latent features through temporal channel redistribution
and proportional feature expansion, preserving the fidelity of high-frequency
motion details in diffusion-based synthesis. The proposed mechanism is
complementary to existing UNet and DiT-based portrait diffusion approaches, and
experiments demonstrate obvious improvements in lip-audio synchronization,
expression vividness, body motion coherence over baseline methods, alongside
notable gains in human preference metrics. Our model and source code can be
found at: https://github.com/xyz123xyz456/hallo4.
♻ ☆ Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics
Recent advances in Large Language Models (LLMs) have demonstrated their
remarkable capacity to process and reason over structured and unstructured data
modalities beyond natural language. In this work, we explore the applications
of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMa
3.2, to the task of identifying neutrino interactions in pixelated detector
data from high-energy physics (HEP) experiments. We benchmark this model
against a state-of-the-art convolutional neural network (CNN) architecture,
similar to those used in the NOvA and DUNE experiments, which have achieved
high efficiency and purity in classifying electron and muon neutrino events.
Our evaluation considers both the classification performance and
interpretability of the model predictions. We find that VLMs can outperform
CNNs, while also providing greater flexibility in integrating auxiliary textual
or semantic information and offering more interpretable, reasoning-based
predictions. This work highlights the potential of VLMs as a general-purpose
backbone for physics event classification, due to their high performance,
interpretability, and generalizability, which opens new avenues for integrating
multimodal reasoning in experimental neutrino physics.
♻ ☆ Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training
Trajectory-Guided image-to-video (I2V) generation aims to synthesize videos
that adhere to user-specified motion instructions. Existing methods typically
rely on computationally expensive fine-tuning on scarce annotated datasets.
Although some zero-shot methods attempt trajectory control in the latent
space, they may yield unrealistic motion by neglecting 3D perspective and
creating a misalignment between the manipulated latents and the network's noise
predictions. To address these challenges, we introduce Zo3T, a novel zero-shot
test-time-training framework for trajectory-guided generation with three core
innovations: First, we incorporate a 3D-Aware Kinematic Projection, leveraging
inferring scene depth to derive perspective-correct affine transformations for
target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a
mechanism that dynamically injects and optimizes ephemeral LoRA adapters into
the denoising network alongside the latent state. Driven by a regional feature
consistency loss, this co-adaptation effectively enforces motion constraints
while allowing the pre-trained model to locally adapt its internal
representations to the manipulated latent, thereby ensuring generative fidelity
and on-manifold adherence. Finally, we develop Guidance Field Rectification,
which refines the denoising evolutionary path by optimizing the conditional
guidance field through a one-step lookahead strategy, ensuring efficient
generative progression towards the target trajectory. Zo3T significantly
enhances 3D realism and motion accuracy in trajectory-controlled I2V
generation, demonstrating superior performance over existing training-based and
zero-shot approaches.
♻ ☆ Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks IROS
Datasets for object detection often do not account for enough variety of
glasses, due to their transparent and reflective properties. Specifically,
open-vocabulary object detectors, widely used in embodied robotic agents, fail
to distinguish subclasses of glasses. This scientific gap poses an issue for
robotic applications that suffer from accumulating errors between detection,
planning, and action execution. This paper introduces a novel method for
acquiring real-world data from RGB-D sensors that minimizes human effort. We
propose an auto-labeling pipeline that generates labels for all the acquired
frames based on the depth measurements. We provide a novel real-world glass
object dataset GlassNICOLDataset that was collected on the Neuro-Inspired
COLlaborator (NICOL), a humanoid robot platform. The dataset consists of 7850
images recorded from five different cameras. We show that our trained baseline
model outperforms state-of-the-art open-vocabulary approaches. In addition, we
deploy our baseline model in an embodied agent approach to the NICOL platform,
on which it achieves a success rate of 81% in a human-robot bartending
scenario.
comment: Submitted and Accepted for Presentation at the IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS) 2025
♻ ☆ Robix: A Unified Model for Robot Interaction, Reasoning and Planning
Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li
We introduce Robix, a unified model that integrates robot reasoning, task
planning, and natural language interaction within a single vision-language
architecture. Acting as the high-level cognitive layer in a hierarchical robot
system, Robix dynamically generates atomic commands for the low-level
controller and verbal responses for human interaction, enabling robots to
follow complex instructions, plan long-horizon tasks, and interact naturally
with humans within an end-to-end framework. Robix further introduces novel
capabilities such as proactive dialogue, real-time interruption handling, and
context-aware commonsense reasoning during task execution. At its core, Robix
leverages chain-of-thought reasoning and adopts a three-stage training
strategy: (1) continued pretraining to enhance foundational embodied reasoning
abilities including 3D spatial understanding, visual grounding, and
task-centric reasoning; (2) supervised finetuning to model human-robot
interaction and task planning as a unified reasoning-action sequence; and (3)
reinforcement learning to improve reasoning-action consistency and long-horizon
task coherence. Extensive experiments demonstrate that Robix outperforms both
open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in
interactive task execution, demonstrating strong generalization across diverse
instruction types (e.g., open-ended, multi-stage, constrained, invalid, and
interrupted) and various user-involved tasks such as table bussing, grocery
shopping, and dietary filtering.
comment: Tech report. Project page: https://robix-seed.github.io/robix/
♻ ☆ VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization ICCV 2025
Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging
numerous visual tokens for fine-grained visual information, but this token
redundancy results in significant computational costs. Previous research aimed
at reducing visual tokens during inference typically leverages importance maps
derived from attention scores among vision-only tokens or vision-language
tokens to prune tokens across one or multiple pruning stages. Despite this
progress, pruning frameworks and strategies remain simplistic and
insufficiently explored, often resulting in substantial performance
degradation. In this paper, we propose VFlowOpt, a token pruning framework that
introduces an importance map derivation process and a progressive pruning
module with a recycling mechanism. The hyperparameters of its pruning strategy
are further optimized by a visual information flow-guided method. Specifically,
we compute an importance map for image tokens based on their attention-derived
context relevance and patch-level information entropy. We then decide which
tokens to retain or prune and aggregate the pruned ones as recycled tokens to
avoid potential information loss. Finally, we apply a visual information
flow-guided method that regards the last token in the LMM as the most
representative signal of text-visual interactions. This method minimizes the
discrepancy between token representations in LMMs with and without pruning,
thereby enabling superior pruning strategies tailored to different LMMs.
Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while
maintaining comparable performance, leading to an 89% reduction in KV-Cache
memory and 3.8 times faster inference.
comment: Accepted by ICCV 2025
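The kind of importance map described above can be sketched as attention relevance multiplied by patch entropy, as below; the product combination and the histogram-based entropy estimate are assumptions for illustration, not VFlowOpt's exact derivation.

```python
# Token importance = attention-derived context relevance x patch-level information
# entropy; higher-scoring visual tokens are kept, the rest can be pruned or recycled.
import torch

def token_importance(attn_to_tokens, patches, bins=16, eps=1e-8):
    """attn_to_tokens: (N,) attention received by each visual token.
    patches: (N, P) pixel intensities in [0, 1] for the patch behind each token."""
    hist = torch.stack([torch.histc(p, bins=bins, min=0.0, max=1.0) for p in patches])
    prob = hist / (hist.sum(dim=1, keepdim=True) + eps)
    entropy = -(prob * (prob + eps).log()).sum(dim=1)    # patch-level information entropy
    return attn_to_tokens * entropy                      # higher means more worth keeping
```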
♻ ☆ Improving Alignment in LVLMs with Debiased Self-Judgment EMNLP 2025
The rapid advancements in Large Language Models (LLMs) and Large
Visual-Language Models (LVLMs) have opened up new opportunities for integrating
visual and linguistic modalities. However, effectively aligning these
modalities remains challenging, often leading to hallucinations--where
generated outputs are not grounded in the visual input--and raising safety
concerns across various domains. Existing alignment methods, such as
instruction tuning and preference tuning, often rely on external datasets,
human annotations, or complex post-processing, which limit scalability and
increase costs. To address these challenges, we propose a novel approach that
generates the debiased self-judgment score, a self-evaluation metric created
internally by the model without relying on external resources. This enables the
model to autonomously improve alignment. Our method enhances both decoding
strategies and preference tuning processes, resulting in reduced
hallucinations, enhanced safety, and improved overall capability. Empirical
results show that our approach significantly outperforms traditional methods,
offering a more effective solution for aligning LVLMs.
comment: EMNLP 2025 Findings
♻ ☆ LiDAR-BIND-T: Improved and Temporally Consistent Sensor Modality Translation and Fusion for Robotic Applications
This paper extends LiDAR-BIND, a modular multi-modal fusion framework that
binds heterogeneous sensors (radar, sonar) to a LiDAR-defined latent space,
with mechanisms that explicitly enforce temporal consistency. We introduce
three contributions: (i) temporal embedding similarity that aligns consecutive
latent representations, (ii) a motion-aligned transformation loss that matches
displacement between predictions and ground truth LiDAR, and (iii) windowed
temporal fusion using a specialised temporal module. We further update the
model architecture to better preserve spatial structure. Evaluations on
radar/sonar-to-LiDAR translation demonstrate improved temporal and spatial
coherence, yielding lower absolute trajectory error and better occupancy map
accuracy in Cartographer-based SLAM (Simultaneous Localisation and Mapping). We
propose different metrics based on the Fr\'echet Video Motion Distance (FVMD)
and a correlation-peak distance metric providing practical temporal quality
indicators to evaluate SLAM performance. The proposed temporal LiDAR-BIND, or
LiDAR-BIND-T, maintains plug-and-play modality fusion while substantially
enhancing temporal stability, resulting in improved robustness and performance
for downstream SLAM.
♻ ☆ Total Disentanglement of Font Images into Style and Character Class Features
In this paper, we demonstrate a total disentanglement of font images. Total
disentanglement is a neural network-based method for decomposing each font
image nonlinearly and completely into its style and content (i.e., character
class) features. It uses a simple but careful training procedure to extract the
common style feature from all `A'-`Z' images in the same font and the common
content feature from all `A' (or another class) images in different fonts.
These disentangled features guarantee the reconstruction of the original font
image. Various experiments have been conducted to understand the performance of
total disentanglement. First, it is demonstrated that total disentanglement is
achievable with very high accuracy; this is experimental proof of the
long-standing open question, ``Does `A'-ness exist?'' Hofstadter (1985).
Second, it is demonstrated that the disentangled features produced by total
disentanglement apply to a variety of tasks, including font recognition,
character recognition, and one-shot font image generation. Code is available
here: https://github.com/uchidalab/total_disentanglement
♻ ☆ MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models
Garry Yang, Zizhe Chen, Man Hon Wong, Haoyu Lei, Yongqiang Chen, Zhenguo Li, Kaiwen Zhou, James Cheng
Large Video Models (LVMs) build on the semantic capabilities of Large
Language Models (LLMs) and vision modules by integrating temporal information
to better understand dynamic video content. Despite their progress, LVMs are
prone to hallucinations-producing inaccurate or irrelevant descriptions.
Current benchmarks for video hallucination depend heavily on manual
categorization of video content, neglecting the perception-based processes
through which humans naturally interpret videos. We introduce MESH, a benchmark
designed to evaluate hallucinations in LVMs systematically. MESH uses a
Question-Answering framework with binary and multi-choice formats incorporating
target and trap instances. It follows a bottom-up approach, evaluating basic
objects, coarse-to-fine subject features, and subject-action pairs, aligning
with human video understanding. We demonstrate that MESH offers an effective
and comprehensive approach for identifying hallucinations in videos. Our
evaluations show that while LVMs excel at recognizing basic objects and
features, their susceptibility to hallucinations increases markedly when
handling fine details or aligning multiple actions involving various subjects
in longer videos.
♻ ☆ Towards Reliable Medical Image Segmentation by Modeling Evidential Calibrated Uncertainty IEEE
Ke Zou, Yidi Chen, Ling Huang, Xuedong Yuan, Xiaojing Shen, Meng Wang, Rick Siow Mong Goh, Yong Liu, Huazhu Fu
Medical image segmentation is critical for disease diagnosis and treatment
assessment. However, concerns regarding the reliability of segmentation regions
persist among clinicians, mainly attributed to the absence of confidence
assessment, robustness, and calibration to accuracy. To address this, we
introduce DEviS, an easily implementable foundational model that seamlessly
integrates into various medical image segmentation networks. DEviS not only
enhances the calibration and robustness of baseline segmentation accuracy but
also provides high-efficiency uncertainty estimation for reliable predictions.
By leveraging subjective logic theory, we explicitly model probability and
uncertainty for medical image segmentation. Here, the Dirichlet distribution
parameterizes the distribution of probabilities for different classes of the
segmentation results. To generate calibrated predictions and uncertainty, we
develop a trainable calibrated uncertainty penalty. Furthermore, DEviS
incorporates an uncertainty-aware filtering module, which designs the metric of
uncertainty-calibrated error to filter out-of-distribution data. We conducted
validation studies on publicly available datasets, including ISIC2018,
KiTS2021, LiTS2017, and BraTS2019, to assess the accuracy and robustness of
different backbone segmentation models enhanced by DEviS, as well as the
efficiency and reliability of uncertainty estimation.
comment: 14 pages, 8 figures, accepted by IEEE Transactions on Cybernetics
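The subjective-logic parameterization referenced above follows the standard evidential recipe sketched below (evidence, Dirichlet concentration, expected class probabilities, and an explicit uncertainty mass); the trainable calibrated uncertainty penalty itself is not shown.

```python
# Standard subjective-logic outputs for evidential segmentation: per-pixel Dirichlet
# evidence yields class probabilities and an explicit uncertainty map u = K / S.
import torch
import torch.nn.functional as F

def evidential_outputs(logits):
    """logits: (B, K, H, W) raw network outputs for K classes."""
    evidence = F.softplus(logits)                        # non-negative evidence per class
    alpha = evidence + 1                                 # Dirichlet concentration parameters
    strength = alpha.sum(dim=1, keepdim=True)            # Dirichlet strength S
    prob = alpha / strength                              # expected class probabilities
    uncertainty = logits.shape[1] / strength             # uncertainty mass in (0, 1]
    return prob, uncertainty
```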
♻ ☆ S$^2$-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models
Chubin Chen, Jiashu Zhu, Xiaokun Feng, Nisha Huang, Meiqi Wu, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Xiu Li
Classifier-free Guidance (CFG) is a widely used technique in modern diffusion
models for enhancing sample quality and prompt adherence. However, through an
empirical analysis on Gaussian mixture modeling with a closed-form solution, we
observe a discrepancy between the suboptimal results produced by CFG and the
ground truth. The model's excessive reliance on these suboptimal predictions
often leads to semantic incoherence and low-quality outputs. To address this
issue, we first empirically demonstrate that the model's suboptimal predictions
can be effectively refined using sub-networks of the model itself. Building on
this insight, we propose S^2-Guidance, a novel method that leverages stochastic
block-dropping during the forward process to construct stochastic sub-networks,
effectively guiding the model away from potential low-quality predictions and
toward high-quality outputs. Extensive qualitative and quantitative experiments
on text-to-image and text-to-video generation tasks demonstrate that
S^2-Guidance delivers superior performance, consistently surpassing CFG and
other advanced guidance strategies. Our code will be released.
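A schematic of a CFG-style update that additionally steers away from a stochastically block-dropped sub-network's prediction is given below; the combination rule, weights, and the way the dropped variant is constructed are assumptions, not the paper's exact formulation.

```python
# Schematic guidance step: standard classifier-free guidance plus an extra term that
# pushes the prediction away from a weaker, block-dropped sub-network's output.
def s2_guided_noise(x_t, t, cond, uncond, eps_fn, eps_fn_dropped, w_cfg=7.5, w_sub=1.0):
    """eps_fn: full-model noise predictor; eps_fn_dropped: the same model evaluated with
    some blocks stochastically dropped for this call (construction is model-specific)."""
    eps_cond = eps_fn(x_t, t, cond)
    eps_uncond = eps_fn(x_t, t, uncond)
    eps_sub = eps_fn_dropped(x_t, t, cond)               # sub-network (lower-quality) prediction
    eps_cfg = eps_uncond + w_cfg * (eps_cond - eps_uncond)
    return eps_cfg + w_sub * (eps_cond - eps_sub)        # steer away from the sub-network mode
```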
♻ ☆ TESSER: Transfer-Enhancing Adversarial Attacks from Vision Transformers via Spectral and Semantic Regularization
Adversarial transferability remains a critical challenge in evaluating the
robustness of deep neural networks. In security-critical applications,
transferability enables black-box attacks without access to model internals,
making it a key concern for real-world adversarial threat assessment. While
Vision Transformers (ViTs) have demonstrated strong adversarial performance,
existing attacks often fail to transfer effectively across architectures,
especially from ViTs to Convolutional Neural Networks (CNNs) or hybrid models.
In this paper, we introduce \textbf{TESSER} -- a novel adversarial attack
framework that enhances transferability via two key strategies: (1)
\textit{Feature-Sensitive Gradient Scaling (FSGS)}, which modulates gradients
based on token-wise importance derived from intermediate feature activations,
and (2) \textit{Spectral Smoothness Regularization (SSR)}, which suppresses
high-frequency noise in perturbations using a differentiable Gaussian prior.
These components work in tandem to generate perturbations that are both
semantically meaningful and spectrally smooth. Extensive experiments on
ImageNet across 12 diverse architectures demonstrate that TESSER achieves
+10.9\% higher attack success rate (ASR) on CNNs and +7.2\% on ViTs compared to
the state-of-the-art Adaptive Token Tuning (ATT) method. Moreover, TESSER
significantly improves robustness against defended models, achieving 53.55\%
ASR on adversarially trained CNNs. Qualitative analysis shows strong alignment
between TESSER's perturbations and salient visual regions identified via
Grad-CAM, while frequency-domain analysis reveals a 12\% reduction in
high-frequency energy, confirming the effectiveness of spectral regularization.
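A minimal sketch of the spectral-smoothness idea above: damping high-frequency components of the perturbation gradient with a differentiable Gaussian blur. The kernel size and sigma are illustrative choices, not the paper's SSR settings.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=5, sigma=1.0):
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g1d = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g1d, g1d)
    return (k / k.sum()).view(1, 1, size, size)

def smooth_gradient(grad, size=5, sigma=1.0):
    """grad: (N, C, H, W) gradient w.r.t. the adversarial perturbation."""
    k = gaussian_kernel(size, sigma).to(grad).repeat(grad.shape[1], 1, 1, 1)
    return F.conv2d(grad, k, padding=size // 2, groups=grad.shape[1])  # depthwise blur

g = torch.randn(2, 3, 224, 224)
print(smooth_gradient(g).shape)   # same shape, high-frequency noise damped
```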
♻ ☆ UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images CCS
With the advent of text-to-image models and concerns about their misuse,
developers increasingly rely on image safety classifiers to moderate the
unsafe images their models generate. Yet, the performance of current image safety
classifiers remains unknown for both real-world and AI-generated images. In
this work, we propose UnsafeBench, a benchmarking framework that evaluates the
effectiveness and robustness of image safety classifiers, with a particular
focus on the impact of AI-generated images on their performance. First, we
curate a large dataset of 10K real-world and AI-generated images that are
annotated as safe or unsafe based on a set of 11 unsafe categories of images
(sexual, violent, hateful, etc.). Then, we evaluate the effectiveness and
robustness of five popular image safety classifiers, as well as three
classifiers that are powered by general-purpose visual language models. Our
assessment indicates that existing image safety classifiers are not
comprehensive and effective enough to mitigate the multifaceted problem of
unsafe images. Also, there exists a distribution shift between real-world and
AI-generated images in image qualities, styles, and layouts, leading to
degraded effectiveness and robustness. Motivated by these findings, we build a
comprehensive image moderation tool called PerspectiveVision, which improves
the effectiveness and robustness of existing classifiers, especially on
AI-generated images. UnsafeBench and PerspectiveVision can aid the research
community in better understanding the landscape of image safety classification
in the era of generative AI.
comment: To Appear in the ACM Conference on Computer and Communications
Security (CCS), October 13, 2025
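A hedged sketch of the general shape of such a benchmark: scoring a binary safe/unsafe classifier separately on real-world and AI-generated splits. The field names and the per-split F1 metric are illustrative assumptions, not the UnsafeBench protocol.

```python
from collections import defaultdict

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def evaluate(samples, classify):
    """samples: dicts with 'image', 'label' (1=unsafe), 'source' ('real'|'ai')."""
    counts = defaultdict(lambda: [0, 0, 0])              # source -> [tp, fp, fn]
    for s in samples:
        pred = classify(s["image"])
        tp, fp, fn = counts[s["source"]]
        counts[s["source"]] = [tp + (pred and s["label"]),
                               fp + (pred and not s["label"]),
                               fn + ((not pred) and s["label"])]
    return {src: f1(*c) for src, c in counts.items()}

demo = [{"image": None, "label": 1, "source": "ai"},
        {"image": None, "label": 0, "source": "real"}]
print(evaluate(demo, classify=lambda img: 1))            # per-split F1 scores
```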
♻ ☆ Deep Learning-Based Rock Particulate Classification Using Attention-Enhanced ConvNeXt
Accurate classification of rock sizes is a vital component in geotechnical
engineering, mining, and resource management, where precise estimation
influences operational efficiency and safety. In this paper, we propose an
enhanced deep learning model based on the ConvNeXt architecture, augmented with
both self-attention and channel attention mechanisms. Building upon the
foundation of ConvNeXt, our proposed model, termed CNSCA, introduces
self-attention to capture long-range spatial dependencies and channel attention
to emphasize informative feature channels. This hybrid design enables the model
to effectively capture both fine-grained local patterns and broader contextual
relationships within rock imagery, leading to improved classification accuracy
and robustness. We evaluate our model on a rock size classification dataset and
compare it against three strong baselines. The results demonstrate that the
incorporation of attention mechanisms significantly enhances the model's
capability for fine-grained classification tasks involving natural textures
like rocks.
comment: The paper has been withdrawn by the authors to accommodate
substantial revisions requested by a co-author. A revised version will be
submitted
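A minimal sketch of the channel-attention idea mentioned above, in the standard squeeze-and-excitation style; the paper's exact module design is not reproduced here, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (N, C, H, W) feature map
        w = x.mean(dim=(2, 3))                  # squeeze: global average pooling
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # excite: reweight informative channels

feat = torch.randn(2, 64, 32, 32)
print(ChannelAttention(64)(feat).shape)
```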
♻ ☆ JAX-IK: Real-Time Inverse Kinematics for Generating Multi-Constrained Movements of Virtual Human Characters
Generating accurate and realistic virtual human movements in real-time is of
high importance for a variety of applications in computer graphics, interactive
virtual environments, robotics, and biomechanics. This paper introduces a novel
real-time inverse kinematics (IK) solver specifically designed for realistic
human-like movement generation. Leveraging the automatic differentiation and
just-in-time compilation of JAX, the proposed solver efficiently handles
complex articulated human skeletons with high degrees of freedom. By treating
forward and inverse kinematics as differentiable operations, our method
effectively addresses common challenges such as error accumulation and
complicated joint limits in multi-constrained problems, which are critical for
realistic human motion modeling. We demonstrate the solver's effectiveness on
the SMPLX human skeleton model, evaluating its performance against widely used
iterative IK algorithms such as Cyclic Coordinate Descent (CCD), FABRIK,
and the nonlinear optimization algorithm IPOPT. Our experiments cover both
simple end-effector tasks and sophisticated, multi-constrained problems with
realistic joint limits. Results indicate that our IK solver achieves real-time
performance, exhibiting rapid convergence, minimal computational overhead per
iteration, and improved success rates compared to existing methods. The project
code is available at https://github.com/hvoss-techfak/JAX-IK
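A hedged sketch of treating forward and inverse kinematics as differentiable operations, shown on a toy 2-link planar arm with gradient descent in JAX; this is not the SMPLX skeleton or the paper's solver, and the loss and learning rate are illustrative.

```python
import jax
import jax.numpy as jnp

LINKS = jnp.array([1.0, 0.8])                   # toy link lengths

def forward_kinematics(angles):
    cum = jnp.cumsum(angles)                    # absolute joint angles
    return jnp.array([jnp.sum(LINKS * jnp.cos(cum)),
                      jnp.sum(LINKS * jnp.sin(cum))])

def loss(angles, target):
    return jnp.sum((forward_kinematics(angles) - target) ** 2)

@jax.jit
def step(angles, target, lr=0.1):
    return angles - lr * jax.grad(loss)(angles, target)   # autodiff through FK

angles, target = jnp.zeros(2), jnp.array([1.2, 0.9])
for _ in range(200):
    angles = step(angles, target)
print(forward_kinematics(angles), target)       # end effector close to target
```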
♻ ☆ IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, Yu-Gang Jiang
As large Vision-Language Models (VLMs) gain prominence, ensuring their safe
deployment has become critical. Recent studies have explored VLM robustness
against jailbreak attacks-techniques that exploit model vulnerabilities to
elicit harmful outputs. However, the limited availability of diverse multimodal
data has constrained current approaches to rely heavily on adversarial or
manually crafted images derived from harmful text datasets, which often lack
effectiveness and diversity across different contexts. In this paper, we
propose IDEATOR, a novel jailbreak method that autonomously generates malicious
image-text pairs for black-box jailbreak attacks. IDEATOR is grounded in the
insight that VLMs themselves could serve as powerful red team models for
generating multimodal jailbreak prompts. Specifically, IDEATOR leverages a VLM
to create targeted jailbreak texts and pairs them with jailbreak images
generated by a state-of-the-art diffusion model. Extensive experiments
demonstrate IDEATOR's high effectiveness and transferability, achieving a 94%
attack success rate (ASR) in jailbreaking MiniGPT-4 with an average of only
5.34 queries, and high ASRs of 82%, 88%, and 75% when transferred to LLaVA,
InstructBLIP, and Chameleon, respectively. Building on IDEATOR's strong
transferability and automated process, we introduce the VLJailbreakBench, a
safety benchmark comprising 3,654 multimodal jailbreak samples. Our benchmark
results on 11 recently released VLMs reveal significant gaps in safety
alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o
and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger
defenses. VLJailbreakBench is publicly available at
https://roywang021.github.io/VLJailbreakBench.
♻ ☆ Uncertainty-aware Diffusion and Reinforcement Learning for Joint Plane Localization and Anomaly Diagnosis in 3D Ultrasound MICCAI 2025
Congenital uterine anomalies (CUAs) can lead to infertility, miscarriage,
preterm birth, and an increased risk of pregnancy complications. Compared to
traditional 2D ultrasound (US), 3D US can reconstruct the coronal plane,
providing a clear visualization of the uterine morphology for assessing CUAs
accurately. In this paper, we propose an intelligent system for simultaneous
automated plane localization and CUA diagnosis. Our highlights are: 1) we
develop a denoising diffusion model with local (plane) and global (volume/text)
guidance, using an adaptive weighting strategy to optimize attention allocation
to different conditions; 2) we introduce a reinforcement learning-based
framework with unsupervised rewards to extract the key slice summary from
redundant sequences, fully integrating information across multiple planes to
reduce learning difficulty; 3) we provide text-driven uncertainty modeling for
coarse prediction, and leverage it to adjust the classification probability for
overall performance improvement. Extensive experiments on a large 3D uterine US
dataset show the efficacy of our method, in terms of plane localization and CUA
diagnosis. Code is available at https://github.com/yuhoo0302/CUA-US.
comment: Accepted by MICCAI 2025; 10 pages, 3 figures
♻ ☆ Glo-UMF: A Unified Multi-model Framework for Automated Morphometry of Glomerular Ultrastructural Characterization
Zhentai Zhang, Danyi Weng, Guibin Zhang, Xiang Chen, Kaixing Long, Jian Geng, Yanmeng Lu, Lei Zhang, Zhitao Zhou, Lei Cao
Background and Objective: To address the inability of single-model
architectures to perform simultaneous analysis of complex glomerular
ultrastructures, we developed Glo-UMF, a unified multi-model framework
integrating segmentation, classification, and detection to systematically
quantify key ultrastructural features. Methods: Glo-UMF decouples
quantification tasks by constructing three dedicated deep models: an
ultrastructure segmentation model, a glomerular filtration barrier (GFB) region
classification model, and an electron-dense deposits (EDD) detection model.
Their outputs are integrated through a post-processing workflow with adaptive
GFB cropping and measurement location screening, enhancing measurement
reliability and providing comprehensive quantitative results that overcome the
limitations of traditional grading. Results: Trained on 372 electron microscopy
images, Glo-UMF enables simultaneous quantification of glomerular basement
membrane (GBM) thickness, the degree of foot process effacement (FPE), and EDD
location. In 115 test cases spanning 9 renal pathological types, the automated
quantification results showed strong agreement with pathological reports, with
an average processing time of 4.23$\pm$0.48 seconds per case in a CPU-only
environment. Conclusions: The modular design of Glo-UMF allows for flexible
extensibility, supporting the joint quantification of multiple features. This
framework ensures robust generalization and clinical applicability,
demonstrating significant potential as an efficient auxiliary tool in
glomerular pathological analysis.
comment: 17 pages, 6 figures
♻ ☆ V-HOP: Visuo-Haptic 6D Object Pose Tracking
Humans naturally integrate vision and haptics for robust object perception
during manipulation. The loss of either modality significantly degrades
performance. Inspired by this multisensory integration, prior object pose
estimation research has attempted to combine visual and haptic/tactile
feedback. Although these works demonstrate improvements in controlled
environments or synthetic datasets, they often underperform vision-only
approaches in real-world settings due to poor generalization across diverse
grippers, sensor layouts, or sim-to-real environments. Furthermore, they
typically estimate the object pose for each frame independently, resulting in
less coherent tracking over sequences in real-world deployments. To address
these limitations, we introduce a novel unified haptic representation that
effectively handles multiple gripper embodiments. Building on this
representation, we introduce a new visuo-haptic transformer-based object pose
tracker that seamlessly integrates visual and haptic input. We validate our
framework on our dataset and the Feelsight dataset, demonstrating significant
performance improvement on challenging sequences. Notably, our method achieves
superior generalization and robustness across novel embodiments, objects, and
sensor types (both taxel-based and vision-based tactile sensors). In real-world
experiments, we demonstrate that our approach outperforms state-of-the-art
visual trackers by a large margin. We further show that we can achieve precise
manipulation tasks by incorporating our real-time object tracking result into
motion plans, underscoring the advantages of visuo-haptic perception. Project
website: https://ivl.cs.brown.edu/research/v-hop
comment: Accepted by RSS 2025
♻ ☆ Bidirectional Sparse Attention for Faster Video Diffusion Training
Video diffusion Transformer (DiT) models excel in generative quality but hit
major computational bottlenecks when producing high-resolution, long-duration
videos. The quadratic complexity of full attention leads to prohibitively high
training and inference costs. Full attention inefficiency stems from two key
challenges: excessive computation due to the inherent sparsity of Queries and
Key-Value pairs, and redundant computation as fixed sparse patterns fail to
leverage DiT's dynamic attention. To overcome this limitation, we propose a
Bidirectional Sparse Attention (BSA) framework for faster video DiT training,
the first to dynamically sparsify both Queries and Key-Value pairs within 3D
full attention, thereby substantially improving training and inference
efficiency. BSA addresses these issues through two key components. Query
sparsity is optimized by selecting the most informative query tokens via
semantic similarity together with a dynamic spatial-temporal training strategy, while KV
sparsity is achieved by computing a statistical dynamic threshold to retain
only the most salient KV blocks for computation. Extensive experiments
demonstrate that BSA significantly accelerates DiT training across long
sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention
training, while preserving or even surpassing the generative quality of full
attention.
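A minimal sketch of the statistical-threshold idea for keeping only salient key-value blocks described above. The block size, the mean-plus-std rule, and the query-to-block scoring are assumptions for illustration, not the BSA algorithm.

```python
import torch

def select_kv_blocks(q, k, block=64, alpha=1.0):
    """q: (L_q, D), k: (L_k, D). Returns indices of KV blocks to keep."""
    kb = k[: (k.shape[0] // block) * block].view(-1, block, k.shape[1]).mean(1)  # block means
    scores = (q.mean(0, keepdim=True) @ kb.T).squeeze(0)       # query-to-block similarity
    thresh = scores.mean() + alpha * scores.std()              # dynamic statistical threshold
    return (scores >= thresh).nonzero(as_tuple=True)[0]

q, k = torch.randn(256, 64), torch.randn(4096, 64)
print(select_kv_blocks(q, k))                                  # indices of salient KV blocks
```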
♻ ☆ GAPrompt: Geometry-Aware Point Cloud Prompt for 3D Vision Model ICML 2025
Pre-trained 3D vision models have gained significant attention for their
promising performance on point cloud data. However, fully fine-tuning these
models for downstream tasks is computationally expensive and storage-intensive.
Existing parameter-efficient fine-tuning (PEFT) approaches, which focus
primarily on input token prompting, struggle to achieve competitive performance
due to their limited ability to capture the geometric information inherent in
point clouds. To address this challenge, we propose a novel Geometry-Aware
Point Cloud Prompt (GAPrompt) that leverages geometric cues to enhance the
adaptability of 3D vision models. First, we introduce a Point Prompt that
serves as an auxiliary input alongside the original point cloud, explicitly
guiding the model to capture fine-grained geometric details. Additionally, we
present a Point Shift Prompter designed to extract global shape information
from the point cloud, enabling instance-specific geometric adjustments at the
input level. Moreover, our proposed Prompt Propagation mechanism incorporates
the shape information into the model's feature extraction process, further
strengthening its ability to capture essential geometric characteristics.
Extensive experiments demonstrate that GAPrompt significantly outperforms
state-of-the-art PEFT methods and achieves competitive results compared to full
fine-tuning on various benchmarks, while utilizing only 2.19% of trainable
parameters. Our code is available at
https://github.com/zhoujiahuan1991/ICML2025-GAPrompt.
comment: Accepted by ICML 2025
♻ ☆ Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization
We introduce DEEVISum (Distilled Early Exit Vision language model for
Summarization), a lightweight, efficient, and scalable vision language model
designed for segment-wise video summarization. Leveraging multi-modal prompts
that combine textual and audio-derived signals, DEEVISum incorporates Multi-Stage
Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance
between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement
over baseline distillation (0.5%), while EE reduces inference time by
approximately 21% with a 1.3 point drop in F1. Evaluated on the TVSum dataset,
our best model PaLI Gemma2 3B + MSKD achieves an F1 score of 61.1, rivaling
the performance of significantly larger models, all while maintaining a lower
computational footprint. We publicly release our code and processed dataset to
support further research.
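A hedged sketch of confidence-based early exit, the mechanism EE relies on: stop at an intermediate head once its prediction is confident enough. The two-stage layout and the threshold value are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class EarlyExitClassifier(nn.Module):
    def __init__(self, dim=128, classes=2, threshold=0.9):
        super().__init__()
        self.stage1, self.stage2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.head1, self.head2 = nn.Linear(dim, classes), nn.Linear(dim, classes)
        self.threshold = threshold

    def forward(self, x):
        h = torch.relu(self.stage1(x))
        p1 = self.head1(h).softmax(-1)
        if p1.max() >= self.threshold:          # confident: exit early, skip stage 2
            return p1, "early"
        h = torch.relu(self.stage2(h))
        return self.head2(h).softmax(-1), "full"

probs, route = EarlyExitClassifier()(torch.randn(128))
print(route, probs.shape)
```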
♻ ☆ Attention-Guided Multi-scale Interaction Network for Face Super-Resolution IEEE
Recently, CNN-Transformer hybrid networks have demonstrated excellent
performance in face super-resolution (FSR) tasks. Since hybrid networks contain
numerous features at different scales, how to fuse these multiscale features and
promote their complementarity is crucial for enhancing FSR. However, existing
hybrid network-based FSR methods ignore this, simply combining the Transformer
and CNN branches. To address this issue, we propose an Attention-guided
Multi-scale Interaction Network (AMINet), which incorporates local and global
feature interactions, as well as encoder-decoder phase feature interactions.
Specifically, we propose a Local and Global Feature Interaction Module (LGFI)
to promote the fusion of global features and the local features extracted from
different receptive fields by our Residual Depth Feature Extraction Module
(RDFE). Additionally, we propose a Selective Kernel Attention Fusion Module
(SKAF) to adaptively select fusions of different features within the LGFI and
encoder-decoder phases. Our above design allows the free flow of multiscale
features from within modules and between the encoder and decoder, which can
promote the complementarity of different scale features to enhance FSR.
Comprehensive experiments confirm that our method consistently performs well
with less computational consumption and faster inference.
comment: accepted by IEEE Transactions on Systems, Man and Cybernetics:Systems
(TSMC)
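A minimal sketch of selective, attention-weighted fusion of two feature branches (e.g., a local CNN path and a global Transformer path), in the spirit of the SKAF module described above; the exact design is an assumption.

```python
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.squeeze = nn.Linear(channels, channels // reduction)
        self.select = nn.Linear(channels // reduction, 2 * channels)

    def forward(self, a, b):                        # a, b: (N, C, H, W)
        s = (a + b).mean(dim=(2, 3))                # global descriptor of both branches
        z = torch.relu(self.squeeze(s))
        w = self.select(z).view(a.shape[0], 2, a.shape[1]).softmax(1)  # per-channel choice
        wa, wb = w[:, 0, :, None, None], w[:, 1, :, None, None]
        return wa * a + wb * b                      # adaptively blended features

a, b = torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16)
print(SelectiveFusion(32)(a, b).shape)
```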
♻ ☆ EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
Lu Chen, Yizhou Wang, Shixiang Tang, Qianhong Ma, Tong He, Wanli Ouyang, Xiaowei Zhou, Hujun Bao, Sida Peng
Learning an agent model that behaves like humans, capable of jointly
perceiving the environment, predicting the future, and taking actions from a
first-person perspective, is a fundamental challenge in computer vision.
Existing methods typically train separate models for these abilities, which
fail to capture their intrinsic relationships and prevent them from learning
from each other. Inspired by how humans learn through the perception-action
loop, we propose EgoAgent, a unified agent model that simultaneously learns to
represent, predict, and act within a single transformer. EgoAgent explicitly
models the causal and temporal dependencies among these abilities by
formulating the task as an interleaved sequence of states and actions. It
further introduces a joint embedding-action-prediction architecture with
temporally asymmetric predictor and observer branches, enabling synergistic
optimization across all three capabilities. Comprehensive evaluations of
EgoAgent on representative tasks such as image classification, egocentric
future state prediction, and 3D human motion prediction demonstrate the
superiority of our method. The code and trained models will be publicly
available at https://github.com/zju3dv/EgoAgent.
comment: Project Page: https://egoagent.github.io | Demo Video:
https://youtu.be/qhfHp_sfDvY
♻ ☆ Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models
Exploration is essential for general-purpose robotic learning, especially in
open-ended environments where dense rewards, explicit goals, or task-specific
supervision are scarce. Vision-language models (VLMs), with their semantic
reasoning over objects, spatial relations, and potential outcomes, present a
compelling foundation for generating high-level exploratory behaviors. However,
their outputs are often ungrounded, making it difficult to determine whether
imagined transitions are physically feasible or informative. To bridge the gap
between imagination and execution, we present IVE (Imagine, Verify, Execute),
an agentic exploration framework inspired by human curiosity. Human exploration
is often driven by the desire to discover novel scene configurations and to
deepen understanding of the environment. Similarly, IVE leverages VLMs to
abstract RGB-D observations into semantic scene graphs, imagine novel scenes,
predict their physical plausibility, and generate executable skill sequences
through action tools. We evaluate IVE in both simulated and real-world tabletop
environments. The results show that IVE enables more diverse and meaningful
exploration than RL baselines, as evidenced by a 4.1 to 7.8x increase in the
entropy of visited states. Moreover, the collected experience supports
downstream learning, producing policies that closely match or exceed the
performance of those trained on human-collected demonstrations.
comment: Project webpage: https://ive-robot.github.io/
♻ ☆ Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings ACM MM 2025
Computer-Aided Design (CAD) generative modeling is driving significant
innovations across industrial applications. Recent works have shown remarkable
progress in creating solid models from various inputs such as point clouds,
meshes, and text descriptions. However, these methods fundamentally diverge
from traditional industrial workflows that begin with 2D engineering drawings.
The automatic generation of parametric CAD models from these 2D vector drawings
remains underexplored despite being a critical step in engineering design. To
address this gap, our key insight is to reframe CAD generation as a
sequence-to-sequence learning problem where vector drawing primitives directly
inform the generation of parametric CAD operations, preserving geometric
precision and design intent throughout the transformation process. We propose
Drawing2CAD, a framework with three key technical components: a
network-friendly vector primitive representation that preserves precise
geometric information, a dual-decoder transformer architecture that decouples
command type and parameter generation while maintaining precise correspondence,
and a soft target distribution loss function accommodating inherent flexibility
in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing,
a dataset of paired engineering drawings and parametric CAD models, and conduct
thorough experiments to demonstrate the effectiveness of our method. Code and
dataset are available at https://github.com/lllssc/Drawing2CAD.
comment: Accepted to ACM MM 2025
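A hedged sketch of a soft target distribution loss over quantized CAD parameters, as described above: nearby bins receive some probability mass instead of a one-hot target. The bin count and smoothing width are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(logits, target_bins, num_bins=256, sigma=2.0):
    """logits: (N, num_bins); target_bins: (N,) ground-truth quantized parameters."""
    bins = torch.arange(num_bins, dtype=torch.float32)
    dist = (bins[None, :] - target_bins[:, None].float()) ** 2
    soft = torch.exp(-dist / (2 * sigma ** 2))
    soft = soft / soft.sum(dim=1, keepdim=True)                  # soft target distribution
    return -(soft * F.log_softmax(logits, dim=1)).sum(1).mean()  # cross entropy vs. soft targets

logits = torch.randn(8, 256)
targets = torch.randint(0, 256, (8,))
print(soft_target_loss(logits, targets))
```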
♻ ☆ ForestSplats: Deformable transient field for Gaussian Splatting in the Wild
Recently, 3D Gaussian Splatting (3D-GS) has emerged, showing real-time
rendering speeds and high-quality results in static scenes. Although 3D-GS is
effective in such controlled settings, its performance significantly degrades
in real-world environments due to transient objects, lighting variations, and
diverse levels of occlusion. To tackle this, existing methods estimate
occluders or transient elements by leveraging pre-trained models or integrating
additional transient field pipelines. However, these methods still suffer from
two drawbacks: 1) Using semantic features from a vision foundation model (VFM)
incurs additional computational cost. 2) The transient field requires
significant memory to handle transient elements with per-view Gaussians and
struggles to define clear boundaries for occluders, solely relying on
photometric errors. To address these problems, we propose ForestSplats, a novel
approach that leverages the deformable transient field and a superpixel-aware
mask to efficiently represent transient elements in the 2D scene across
unconstrained image collections and effectively decompose static scenes from
transient distractors without VFM. We designed the transient field to be
deformable, capturing per-view transient elements. Furthermore, we introduce a
superpixel-aware mask that clearly defines the boundaries of occluders by
considering photometric errors and superpixels. Additionally, we propose
uncertainty-aware densification to avoid generating Gaussians within the
boundaries of occluders during densification. Through extensive experiments
across several benchmark datasets, we demonstrate that ForestSplats outperforms
existing methods without VFM and shows significant memory efficiency in
representing transient elements.
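A minimal sketch of a superpixel-aware transient mask: average the photometric error inside each superpixel and flag superpixels whose mean error is unusually high. Superpixel labels are assumed to come from any off-the-shelf method (e.g., SLIC), and the thresholding rule is illustrative, not the ForestSplats formulation.

```python
import numpy as np

def superpixel_transient_mask(photo_error, sp_labels, k=2.0):
    """photo_error, sp_labels: (H, W) arrays; returns a boolean occluder mask."""
    ids = np.unique(sp_labels)
    means = np.array([photo_error[sp_labels == i].mean() for i in ids])
    thresh = means.mean() + k * means.std()          # flag unusually erroneous superpixels
    flagged = ids[means > thresh]
    return np.isin(sp_labels, flagged)

err = np.random.rand(64, 64)
labels = (np.arange(64 * 64) // 256).reshape(64, 64)  # toy "superpixels" of 256 pixels each
print(superpixel_transient_mask(err, labels).sum())
```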
♻ ☆ Parasite: A Steganography-based Backdoor Attack Framework for Diffusion Models
Recently, the diffusion model has gained significant attention as one of the
most successful image generation models, which can generate high-quality images
by iteratively sampling noise. However, recent studies have shown that
diffusion models are vulnerable to backdoor attacks, allowing attackers to
inject input data containing triggers to activate the backdoor and generate
their desired output. Existing backdoor attack methods have primarily targeted
noise-to-image and text-to-image tasks, with limited work on backdoor
attacks in image-to-image tasks. Furthermore, traditional backdoor attacks
often rely on a single, conspicuous trigger to generate a fixed target image,
lacking concealability and flexibility. To address these limitations, we
propose a novel backdoor attack method called "Parasite" for image-to-image
tasks in diffusion models, which is not only the first to leverage
steganography for trigger hiding, but also allows attackers to embed the
target content as a backdoor trigger to achieve a more flexible attack.
"Parasite" as a novel attack method effectively bypasses existing detection
frameworks to execute backdoor attacks. In our experiments, "Parasite" achieved
a 0 percent backdoor detection rate against the mainstream defense frameworks.
In addition, in the ablation study, we discuss the influence of different
hiding coefficients on the attack results. You can find our code at
https://anonymous.4open.science/r/Parasite-1715/.
♻ ☆ AdvReal: Physical Adversarial Patch Generation Framework for Security Evaluation of Object Detection Systems
Autonomous vehicles are typical complex intelligent systems with artificial
intelligence at their core. However, perception methods based on deep learning
are extremely vulnerable to adversarial samples, which can lead to
safety-critical incidents. Generating effective adversarial examples in the
physical world and using them to evaluate object detection systems remains a
major challenge. In this study, we
propose a unified joint adversarial training framework for both 2D and 3D
domains, which simultaneously optimizes texture maps in 2D image and 3D mesh
spaces to better address intra-class diversity and real-world environmental
variations. The framework includes a novel realism-enhanced adversarial
module with a time-space and relighting mapping pipeline that adjusts
illumination consistency between adversarial patches and target garments under
varied viewpoints. Building upon this, we develop a realism enhancement
mechanism that incorporates non-rigid deformation modeling and texture
remapping to ensure alignment with the human body's non-rigid surfaces in 3D
scenes. Extensive experiment results in digital and physical environments
demonstrate that the adversarial textures generated by our method can
effectively mislead the target detection model. Specifically, our method
achieves an average attack success rate (ASR) of 70.13% on YOLOv12 in physical
scenarios, significantly outperforming existing methods such as T-SEA (21.65%)
and AdvTexture (19.70%). Moreover, the proposed method maintains stable ASR
across multiple viewpoints and distances, with an average attack success rate
exceeding 90% under both frontal and oblique views at a distance of 4 meters.
This confirms the method's strong robustness and transferability under
multi-angle attacks, varying lighting conditions, and real-world distances. The
demo video and code can be obtained at
https://github.com/Huangyh98/AdvReal.git.
♻ ☆ C3VDv2 -- Colonoscopy 3D video dataset with enhanced realism
Mayank V. Golhar, Lucas Sebastian Galeano Fretes, Loren Ayers, Venkata S. Akshintala, Taylor L. Bobrow, Nicholas J. Durr
Spatial computer vision techniques have the potential to improve the
diagnostic performance of colonoscopy. However, the lack of 3D colonoscopy
datasets for training and validation hinders their development. This paper
introduces C3VDv2, the second version (v2) of the high-definition Colonoscopy
3D Video Dataset, featuring enhanced realism designed to facilitate the
quantitative evaluation of 3D colon reconstruction algorithms. 192 video
sequences totaling 169,371 frames were captured by imaging 60 unique,
high-fidelity silicone colon phantom segments. Ground truth depth, surface
normals, optical flow, occlusion, diffuse maps, six-degree-of-freedom pose,
coverage map, and 3D models are provided for 169 colonoscopy videos. Eight
simulated screening colonoscopy videos acquired by a gastroenterologist are
provided with ground truth poses. Lastly, the dataset includes 15 videos with
colon deformations for qualitative assessment. C3VDv2 emulates diverse and
challenging scenarios for 3D reconstruction algorithms, including fecal debris,
mucous pools, blood, debris obscuring the colonoscope lens, en-face views, and
fast camera motion. The enhanced realism of C3VDv2 will allow for more robust
and representative development and evaluation of 3D reconstruction algorithms.
Project Page - https://durrlab.github.io/C3VDv2/
comment: 19 pages, 7 figures
♻ ☆ Combating Falsification of Speech Videos with Live Optical Signatures (Extended Version) CCS '25
High-profile speech videos are prime targets for falsification, owing to
their accessibility and influence. This work proposes VeriLight, a low-overhead
and unobtrusive system for protecting speech videos from visual manipulations
of speaker identity and lip and facial motion. Unlike the predominant purely
digital falsification detection methods, VeriLight creates dynamic physical
signatures at the event site and embeds them into all video recordings via
imperceptible modulated light. These physical signatures encode
semantically-meaningful features unique to the speech event, including the
speaker's identity and facial motion, and are cryptographically-secured to
prevent spoofing. The signatures can be extracted from any video downstream and
validated against the portrayed speech content to check its integrity. Key
elements of VeriLight include (1) a framework for generating extremely compact
(i.e., 150-bit), pose-invariant speech video features, based on
locality-sensitive hashing; and (2) an optical modulation scheme that embeds
$>$200 bps into video while remaining imperceptible both in the recorded video and to live viewers.
Experiments on extensive video datasets show VeriLight achieves AUCs $\geq$
0.99 and a true positive rate of 100% in detecting falsified videos. Further,
VeriLight is highly robust across recording conditions, video post-processing
techniques, and white-box adversarial attacks on its feature extraction
methods. A demonstration of VeriLight is available at
https://mobilex.cs.columbia.edu/verilight.
comment: In Proceedings of the 2025 ACM SIGSAC Conference on Computer and
Communications Security (CCS '25). October 13 - 17, 2025, Taipei, Taiwan.
ACM, New York, NY, USA. 19 pages
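A hedged sketch of a compact locality-sensitive hash of the kind described above: projecting a pose-invariant feature vector onto random hyperplanes and keeping the sign bits yields a fixed-size (here 150-bit) signature. The feature dimension and shared seed are illustrative; this is not VeriLight's exact feature pipeline or modulation scheme.

```python
import numpy as np

def lsh_signature(features, n_bits=150, seed=0):
    rng = np.random.default_rng(seed)                   # hyperplanes shared by both ends
    planes = rng.standard_normal((n_bits, features.shape[0]))
    return (planes @ features > 0).astype(np.uint8)     # sign bits form a compact signature

def hamming(a, b):
    return int(np.sum(a != b))                          # small distance -> likely same event

f = np.random.randn(512)
sig_live = lsh_signature(f)                             # embedded at the event site
sig_video = lsh_signature(f + 0.01 * np.random.randn(512))  # recovered from the recording
print(hamming(sig_live, sig_video))                     # near 0 for matching speech content
```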