Computer Vision and Pattern Recognition 104
☆ GenExam: A Multidisciplinary Text-to-Image Exam
Exams are a fundamental test of expert-level intelligence and require
integrated understanding, reasoning, and generation. Existing exam-style
benchmarks mainly focus on understanding and reasoning tasks, and current
generation benchmarks emphasize the illustration of world knowledge and visual
concepts, neglecting the evaluation of rigorous drawing exams. We introduce
GenExam, the first benchmark for multidisciplinary text-to-image exams,
featuring 1,000 samples across 10 subjects with exam-style prompts organized
under a four-level taxonomy. Each problem is equipped with ground-truth images
and fine-grained scoring points to enable a precise evaluation of semantic
correctness and visual plausibility. Experiments show that even
state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve
less than 15% strict scores, and most models score almost 0%, underscoring the
difficulty of our benchmark. By framing image generation as an exam, GenExam
offers a rigorous assessment of models' ability to integrate knowledge,
reasoning, and generation, providing insights on the path toward AGI.
☆ Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark
While recent advancements in vision-language models have improved video
understanding, diagnosing their capacity for deep, narrative comprehension
remains a challenge. Existing benchmarks often test short-clip recognition or
use template-based questions, leaving a critical gap in evaluating fine-grained
reasoning over long-form narrative content. To address these gaps, we introduce
$\mathsf{Cin\acute{e}aste}$, a comprehensive benchmark for long-form movie
understanding. Our dataset comprises 3,119 multiple-choice question-answer
pairs derived from 1,805 scenes across 200 diverse movies, spanning five novel
fine-grained contextual reasoning categories. We use GPT-4o to generate
diverse, context-rich questions by integrating visual descriptions, captions,
scene titles, and summaries, which require deep narrative understanding. To
ensure high-quality evaluation, our pipeline incorporates a two-stage filtering
process: Context-Independence filtering ensures questions require video
context, while Contextual Veracity filtering validates factual consistency
against the movie content, mitigating hallucinations. Experiments show that
existing MLLMs struggle on $\mathsf{Cin\acute{e}aste}$; our analysis reveals
that long-range temporal reasoning is a primary bottleneck, with the top
open-source model achieving only 63.15% accuracy. This underscores significant
challenges in fine-grained contextual understanding and the need for
advancements in long-form movie comprehension.
comment: 11 pages, 5 figures, 5 tables
☆ Dense Video Understanding with Gated Residual Tokenization
High temporal resolution is essential for capturing fine-grained details in
video understanding. However, current video large language models (VLLMs) and
benchmarks mostly rely on low-frame-rate sampling, such as uniform sampling or
keyframe selection, discarding dense temporal information. This compromise
avoids the high cost of tokenizing every frame, which otherwise leads to
redundant computation and linear token growth as video length increases. While
this trade-off works for slowly changing content, it fails for tasks like
lecture comprehension, where information appears in nearly every frame and
requires precise temporal alignment. To address this gap, we introduce Dense
Video Understanding (DVU), which enables high-FPS video comprehension by
reducing both tokenization time and token overhead. Existing benchmarks are
also limited, as their QA pairs focus on coarse content changes. We therefore
propose DIVE (Dense Information Video Evaluation), the first benchmark designed
for dense temporal reasoning. To make DVU practical, we present Gated Residual
Tokenization (GRT), a two-stage framework: (1) Motion-Compensated Inter-Gated
Tokenization uses pixel-level motion estimation to skip static regions during
tokenization, achieving sub-linear growth in token count and compute. (2)
Semantic-Scene Intra-Tokenization Merging fuses tokens across static regions
within a scene, further reducing redundancy while preserving dynamic semantics.
Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales
positively with FPS. These results highlight the importance of dense temporal
information and demonstrate that GRT enables efficient, scalable high-FPS video
understanding.
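As an editorial illustration of the gating idea behind GRT, the sketch below tokenizes a patch only when its pixels change noticeably between consecutive frames and otherwise reuses the cached token. It is a minimal PyTorch sketch under assumed shapes and a hypothetical tokenizer callable; the actual method uses motion-compensated inter-gated tokenization and semantic intra-scene merging rather than raw frame differencing.

import torch

def gated_tokenize(frames, tokenizer, patch=16, motion_thresh=0.05):
    # frames: (T, C, H, W) video in [0, 1]; tokenizer maps flattened patches
    # (N, C*patch*patch) to tokens (N, D). Illustrative stand-in only.
    T = frames.shape[0]
    # Non-overlapping patches per frame: (T, N, C*patch*patch)
    patches = torch.nn.functional.unfold(frames, patch, stride=patch).transpose(1, 2)
    tokens, prev = [], None
    for t in range(T):
        if t == 0:
            cur = tokenizer(patches[t])                 # tokenize every patch once
        else:
            # Mean absolute pixel change per patch acts as a cheap motion gate.
            motion = (patches[t] - patches[t - 1]).abs().mean(dim=-1)
            active = motion > motion_thresh
            cur = prev.clone()                          # reuse cached tokens for static patches
            if active.any():
                cur[active] = tokenizer(patches[t][active])
        tokens.append(cur)
        prev = cur
    return torch.stack(tokens)                          # (T, N, D)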
☆ MCGS-SLAM: A Multi-Camera SLAM Framework Using Gaussian Splatting for High-Fidelity Mapping
Zhihao Cao, Hanyu Wu, Li Wa Tang, Zizhou Luo, Zihan Zhu, Wei Zhang, Marc Pollefeys, Martin R. Oswald
Recent progress in dense SLAM has primarily targeted monocular setups, often
at the expense of robustness and geometric coverage. We present MCGS-SLAM, the
first purely RGB-based multi-camera SLAM system built on 3D Gaussian Splatting
(3DGS). Unlike prior methods relying on sparse maps or inertial data, MCGS-SLAM
fuses dense RGB inputs from multiple viewpoints into a unified, continuously
optimized Gaussian map. A multi-camera bundle adjustment (MCBA) jointly refines
poses and depths via dense photometric and geometric residuals, while a scale
consistency module enforces metric alignment across views using low-rank
priors. The system supports RGB input and maintains real-time performance at
large scale. Experiments on synthetic and real-world datasets show that
MCGS-SLAM consistently yields accurate trajectories and photorealistic
reconstructions, usually outperforming monocular baselines. Notably, the wide
field of view from multi-camera input enables reconstruction of side-view
regions that monocular setups miss, critical for safe autonomous operation.
These results highlight the promise of multi-camera Gaussian Splatting SLAM for
high-fidelity mapping in robotics and autonomous driving.
☆ Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions
Vision Transformers (ViTs) achieve state-of-the-art performance in semantic
segmentation but are hindered by high computational and memory costs. To
address this, we propose STEP (SuperToken and Early-Pruning), a hybrid
token-reduction framework that combines dynamic patch merging and token pruning
to enhance efficiency without significantly compromising accuracy. At the core
of STEP is dCTS, a lightweight CNN-based policy network that enables flexible
merging into superpatches. Encoder blocks also integrate early exits that halt
high-confidence supertokens, lowering the computational load. We evaluate our method
on high-resolution semantic segmentation benchmarks, including images up to
1024 x 1024, and show that when dCTS is applied alone, the token count can be
reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching
scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase
in throughput when using ViT-Large as the backbone. Applying the full STEP
framework further improves efficiency, reaching up to a 4x reduction in
computational complexity and a 1.7x gain in inference speed, with an accuracy
drop of no more than 2.0%. With the proposed STEP configurations, up
to 40% of tokens can be confidently predicted and halted before reaching the
final encoder layer.
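To make the early-exit rule concrete, a hedged sketch of confidence-based token halting is given below; the auxiliary head, threshold, and bookkeeping are illustrative assumptions, not the STEP implementation.

import torch

def early_halt(tokens, aux_head, keep_mask, frozen_logits, conf_thresh=0.95):
    """Illustrative early-exit step: freeze confident tokens and drop them
    from later encoder blocks. tokens is (N, D), aux_head maps (N, D) to
    per-token class logits, keep_mask (N,) marks tokens still active, and
    frozen_logits (N, K) stores predictions of already-halted tokens.
    """
    logits = aux_head(tokens[keep_mask])
    conf = logits.softmax(dim=-1).amax(dim=-1)
    halt = conf >= conf_thresh                          # confident -> halt here
    active_idx = keep_mask.nonzero(as_tuple=True)[0]
    frozen_logits[active_idx[halt]] = logits[halt]      # keep their prediction
    keep_mask[active_idx[halt]] = False                 # remove from later blocks
    return keep_mask, frozen_logits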
☆ BEVUDA++: Geometric-aware Unsupervised Domain Adaptation for Multi-View 3D Object Detection IEEE
Vision-centric Bird's Eye View (BEV) perception holds considerable promise
for autonomous driving. Recent studies have prioritized efficiency or accuracy
enhancements, yet the issue of domain shift has been overlooked, leading to
substantial performance degradation upon transfer. We identify major domain
gaps in real-world cross-domain scenarios and initiate the first effort to
address the Domain Adaptation (DA) challenge in multi-view 3D object detection
for BEV perception. Given the complexity of BEV perception approaches with
their multiple components, domain shift accumulation across multi-geometric
spaces (e.g., 2D, 3D Voxel, BEV) poses a significant challenge for BEV domain
adaptation. In this paper, we introduce an innovative geometric-aware
teacher-student framework, BEVUDA++, to mitigate this issue, comprising a
Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS) model.
Specifically, RDT effectively blends target LiDAR with dependable depth
predictions to generate depth-aware information based on uncertainty
estimation, enhancing the extraction of Voxel and BEV features that are
essential for understanding the target domain. To collaboratively reduce the
domain shift, GCS maps features from multiple spaces into a unified geometric
embedding space, thereby narrowing the gap in data distribution between the two
domains. Additionally, we introduce a novel Uncertainty-guided Exponential
Moving Average (UEMA) that uses previously obtained uncertainty estimates to
further reduce error accumulation caused by domain shift. To demonstrate the
superiority of our proposed method, we execute comprehensive experiments in
four cross-domain scenarios, securing state-of-the-art performance in BEV 3D
object detection tasks, e.g., a 12.9% NDS and 9.5% mAP improvement on Day-Night
adaptation.
comment: Accepted by IEEE TCSVT
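A minimal sketch of an uncertainty-guided EMA update is shown below, assuming a scalar per-batch uncertainty in [0, 1]; the exact modulation used by UEMA in BEVUDA++ may differ.

import torch

@torch.no_grad()
def uncertainty_guided_ema(teacher, student, uncertainty, base_momentum=0.999):
    # Hedged reading of UEMA: when the current target batch is uncertain,
    # the teacher trusts the student less, i.e. the effective momentum moves
    # toward 1. `uncertainty` summarizes, e.g., predictive entropy.
    m = base_momentum + (1.0 - base_momentum) * float(uncertainty)
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)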
☆ An Exploratory Study on Abstract Images and Visual Representations Learned from Them BMVC 2025
Imagine living in a world composed solely of primitive shapes: could you
still recognise familiar objects? Recent studies have shown that abstract
images, constructed from primitive shapes, can indeed convey visual semantic
information to deep learning models. However, representations obtained from
such images often fall short compared to those derived from traditional raster
images. In this paper, we study the reasons behind this performance gap and
investigate how much high-level semantic content can be captured at different
abstraction levels. To this end, we introduce the Hierarchical Abstraction
Image Dataset (HAID), a novel data collection that comprises abstract images
generated from normal raster images at multiple levels of abstraction. We then
train and evaluate conventional vision systems on HAID across various tasks
including classification, segmentation, and object detection, providing a
comprehensive study between rasterised and abstract image representations. We
also discuss whether abstract images can serve as an effective format for
conveying visual semantic information and contributing to vision tasks.
comment: Accepted to BMVC 2025
☆ MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook ICCV 2025
Peng Xu, Shengwu Xiong, Jiajun Zhang, Yaxiong Chen, Bowen Zhou, Chen Change Loy, David A. Clifton, Kyoung Mu Lee, Luc Van Gool, Ruiming He, Ruilin Yao, Xinwei Long, Jirui Huang, Kai Tian, Sa Yang, Yihua Shao, Jin Feng, Yue Zhong, Jiakai Zhou, Cheng Tang, Tianyu Zou, Yifang Zhang, Junming Liang, Guoyou Li, Zhaoxiang Wang, Qiang Zhou, Yichen Zhao, Shili Xiong, Hyeongjin Nam, Jaerin Lee, Jaeyoung Chung, JoonKyu Park, Junghun Oh, Kanggeon Lee, Wooseok Lee, Juneyoung Ro, Turghun Osman, Can Hu, Chaoyang Liao, Cheng Chen, Chengcheng Han, Chenhao Qiu, Chong Peng, Cong Xu, Dailin Li, Feiyu Wang, Feng Gao, Guibo Zhu, Guopeng Tang, Haibo Lu, Han Fang, Han Qi, Hanxiao Wu, Haobo Cheng, Hongbo Sun, Hongyao Chen, Huayong Hu, Hui Li, Jiaheng Ma, Jiang Yu, Jianing Wang, Jie Yang, Jing He, Jinglin Zhou, Jingxuan Li, Josef Kittler, Lihao Zheng, Linnan Zhao, Mengxi Jia, Muyang Yan, Nguyen Thanh Thien, Pu Luo, Qi Li, Shien Song, Shijie Dong, Shuai Shao, Shutao Li, Taofeng Xue, Tianyang Xu, Tianyi Gao, Tingting Li, Wei Zhang, Weiyang Su, Xiaodong Dong, Xiao-Jun Wu, Xiaopeng Zhou, Xin Chen, Xin Wei, Xinyi You, Xudong Kang, Xujie Zhou, Xusheng Liu, Yanan Wang, Yanbin Huang, Yang Liu, Yang Yang, Yanglin Deng, Yashu Kang, Ye Yuan, Yi Wen, Yicen Tian, Yilin Tao, Yin Tang, Yipeng Lin, Yiqing Wang, Yiting Xi, Yongkang Yu, Yumei Li, Yuxin Qin, Yuying Chen, Yuzhe Cen, Zhaofan Zou, Zhaohong Liu, Zhehao Shen, Zhenglin Du, Zhengyang Li, Zhenni Huang, Zhenwei Shao, Zhilong Song, Zhiyong Feng, Zhiyu Wang, Zhou Yu, Ziang Li, Zihan Zhai, Zijian Zhang, Ziyang Peng, Ziyun Xiao, Zongshu Li
This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim
to bring together different approaches in multimodal machine learning and LLMs
via a large benchmark. We hope it better allows researchers to follow the
state-of-the-art in this very dynamic area. Meanwhile, a growing number of
testbeds have boosted the evolution of general-purpose large language models.
Thus, this year's MARS2 focuses on real-world and specialized scenarios to
broaden the multimodal reasoning applications of MLLMs. Our organizing team
released two tailored datasets, Lens and AdsQA, as test sets, which support
general reasoning in 12 daily scenarios and domain-specific reasoning in
advertisement videos, respectively. We evaluated 40+ baselines that include
both generalist MLLMs and task-specific models, and opened up three competition
tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question
Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative
Advertisement Videos (VR-Ads). Finally, 76 teams from renowned academic and
industrial institutions registered, and 40+ valid submissions (out of
1200+) have been included in our ranking lists. Our datasets, code sets (40+
baselines and 15+ participants' methods), and rankings are publicly available
on the MARS2 workshop website and our GitHub organization page
https://github.com/mars2workshop/, where our updates and announcements of
upcoming events will be continuously provided.
comment: ICCV 2025 MARS2 Workshop and Challenge "Multimodal Reasoning and Slow
Thinking in the Large Model Era: Towards System 2 and Beyond"
☆ Deceptive Beauty: Evaluating the Impact of Beauty Filters on Deepfake and Morphing Attack Detection IEEE
Sara Concas, Simone Maurizio La Cava, Andrea Panzino, Ester Masala, Giulia Orrù, Gian Luca Marcialis
Digital beautification through social media filters has become increasingly
popular, raising concerns about the reliability of facial images and videos and
the effectiveness of automated face analysis. This issue is particularly
critical for digital manipulation detectors, systems aiming at distinguishing
between genuine and manipulated data, especially in cases involving deepfakes
and morphing attacks designed to deceive humans and automated facial
recognition. This study examines whether beauty filters impact the performance
of deepfake and morphing attack detectors. We perform a comprehensive analysis,
evaluating multiple state-of-the-art detectors on benchmark datasets before and
after applying various smoothing filters. Our findings reveal performance
degradation, highlighting vulnerabilities introduced by facial enhancements and
underscoring the need for robust detection models resilient to such
alterations.
comment: Accepted at the 2025 IEEE INTERNATIONAL CONFERENCE ON Metrology for
eXtended Reality, Artificial Intelligence and Neural Engineering
☆ Generative AI for Misalignment-Resistant Virtual Staining to Accelerate Histopathology Workflows
Jiabo MA, Wenqiang Li, Jinbang Li, Ziyi Liu, Linshan Wu, Fengtao Zhou, Li Liang, Ronald Cheong Kin Chan, Terence T. W. Wong, Hao Chen
Accurate histopathological diagnosis often requires multiple differently
stained tissue sections, a process that is time-consuming, labor-intensive, and
environmentally taxing due to the use of multiple chemical stains. Recently,
virtual staining has emerged as a promising alternative that is faster,
tissue-conserving, and environmentally friendly. However, existing virtual
staining methods face significant challenges in clinical applications,
primarily due to their reliance on well-aligned paired data. Obtaining such
data is inherently difficult because chemical staining processes can distort
tissue structures, and a single tissue section cannot undergo multiple staining
procedures without damage or loss of information. As a result, most available
virtual staining datasets are either unpaired or roughly paired, making it
difficult for existing methods to achieve accurate pixel-level supervision. To
address this challenge, we propose a robust virtual staining framework
featuring cascaded registration mechanisms to resolve spatial mismatches
between generated outputs and their corresponding ground truth. Experimental
results demonstrate that our method significantly outperforms state-of-the-art
models across five datasets, achieving an average improvement of 3.2% on
internal datasets and 10.1% on external datasets. Moreover, in datasets with
substantial misalignment, our approach achieves a remarkable 23.8% improvement
in peak signal-to-noise ratio compared to baseline models. The exceptional
robustness of the proposed method across diverse datasets simplifies the data
acquisition process for virtual staining and offers new insights for advancing
its development.
comment: arXiv version of a journal paper currently under review
☆ CSMoE: An Efficient Remote Sensing Foundation Model with Soft Mixture-of-Experts
Self-supervised learning through masked autoencoders has attracted great
attention for remote sensing (RS) foundation model (FM) development, enabling
improved representation learning across diverse sensors and downstream tasks.
However, existing RS FMs often either suffer from substantial computational
complexity during both training and inference or exhibit limited
representational capacity. These issues restrict their practical applicability
in RS. To address this limitation, we propose an adaptation for enhancing the
efficiency of RS FMs by integrating the Soft Mixture-of-Experts (MoE) mechanism
into the FM. The integration of Soft MoEs into the FM allows modality-specific
expert specialization alongside shared cross-sensor representation learning. To
demonstrate the effectiveness of our adaptation, we apply it on the
Cross-Sensor Masked Autoencoder (CSMAE) model, resulting in the Cross-Sensor
Mixture-of-Experts (CSMoE) model. In addition, we introduce a thematic-climatic
descriptor-driven sampling strategy for the construction of a representative
and diverse training set to train our CSMoE model. Extensive experiments on
scene classification, semantic segmentation, and content-based image retrieval
demonstrate that our adaptation yields a reduction in computational
requirements while maintaining or improving representational performance.
Compared to state-of-the-art RS FMs, CSMoE achieves a superior trade-off
between representational capacity, accuracy, and computational efficiency. On
average, CSMoE achieves more than twice the computational efficiency of
existing RS FMs, while maintaining competitive performance across all
experiments. These results show the effectiveness of the proposed adaptation
for creating computationally efficient RS FMs. The code for the model, the
training set creation, and the model weights will be available at
https://git.tu-berlin.de/rsim/csmoe.
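For readers unfamiliar with the Soft MoE mechanism referenced above, a minimal PyTorch layer with softmax dispatch and combine weights is sketched below; the expert definition and shapes are illustrative and do not reproduce the CSMoE architecture.

import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    # Minimal Soft Mixture-of-Experts layer: tokens are softly dispatched to
    # learned slots, slots are processed by experts, and outputs are softly
    # combined back per token.
    def __init__(self, dim, num_experts=4, slots_per_expert=1):
        super().__init__()
        self.phi = nn.Parameter(torch.randn(dim, num_experts * slots_per_expert) * dim ** -0.5)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)])
        self.num_experts, self.slots = num_experts, slots_per_expert

    def forward(self, x):                                     # x: (B, N, D) tokens
        logits = torch.einsum('bnd,ds->bns', x, self.phi)     # token-slot affinities
        dispatch = logits.softmax(dim=1)                      # normalize over tokens
        combine = logits.softmax(dim=2)                       # normalize over slots
        slots = torch.einsum('bns,bnd->bsd', dispatch, x)     # (B, E*S, D) slot inputs
        slots = slots.reshape(x.size(0), self.num_experts, self.slots, -1)
        out = torch.stack([e(slots[:, i]) for i, e in enumerate(self.experts)], dim=1)
        out = out.reshape(x.size(0), self.num_experts * self.slots, -1)
        return torch.einsum('bns,bsd->bnd', combine, out)     # back to (B, N, D)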
☆ Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing
Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible,
visible, and audio-visual events without temporal annotations. Previous work
has emphasized refining global predictions through contrastive or collaborative
learning, but neglected stable segment-level supervision and class-aware
cross-modal alignment. To address this, we propose two strategies: (1) an
exponential moving average (EMA)-guided pseudo supervision framework that
generates reliable segment-level masks via adaptive thresholds or top-k
selection, offering stable temporal guidance beyond video-level labels; and (2)
a class-aware cross-modal agreement (CMA) loss that aligns audio and visual
embeddings at reliable segment-class pairs, ensuring consistency across
modalities while preserving temporal structure. Evaluations on LLP and UnAV-100
datasets show that our method achieves state-of-the-art (SOTA) performance
across multiple metrics.
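The segment-level pseudo supervision can be illustrated with a small sketch that masks teacher probabilities by the weak video-level labels and then applies a threshold or top-k rule; the shapes and selection rule below are assumptions, not the authors' exact procedure.

import torch

@torch.no_grad()
def segment_pseudo_masks(teacher_probs, video_labels, thresh=0.5, top_k=None):
    # teacher_probs: (B, T, C) segment-level class probabilities from an EMA
    # teacher; video_labels: (B, C) weak video-level labels. Classes absent at
    # the video level are never pseudo-labelled.
    probs = teacher_probs * video_labels.unsqueeze(1)        # mask absent classes
    if top_k is None:
        pseudo = (probs >= thresh).float()                   # threshold rule
    else:
        # keep the top-k most confident segments per (video, class)
        idx = probs.topk(top_k, dim=1).indices
        pseudo = torch.zeros_like(probs).scatter_(1, idx, 1.0)
        pseudo = pseudo * video_labels.unsqueeze(1)
    return pseudo                                            # (B, T, C) segment masks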
☆ AD-DINOv3: Enhancing DINOv3 for Zero-Shot Anomaly Detection with Anomaly-Aware Calibration
Zero-Shot Anomaly Detection (ZSAD) seeks to identify anomalies from arbitrary
novel categories, offering a scalable and annotation-efficient solution.
Traditionally, most ZSAD works have been based on the CLIP model, which
performs anomaly detection by calculating the similarity between visual and
text embeddings. Recently, vision foundation models such as DINOv3 have
demonstrated strong transferable representation capabilities. In this work, we
are the first to adapt DINOv3 for ZSAD. However, this adaptation presents two
key challenges: (i) the domain bias between large-scale pretraining data and
anomaly detection tasks leads to feature misalignment; and (ii) the inherent
bias toward global semantics in pretrained representations often leads to
subtle anomalies being misinterpreted as part of the normal foreground objects,
rather than being distinguished as abnormal regions. To overcome these
challenges, we introduce AD-DINOv3, a novel vision-language multimodal
framework designed for ZSAD. Specifically, we formulate anomaly detection as a
multimodal contrastive learning problem, where DINOv3 is employed as the visual
backbone to extract patch tokens and a CLS token, and the CLIP text encoder
provides embeddings for both normal and abnormal prompts. To bridge the domain
gap, lightweight adapters are introduced in both modalities, enabling their
representations to be recalibrated for the anomaly detection task. Beyond this
baseline alignment, we further design an Anomaly-Aware Calibration Module
(AACM), which explicitly guides the CLS token to attend to anomalous regions
rather than generic foreground semantics, thereby enhancing discriminability.
Extensive experiments on eight industrial and medical benchmarks demonstrate
that AD-DINOv3 consistently matches or surpasses state-of-the-art methods,
verifying its superiority as a general zero-shot anomaly detection framework.
☆ VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement
Current multi-object tracking (MOT) algorithms typically overlook issues
inherent in low-quality videos, leading to significant degradation in tracking
performance when confronted with real-world image deterioration. Therefore,
advancing the application of MOT algorithms in real-world low-quality video
scenarios represents a critical and meaningful endeavor. To address the
challenges posed by low-quality scenarios, inspired by vision-language models,
this paper proposes a Visual Semantic Enhancement-guided Multi-Object Tracking
framework (VSE-MOT). Specifically, we first design a tri-branch architecture
that leverages a vision-language model to extract global visual semantic
information from images and fuse it with query vectors. Subsequently, to
further enhance the utilization of visual semantic information, we introduce
the Multi-Object Tracking Adapter (MOT-Adapter) and the Visual Semantic Fusion
Module (VSFM). The MOT-Adapter adapts the extracted global visual semantic
information to suit multi-object tracking tasks, while the VSFM improves the
efficacy of feature fusion. Through extensive experiments, we validate the
effectiveness and superiority of the proposed method in real-world low-quality
video scenarios. Its tracking performance metrics outperform those of existing
methods by approximately 8% to 20%, while maintaining robust performance in
conventional scenarios.
☆ Wan-Animate: Unified Character Animation and Replacement with Holistic Replication
Gang Cheng, Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Ju Li, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, Ke Sun, Linrui Tian, Feng Wang, Guangyuan Wang, Qi Wang, Zhongjian Wang, Jiayu Xiao, Sheng Xu, Bang Zhang, Peng Zhang, Xindi Zhang, Zhe Zhang, Jingren Zhou, Lian Zhuo
We introduce Wan-Animate, a unified framework for character animation and
replacement. Given a character image and a reference video, Wan-Animate can
animate the character by precisely replicating the expressions and movements of
the character in the video to generate high-fidelity character videos.
Alternatively, it can integrate the animated character into the reference video
to replace the original character, replicating the scene's lighting and color
tone to achieve seamless environmental integration. Wan-Animate is built upon
the Wan model. To adapt it for character animation tasks, we employ a modified
input paradigm to differentiate between reference conditions and regions for
generation. This design unifies multiple tasks into a common symbolic
representation. We use spatially-aligned skeleton signals to replicate body
motion and implicit facial features extracted from source images to reenact
expressions, enabling the generation of character videos with high
controllability and expressiveness. Furthermore, to enhance environmental
integration during character replacement, we develop an auxiliary Relighting
LoRA. This module preserves the character's appearance consistency while
applying the appropriate environmental lighting and color tone. Experimental
results demonstrate that Wan-Animate achieves state-of-the-art performance. We
are committed to open-sourcing the model weights and source code.
comment: Project Page: https://humanaigc.github.io/wan-animate/
☆ PROFUSEme: PROstate Cancer Biochemical Recurrence Prediction via FUSEd Multi-modal Embeddings
Suhang You, Carla Pitarch-Abaigar, Sanket Kachole, Sumedh Sonawane, Juhyung Ha, Anish Sudarshan Gada, David Crandall, Rakesh Shiradkar, Spyridon Bakas
Almost 30% of prostate cancer (PCa) patients undergoing radical prostatectomy
(RP) experience biochemical recurrence (BCR), characterized by increased
prostate specific antigen (PSA) and associated with increased mortality.
Accurate early prediction of BCR, at the time of RP, would contribute to prompt
adaptive clinical decision-making and improved patient outcomes. In this work,
we propose prostate cancer BCR prediction via fused multi-modal embeddings
(PROFUSEme), which learns cross-modal interactions of clinical, radiology, and
pathology data, following an intermediate fusion configuration in combination
with Cox Proportional Hazards regressors. Quantitative evaluation of our
proposed approach reveals superior performance, when compared with late fusion
configurations, yielding a mean C-index of 0.861 ($\sigma=0.112$) on the
internal 5-fold nested cross-validation framework, and a C-index of 0.7103 on
the held-out data of the CHIMERA 2025 challenge validation leaderboard.
comment: 11 pages, 1 figure, method paper for CHIMERA 2025 Challenge
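As a hedged illustration of pairing fused multimodal embeddings with a Cox Proportional Hazards regressor, the sketch below concatenates pre-computed modality embeddings and fits lifelines' CoxPHFitter; the file names, simple concatenation, and penalizer value are placeholders, and the PROFUSEme intermediate-fusion module that learns cross-modal interactions is not reproduced.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical pre-computed per-patient embeddings (placeholder file names).
clin = np.load("clinical_emb.npy")       # (n_patients, d1)
rad = np.load("radiology_emb.npy")       # (n_patients, d2)
path = np.load("pathology_emb.npy")      # (n_patients, d3)
time_to_bcr = np.load("time.npy")        # follow-up time
event = np.load("event.npy")             # 1 = BCR observed, 0 = censored

# Fusion here is plain concatenation of modality embeddings.
fused = np.concatenate([clin, rad, path], axis=1)
df = pd.DataFrame(fused, columns=[f"f{i}" for i in range(fused.shape[1])])
df["time"], df["event"] = time_to_bcr, event

cph = CoxPHFitter(penalizer=0.1)         # ridge penalty helps with many features
cph.fit(df, duration_col="time", event_col="event")
print(cph.concordance_index_)            # training C-index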
☆ SAIL-VL2 Technical Report
Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng
We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM)
for comprehensive multimodal understanding and reasoning. As the successor to
SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B
parameter scales across diverse image and video benchmarks, demonstrating
strong capabilities from fine-grained perception to complex reasoning. Three
core innovations drive its effectiveness. First, a large-scale data curation
pipeline with scoring and filtering strategies enhances both quality and
distribution across captioning, OCR, QA, and video data, improving training
efficiency. Second, a progressive training framework begins with a powerful
pre-trained vision encoder (SAIL-ViT), advances through multimodal
pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that
systematically strengthens model capabilities. Third, architectural advances
extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs.
With these contributions, SAIL-VL2 demonstrates competitive performance across
106 datasets and achieves state-of-the-art results on challenging reasoning
benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass
leaderboard, SAIL-VL2-2B ranks first among officially released open-source
models under the 4B parameter scale, while serving as an efficient and
extensible foundation for the open-source multimodal community.
comment: Technical Report
☆ Performance Optimization of YOLO-FEDER FusionNet for Robust Drone Detection in Visually Complex Environments
Drone detection in visually complex environments remains challenging due to
background clutter, small object scale, and camouflage effects. While generic
object detectors like YOLO exhibit strong performance in low-texture scenes,
their effectiveness degrades in cluttered environments with low
object-background separability. To address these limitations, this work
presents an enhanced iteration of YOLO-FEDER FusionNet -- a detection framework
that integrates generic object detection with camouflage object detection
techniques. Building upon the original architecture, the proposed iteration
introduces systematic advancements in training data composition, feature fusion
strategies, and backbone design. Specifically, the training process leverages
large-scale, photo-realistic synthetic data, complemented by a small set of
real-world samples, to enhance robustness under visually complex conditions.
The contribution of intermediate multi-scale FEDER features is systematically
evaluated, and detection performance is comprehensively benchmarked across
multiple YOLO-based backbone configurations. Empirical results indicate that
integrating intermediate FEDER features, in combination with backbone upgrades,
contributes to notable performance improvements. In the most promising
configuration -- YOLO-FEDER FusionNet with a YOLOv8l backbone and FEDER
features derived from the DWD module -- these enhancements lead to a false
negative rate (FNR) reduction of up to 39.1 percentage points and an mAP increase of up to 62.8
percentage points at an IoU threshold of 0.5, compared to the initial baseline.
☆ MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment
We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment),
a knowledge distillation approach that transfers region-level multimodal
semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight
vision-only object detector student (e.g., YOLO). A translation module maps
student features into a joint space, where the training of the student and
translator is guided by a dual-objective loss that enforces both local
alignment and global relational consistency. Unlike prior approaches focused on
dense or global alignment, MOCHA operates at the object level, enabling
efficient transfer of semantics without modifying the teacher or requiring
textual input at inference. We validate our method across four personalized
detection benchmarks under few-shot regimes. Results show consistent gains over
baselines, with a +10.1 average score improvement. Despite its compact
architecture, MOCHA reaches performance on par with larger multimodal models,
proving its suitability for real-world deployment.
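A hedged sketch of a dual-objective, object-level distillation loss (local alignment plus global relational consistency) is given below; the translator, the cosine/MSE choices, and the weighting are assumptions, not the exact MOCHA losses.

import torch
import torch.nn.functional as F

def dual_objective_distill_loss(student_obj, teacher_obj, translator, w_rel=1.0):
    # student_obj: (N, D_s) region features from the detector; teacher_obj:
    # (N, D_t) matched region features from the vision-language teacher;
    # translator maps student features into the teacher space.
    z = translator(student_obj)
    local = 1.0 - F.cosine_similarity(z, teacher_obj, dim=-1).mean()
    # Global relational consistency: match object-object similarity structure.
    rel_s = F.normalize(z, dim=-1) @ F.normalize(z, dim=-1).t()
    rel_t = F.normalize(teacher_obj, dim=-1) @ F.normalize(teacher_obj, dim=-1).t()
    relational = F.mse_loss(rel_s, rel_t)
    return local + w_rel * relational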
☆ MetricNet: Recovering Metric Scale in Generative Navigation Policies
Generative navigation policies have made rapid progress in improving
end-to-end learned navigation. Despite their promising results, this paradigm
has two structural problems. First, the sampled trajectories exist in an
abstract, unscaled space without metric grounding. Second, the control strategy
discards the full path, instead moving directly towards a single waypoint. This
leads to short-sighted and unsafe actions, moving the robot towards obstacles
that a complete and correctly scaled path would circumvent. To address these
issues, we propose MetricNet, an effective add-on for generative navigation
that predicts the metric distance between waypoints, grounding policy outputs
in real-world coordinates. We evaluate our method in simulation with a new
benchmarking framework and show that executing MetricNet-scaled waypoints
significantly improves both navigation and exploration performance. Beyond
simulation, we further validate our approach in real-world experiments.
Finally, we propose MetricNav, which integrates MetricNet into a navigation
policy to guide the robot away from obstacles while still moving towards the
goal.
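The metric-grounding step can be illustrated as rescaling each segment of an unscaled generated path by a predicted metric length; the parameterization below is an assumption for illustration, not the MetricNet interface.

import numpy as np

def rescale_waypoints(waypoints, segment_lengths_m):
    # waypoints: (K, 2) path in the policy's unscaled space; segment_lengths_m:
    # (K,) predicted metres from the robot to waypoint 1, waypoint 1 to 2, etc.
    pts = np.vstack([[0.0, 0.0], waypoints])           # prepend robot position
    deltas = np.diff(pts, axis=0)
    norms = np.linalg.norm(deltas, axis=1)
    scale = segment_lengths_m / np.maximum(norms, 1e-6)
    scaled_deltas = deltas * scale[:, None]            # stretch each segment
    return np.cumsum(scaled_deltas, axis=0)            # metric waypoints (K, 2)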
☆ Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation
Visual counting is a fundamental yet challenging task, especially when users
need to count objects of a specific type in complex scenes. While recent
models, including class-agnostic counting models and large vision-language
models (VLMs), show promise in counting tasks, their ability to perform
fine-grained, intent-driven counting remains unclear. In this paper, we
introduce PairTally, a benchmark dataset specifically designed to evaluate
fine-grained visual counting. Each of the 681 high-resolution images in
PairTally contains two object categories, requiring models to distinguish and
count based on subtle differences in shape, size, color, or semantics. The
dataset includes both inter-category (distinct categories) and intra-category
(closely related subcategories) settings, making it suitable for rigorous
evaluation of selective counting capabilities. We benchmark a variety of
state-of-the-art models, including exemplar-based methods, language-prompted
models, and large VLMs. Our results show that despite recent advances, current
models struggle to reliably count what users intend, especially in fine-grained
and visually ambiguous cases. PairTally provides a new foundation for
diagnosing and improving fine-grained visual counting systems.
☆ Noise-Level Diffusion Guidance: Well Begun is Half Done
Diffusion models have achieved state-of-the-art image generation. However,
the random Gaussian noise used to start the diffusion process influences the
final output, causing variations in image quality and prompt adherence.
Existing noise-level optimization approaches generally rely on extra dataset
construction, additional networks, or backpropagation-based optimization,
limiting their practicality. In this paper, we propose Noise Level Guidance
(NLG), a simple, efficient, and general noise-level optimization approach that
refines initial noise by increasing the likelihood of its alignment with
general guidance, requiring no additional training data, auxiliary networks,
or backpropagation. The proposed NLG approach provides a unified framework
generalizable to both conditional and unconditional diffusion models,
accommodating various forms of diffusion-level guidance. Extensive experiments
on five standard benchmarks demonstrate that our approach enhances output
generation quality and input condition adherence. By seamlessly integrating
with existing guidance methods while maintaining computational efficiency, NLG
serves as a practical and scalable enhancement to diffusion
models. Code can be found at
https://github.com/harveymannering/NoiseLevelGuidance.
☆ MAP: End-to-End Autonomous Driving with Map-Assisted Planning ICCV
In recent years, end-to-end autonomous driving has attracted increasing
attention for its ability to jointly model perception, prediction, and planning
within a unified framework. However, most existing approaches underutilize the
online mapping module, leaving its potential to enhance trajectory planning
largely untapped. This paper proposes MAP (Map-Assisted Planning), a novel
map-assisted end-to-end trajectory planning framework. MAP explicitly
integrates segmentation-based map features and the current ego status through a
Plan-enhancing Online Mapping module, an Ego-status-guided Planning module, and
a Weight Adapter based on the current ego status. Experiments conducted on the
DAIR-V2X-seq-SPD dataset demonstrate that the proposed method achieves a 16.6%
reduction in L2 displacement error, a 56.2% reduction in off-road rate, and a
44.5% improvement in overall score compared to the UniV2X baseline, even
without post-processing. Furthermore, it achieves top ranking in Track 2 of the
End-to-End Autonomous Driving through V2X Cooperation Challenge of MEIS
Workshop @CVPR2025, outperforming the second-best model by 39.5% in terms of
overall score. These results highlight the effectiveness of explicitly
leveraging semantic map features in planning and suggest new directions for
improving structure design in end-to-end autonomous driving systems. Our code
is available at https://gitee.com/kymkym/map.git
comment: 8 pages, 2 figures, accepted by ICCVW. Author list updated to match
the camera-ready version, in compliance with conference policy
☆ Towards Robust Defense against Customization via Protective Perturbation Resistant to Diffusion-based Purification ICCV 2025
Diffusion models like Stable Diffusion have become prominent in visual
synthesis tasks due to their powerful customization capabilities, which also
introduce significant security risks, including deepfakes and copyright
infringement. In response, a class of methods known as protective perturbation
emerged, which mitigates image misuse by injecting imperceptible adversarial
noise. However, purification can remove protective perturbations, thereby
exposing images again to the risk of malicious forgery. In this work, we
formalize the anti-purification task, highlighting challenges that hinder
existing approaches, and propose a simple diagnostic protective perturbation
named AntiPure. AntiPure exposes vulnerabilities of purification within the
"purification-customization" workflow, owing to two guidance mechanisms: 1)
Patch-wise Frequency Guidance, which reduces the model's influence over
high-frequency components in the purified image, and 2) Erroneous Timestep
Guidance, which disrupts the model's denoising strategy across different
timesteps. With additional guidance, AntiPure embeds imperceptible
perturbations that persist under representative purification settings,
achieving effective post-customization distortion. Experiments show that, as a
stress test for purification, AntiPure achieves minimal perceptual discrepancy
and maximal distortion, outperforming other protective perturbation methods
within the purification-customization workflow.
comment: Accepted to ICCV 2025
☆ Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration ICML 2025
Large Vision-Language Models (LVLMs) have manifested strong visual question
answering capability. However, they still struggle with aligning the rationale
and the generated answer, leading to inconsistent reasoning and incorrect
responses. To this end, this paper introduces the Self-Rationale Calibration
(SRC) framework to iteratively calibrate the alignment between rationales and
answers. SRC begins by employing a lightweight "rationale fine-tuning"
approach, which modifies the model's response format to require a rationale
before deriving an answer without explicit prompts. Next, SRC searches for a
diverse set of candidate responses from the fine-tuned LVLMs for each sample,
followed by a proposed pairwise scoring strategy using a tailored scoring
model, R-Scorer, to evaluate both rationale quality and factual consistency of
candidates. Based on a confidence-weighted preference curation process, SRC
recasts the alignment calibration as preference fine-tuning,
leading to significant improvements of LVLMs in perception, reasoning, and
generalization across multiple benchmarks. Our results emphasize the importance
of rationale-oriented alignment in unlocking the potential of LVLMs.
comment: Accepted by ICML 2025
☆ White Aggregation and Restoration for Few-shot 3D Point Cloud Semantic Segmentation
Few-Shot 3D Point Cloud Segmentation (FS-PCS) aims to predict per-point
labels for an unlabeled point cloud, given only a few labeled examples. To
extract discriminative representations from the limited support set, existing
methods have constructed prototypes using conventional algorithms such as
farthest point sampling. However, we point out that its initial randomness
significantly affects FS-PCS performance and that the prototype generation
process remains underexplored despite its prevalence. This motivates us to
investigate an advanced prototype generation method based on attention
mechanism. Despite its potential, we found that a vanilla attention module suffers from the
distributional gap between learnable prototypical tokens and support features.
To overcome this, we propose White Aggregation and Restoration Module (WARM),
which resolves the misalignment by sandwiching cross-attention between
whitening and coloring transformations. Specifically, whitening aligns the
support features to prototypical tokens before attention process, and
subsequently coloring restores the original distribution to the attended
tokens. This simple yet effective design enables robust attention, thereby
generating representative prototypes by capturing the semantic relationships
among support features. Our method achieves state-of-the-art performance with a
significant margin on multiple FS-PCS benchmarks, demonstrating its
effectiveness through extensive experiments.
comment: 9 pages, 5 figures
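A minimal sketch of the whitening, cross-attention, coloring sandwich is given below, using a ZCA-style whitening of the support features and restoring their statistics after attention; the epsilon, the transform details, and the attention module are assumptions, not the WARM implementation.

import torch

def _whiten_stats(x, eps=1e-5):
    # Mean and (inverse-)sqrt covariance of features x: (N, D), ZCA style.
    mu = x.mean(dim=0, keepdim=True)
    xc = x - mu
    cov = xc.t() @ xc / (x.size(0) - 1) + eps * torch.eye(x.size(1), device=x.device)
    evals, evecs = torch.linalg.eigh(cov)
    w = evecs @ torch.diag(evals.clamp_min(eps).rsqrt()) @ evecs.t()
    w_inv = evecs @ torch.diag(evals.clamp_min(eps).sqrt()) @ evecs.t()
    return mu, w, w_inv

def whiten_attend_color(proto_tokens, support_feats, attn):
    # attn is any cross-attention module taking (query, key, value) with a
    # leading batch dimension, e.g. nn.MultiheadAttention(batch_first=True).
    mu, w, w_inv = _whiten_stats(support_feats)
    white = (support_feats - mu) @ w                    # align support features
    out, _ = attn(proto_tokens.unsqueeze(0), white.unsqueeze(0), white.unsqueeze(0))
    return out.squeeze(0) @ w_inv + mu                  # restore original distribution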
☆ EvHand-FPV: Efficient Event-Based 3D Hand Tracking from First-Person View
Hand tracking holds great promise for intuitive interaction paradigms, but
frame-based methods often struggle to meet the requirements of accuracy, low
latency, and energy efficiency, especially in resource-constrained settings
such as Extended Reality (XR) devices. Event cameras provide µs-level
temporal resolution at mW-level power by asynchronously sensing brightness
changes. In this work, we present EvHand-FPV, a lightweight framework for
egocentric First-Person-View 3D hand tracking from a single event camera. We
construct an event-based FPV dataset that couples synthetic training data with
3D labels and real event data with 2D labels for evaluation to address the
scarcity of egocentric benchmarks. EvHand-FPV also introduces a wrist-based
region of interest (ROI) that localizes the hand region via geometric cues,
combined with an end-to-end mapping strategy that embeds ROI offsets into the
network to reduce computation without explicit reconstruction, and a multi-task
learning strategy with an auxiliary geometric feature head that improves
representations without test-time overhead. On our real FPV test set,
EvHand-FPV improves 2D-AUCp from 0.77 to 0.85 while reducing parameters from
11.2M to 1.2M (89% fewer) and FLOPs per inference from 1.648G to 0.185G (89% fewer). It
also maintains a competitive 3D-AUCp of 0.84 on synthetic data. These results
demonstrate accurate and efficient egocentric event-based hand tracking
suitable for on-device XR applications. The dataset and code are available at
https://github.com/zen5x5/EvHand-FPV.
comment: 8 pages
☆ Invisible Yet Detected: PelFANet with Attention-Guided Anatomical Fusion for Pelvic Fracture Diagnosis MICCAI
Siam Tahsin Bhuiyan, Rashedur Rahman, Sefatul Wasi, Naomi Yagi, Syoji Kobashi, Ashraful Islam, Saadia Binte Alam
Pelvic fractures pose significant diagnostic challenges, particularly in
cases where fracture signs are subtle or invisible on standard radiographs. To
address this, we introduce PelFANet, a dual-stream attention network that fuses
raw pelvic X-rays with segmented bone images to improve fracture
classification. The network employs Fused Attention Blocks (FABlocks) to
iteratively exchange and refine features from both inputs, capturing global
context and localized anatomical detail. Trained in a two-stage pipeline with a
segmentation-guided approach, PelFANet demonstrates superior performance over
conventional methods. On the AMERI dataset, it achieves 88.68% accuracy and
0.9334 AUC on visible fractures, while generalizing effectively to invisible
fracture cases with 82.29% accuracy and 0.8688 AUC, despite not being trained
on them. These results highlight the clinical potential of anatomy-aware
dual-input architectures for robust fracture detection, especially in
scenarios with subtle radiographic presentations.
comment: Accepted at MICCAI EMERGE 2025
☆ Distractor-Aware Memory-Based Visual Object Tracking
Recent emergence of memory-based video segmentation methods such as SAM2 has
led to models with excellent performance in segmentation tasks, achieving
leading results on numerous benchmarks. However, these models are not fully
adapted for visual object tracking, where distractors (i.e., objects visually
similar to the target) pose a key challenge. In this paper we propose a
distractor-aware drop-in memory module and introspection-based management
method for SAM2, leading to DAM4SAM. Our design effectively reduces the
tracking drift toward distractors and improves redetection capability after
object occlusion. To facilitate the analysis of tracking in the presence of
distractors, we construct DiDi, a Distractor-Distilled dataset. DAM4SAM
outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results
on ten. Furthermore, integrating the proposed distractor-aware memory into a
real-time tracker EfficientTAM leads to 11% improvement and matches tracking
quality of the non-real-time SAM2.1-L on multiple tracking and segmentation
benchmarks, while integration with edge-based tracker EdgeTAM delivers 4%
performance boost, demonstrating a very good generalization across
architectures.
comment: Code available on Github: https://github.com/jovanavidenovic/DAM4SAM
☆ LamiGauss: Pitching Radiative Gaussian for Sparse-View X-ray Laminography Reconstruction
X-ray Computed Laminography (CL) is essential for non-destructive inspection
of plate-like structures in applications such as microchips and composite
battery materials, where traditional computed tomography (CT) struggles due to
geometric constraints. However, reconstructing high-quality volumes from
laminographic projections remains challenging, particularly under highly
sparse-view acquisition conditions. In this paper, we propose a reconstruction
algorithm, namely LamiGauss, that combines Gaussian Splatting radiative
rasterization with a dedicated detector-to-world transformation model
incorporating the laminographic tilt angle. LamiGauss leverages an
initialization strategy that explicitly filters out common laminographic
artifacts from the preliminary reconstruction, preventing redundant Gaussians
from being allocated to false structures and thereby concentrating model
capacity on representing the genuine object. Our approach effectively optimizes
directly from sparse projections, enabling accurate and efficient
reconstruction with limited data. Extensive experiments on both synthetic and
real datasets demonstrate the effectiveness and superiority of the proposed
method over existing techniques. LamiGauss uses only 3% of full views to
achieve superior performance over the iterative method optimized on a full
dataset.
☆ EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics
Dataset distillation aims to synthesize a compact dataset from the original
large-scale one, enabling highly efficient learning while preserving
competitive model performance. However, traditional techniques primarily
capture low-level visual features, neglecting the high-level semantic and
structural information inherent in images. In this paper, we propose EDITS, a
novel framework that exploits the implicit textual semantics within the image
data to achieve enhanced distillation. First, external texts generated by a
Vision Language Model (VLM) are fused with image features through a Global
Semantic Query module, forming the prior clustered buffer. Local Semantic
Awareness then selects representative samples from the buffer to construct
image and text prototypes, with the latter produced by guiding a Large Language
Model (LLM) with a meticulously crafted prompt. Ultimately, a Dual Prototype
Guidance strategy generates the final synthetic dataset through a diffusion
model. Extensive experiments confirm the effectiveness of our method. Source
code is available at https://github.com/einsteinxia/EDITS.
☆ InterKey: Cross-modal Intersection Keypoints for Global Localization on OpenStreetMap
Reliable global localization is critical for autonomous vehicles, especially
in environments where GNSS is degraded or unavailable, such as urban canyons
and tunnels. Although high-definition (HD) maps provide accurate priors, the
cost of data collection, map construction, and maintenance limits scalability.
OpenStreetMap (OSM) offers a free and globally available alternative, but its
coarse abstraction poses challenges for matching with sensor data. We propose
InterKey, a cross-modal framework that leverages road intersections as
distinctive landmarks for global localization. Our method constructs compact
binary descriptors by jointly encoding road and building imprints from point
clouds and OSM. To bridge modality gaps, we introduce discrepancy mitigation,
orientation determination, and area-equalized sampling strategies, enabling
robust cross-modal matching. Experiments on the KITTI dataset demonstrate that
InterKey achieves state-of-the-art accuracy, outperforming recent baselines by
a large margin. The framework generalizes to sensors that can produce dense
structural point clouds, offering a scalable and cost-effective solution for
robust vehicle localization.
comment: 8 pages, 5 figures
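The compact binary descriptor matching can be illustrated with packed bit vectors compared by Hamming distance, as in the hedged sketch below; the descriptor construction, orientation handling, and area-equalized sampling from the paper are not reproduced.

import numpy as np

def pack_descriptor(bits):
    # Pack a boolean descriptor (e.g., rasterized road/building imprints
    # around an intersection) into bytes for compact storage.
    return np.packbits(np.asarray(bits, dtype=np.uint8))

def hamming(a, b):
    # Hamming distance between two packed binary descriptors.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def match_intersection(query_desc, osm_descs):
    # Nearest-neighbour search over descriptors built from OSM intersections.
    dists = [hamming(query_desc, d) for d in osm_descs]
    best = int(np.argmin(dists))
    return best, dists[best]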
☆ SpecDiff: Accelerating Diffusion Model Inference with Self-Speculation
Feature caching has recently emerged as a promising method for diffusion
model acceleration. It effectively alleviates the inefficiency problem caused
by high computational requirements by caching similar features in the inference
process of the diffusion model. In this paper, we analyze existing feature
caching methods from the perspective of information utilization, and point out
that relying solely on historical information will lead to constrained accuracy
and speed performance. And we propose a novel paradigm that introduces future
information via self-speculation based on the information similarity at the
same time step across different iteration times. Based on this paradigm, we
present SpecDiff, a training-free multi-level feature caching strategy
including a cached feature selection algorithm and a multi-level feature
classification algorithm. (1) Feature selection algorithm based on
self-speculative information. SpecDiff determines a dynamic importance
score for each token based on self-speculative information and historical
information, and performs cached feature selection through the importance
score. (2) Multi-level feature classification algorithm based on feature
importance scores. SpecDiff classifies tokens by leveraging the
differences in feature importance scores and introduces a multi-level feature
calculation strategy. Extensive experiments show that SpecDiff
achieves average speedups of 2.80x, 2.74x, and 3.17x with
negligible quality loss in Stable Diffusion 3, 3.5, and FLUX compared to RFlow
on an NVIDIA A800-80GB GPU. By merging speculative and historical information,
SpecDiff overcomes the speedup-accuracy trade-off bottleneck, pushing
the Pareto frontier of speedup and accuracy in the efficient diffusion model
inference.
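A hedged sketch of importance-scored feature caching is shown below: tokens whose current features agree with both historical and self-speculated features are treated as stable and reuse cached values; the equal weighting and the simplified two-level split are assumptions, not SpecDiff's multi-level classification.

import torch

def select_cached_tokens(hist_feat, spec_feat, cur_feat, cache_ratio=0.6):
    # hist_feat / spec_feat / cur_feat: (N, D) token features from the previous
    # step, a self-speculated estimate, and the current step, respectively.
    sim_hist = torch.cosine_similarity(cur_feat, hist_feat, dim=-1)
    sim_spec = torch.cosine_similarity(cur_feat, spec_feat, dim=-1)
    importance = 1.0 - 0.5 * (sim_hist + sim_spec)      # dissimilar => important
    n_cache = int(cache_ratio * cur_feat.size(0))
    order = importance.argsort()
    cached_idx = order[:n_cache]                        # least important: reuse cache
    recompute_idx = order[n_cache:]                     # most important: recompute
    return cached_idx, recompute_idx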
☆ Consistent View Alignment Improves Foundation Models for 3D Medical Image Segmentation MICCAI 2025
Many recent approaches in representation learning implicitly assume that
uncorrelated views of a data point are sufficient to learn meaningful
representations for various downstream tasks. In this work, we challenge this
assumption and demonstrate that meaningful structure in the latent space does
not emerge naturally. Instead, it must be explicitly induced. We propose a
method that aligns representations from different views of the data so as to
combine complementary information without inducing false positives. Our experiments
show that our proposed self-supervised learning method, Consistent View
Alignment, improves performance for downstream tasks, highlighting the critical
role of structured view alignment in learning effective representations. Our
method achieved first and second place in the MICCAI 2025 SSL3D challenge when
using a Primus vision transformer and ResEnc convolutional neural network,
respectively. The code and pretrained model weights are released at
https://github.com/Tenbatsu24/LatentCampus.
comment: MICCAI 2025: 1st Place in Transformer track and 2nd Place in
Convolution track of SSL3D-OpenMind challenge
☆ Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models EMNLP2025
Object hallucination in Large Vision-Language Models (LVLMs) significantly
impedes their real-world applicability. As the primary component for accurately
interpreting visual information, the choice of visual encoder is pivotal. We
hypothesize that the diverse training paradigms employed by different visual
encoders instill them with distinct inductive biases, which leads to their
diverse hallucination performances. Existing benchmarks typically focus on
coarse-grained hallucination detection and fail to capture the diverse
hallucinations elaborated in our hypothesis. To systematically analyze these
effects, we introduce VHBench-10, a comprehensive benchmark with approximately
10,000 samples for evaluating LVLMs across ten fine-grained hallucination
categories. Our evaluations confirm that different encoders exhibit unique hallucination
characteristics. Building on these insights and the suboptimality of simple
feature fusion, we propose VisionWeaver, a novel Context-Aware Routing Network.
It employs global visual features to generate routing signals, dynamically
aggregating visual features from multiple specialized experts. Comprehensive
experiments confirm the effectiveness of VisionWeaver in significantly reducing
hallucinations and improving overall model performance.
comment: Accepted to EMNLP 2025 Findings
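The context-aware routing can be illustrated with a small gating layer that weights per-expert visual features by a softmax over gates predicted from a global feature; the dimensions and gating form below are assumptions, not the VisionWeaver design.

import torch
import torch.nn as nn

class ContextAwareRouter(nn.Module):
    # Routes visual features from several specialized encoders using gates
    # predicted from a global visual feature.
    def __init__(self, global_dim, num_experts):
        super().__init__()
        self.gate = nn.Linear(global_dim, num_experts)

    def forward(self, global_feat, expert_feats):
        # global_feat: (B, Dg); expert_feats: (B, E, N, D) features per encoder.
        weights = self.gate(global_feat).softmax(dim=-1)              # (B, E)
        return (weights[:, :, None, None] * expert_feats).sum(dim=1)  # (B, N, D)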
☆ Semi-MoE: Mixture-of-Experts meets Semi-Supervised Histopathology Segmentation BMVC 2025
Nguyen Lan Vi Vu, Thanh-Huy Nguyen, Thien Nguyen, Daisuke Kihara, Tianyang Wang, Xingjian Li, Min Xu
Semi-supervised learning has been employed to alleviate the need for
extensive labeled data for histopathology image segmentation, but existing
methods struggle with noisy pseudo-labels due to ambiguous gland boundaries and
morphological misclassification. This paper introduces Semi-MoE, to the best of
our knowledge, the first multi-task Mixture-of-Experts framework for
semi-supervised histopathology image segmentation. Our approach leverages three
specialized expert networks: a main segmentation expert, a signed distance
field regression expert, and a boundary prediction expert, each dedicated to
capturing distinct morphological features. Subsequently, the Multi-Gating
Pseudo-labeling module dynamically aggregates expert features, enabling a
robust fuse-and-refine pseudo-labeling mechanism. Furthermore, to eliminate
manual tuning while dynamically balancing multiple learning objectives, we
propose an Adaptive Multi-Objective Loss. Extensive experiments on GlaS and
CRAG benchmarks show that our method outperforms state-of-the-art approaches in
low-label settings, highlighting the potential of MoE-based architectures in
advancing semi-supervised segmentation. Our code is available at
https://github.com/vnlvi2k3/Semi-MoE.
comment: Accepted to BMVC 2025
☆ Data-Efficient Spectral Classification of Hyperspectral Data Using MiniROCKET and HDC-MiniROCKET IEEE
The classification of pixel spectra of hyperspectral images, i.e. spectral
classification, is used in many fields ranging from agricultural, over medical
to remote sensing applications and is currently also expanding to areas such as
autonomous driving. Even though for full hyperspectral images the
best-performing methods exploit spatial-spectral information, performing
classification solely on spectral information has its own advantages, e.g.
smaller model size and thus less data required for training. Moreover, spectral
information is complementary to spatial information and improvements on either
part can be used to improve spatial-spectral approaches in the future.
Recently, 1D-Justo-LiuNet was proposed as a particularly efficient model with
very few parameters, which currently defines the state of the art in spectral
classification. However, we show that with limited training data the model
performance deteriorates. Therefore, we investigate MiniROCKET and
HDC-MiniROCKET for spectral classification to mitigate that problem. The model
extracts well-engineered features without trainable parameters in the feature
extraction part and is therefore less vulnerable to limited training data. We
show that even though MiniROCKET has more parameters, it outperforms
1D-Justo-LiuNet in limited-data scenarios and is mostly on par with it in the
general case.
comment: Accepted for publication at IEEE CASE 2025
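To make the feature-extraction-without-training idea concrete, the sketch
below treats each pixel spectrum as a short 1D series, computes MiniROCKET
features, and fits a ridge classifier. It assumes the sktime implementation of
MiniRocket (import paths can differ across versions) and uses randomly
generated spectra as stand-in data.

```python
# Illustrative use of MiniROCKET features with a linear classifier for
# per-pixel spectral classification (each spectrum treated as a 1D series).
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sktime.transformations.panel.rocket import MiniRocket

# Dummy data: 200 spectra with 103 bands each, 5 classes.
X = np.random.rand(200, 1, 103)       # (n_spectra, n_channels, n_bands)
y = np.random.randint(0, 5, size=200)

minirocket = MiniRocket()             # random convolutional features,
X_feat = minirocket.fit_transform(X)  # no trainable parameters in this step

clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
clf.fit(X_feat, y)
print(clf.score(X_feat, y))
```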
☆ Masked Feature Modeling Enhances Adaptive Segmentation
Unsupervised domain adaptation (UDA) for semantic segmentation aims to
transfer models from a labeled source domain to an unlabeled target domain.
While auxiliary self-supervised tasks, particularly contrastive learning, have
improved feature discriminability, masked modeling approaches remain
underexplored in this setting, largely due to architectural incompatibility and
misaligned optimization objectives. We propose Masked Feature Modeling (MFM), a
novel auxiliary task that performs feature masking and reconstruction directly
in the feature space. Unlike existing masked modeling methods that reconstruct
low-level inputs or perceptual features (e.g., HOG or visual tokens), MFM
aligns its learning target with the main segmentation task, ensuring
compatibility with standard architectures like DeepLab and DAFormer without
modifying the inference pipeline. To facilitate effective reconstruction, we
introduce a lightweight auxiliary module, Rebuilder, which is trained jointly
but discarded during inference, adding zero computational overhead at test
time. Crucially, MFM leverages the segmentation decoder to classify the
reconstructed features, tightly coupling the auxiliary objective with the
pixel-wise prediction task to avoid interference with the primary task.
Extensive experiments across various architectures and UDA benchmarks
demonstrate that MFM consistently enhances segmentation performance, offering a
simple, efficient, and generalizable strategy for unsupervised domain-adaptive
semantic segmentation.
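The following minimal sketch, written against generic PyTorch modules, shows
the gist of masking and reconstructing in feature space and classifying the
reconstruction with the segmentation decoder; the Rebuilder shown here is an
illustrative stand-in, not the paper's exact module.

```python
# Minimal sketch of masked feature modeling in feature space: mask a fraction
# of encoder features, reconstruct them with a lightweight "Rebuilder", and
# classify the reconstruction with the segmentation decoder.
import torch
import torch.nn as nn

def masked_feature_loss(feats, rebuilder, decoder, labels, mask_ratio=0.5):
    B, C, H, W = feats.shape
    mask = (torch.rand(B, 1, H, W, device=feats.device) > mask_ratio).float()
    rebuilt = rebuilder(feats * mask)            # reconstruct in feature space
    logits = decoder(rebuilt)                    # reuse the segmentation decoder
    return nn.functional.cross_entropy(logits, labels)

rebuilder = nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(256, 256, 3, padding=1))
decoder = nn.Conv2d(256, 19, 1)                  # stand-in for the real decoder
feats = torch.randn(2, 256, 64, 64)
labels = torch.randint(0, 19, (2, 64, 64))
loss = masked_feature_loss(feats, rebuilder, decoder, labels)
```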
☆ SWA-PF: Semantic-Weighted Adaptive Particle Filter for Memory-Efficient 4-DoF UAV Localization in GNSS-Denied Environments
Vision-based Unmanned Aerial Vehicle (UAV) localization systems have been
extensively investigated for Global Navigation Satellite System (GNSS)-denied
environments. However, existing retrieval-based approaches face limitations in
dataset availability and persistent challenges including suboptimal real-time
performance, environmental sensitivity, and limited generalization capability,
particularly in dynamic or temporally varying environments. To overcome these
limitations, we present a large-scale Multi-Altitude Flight Segments dataset
(MAFS) for variable altitude scenarios and propose a novel Semantic-Weighted
Adaptive Particle Filter (SWA-PF) method. This approach integrates robust
semantic features from both UAV-captured images and satellite imagery through
two key innovations: a semantic weighting mechanism and an optimized particle
filtering architecture. Evaluated using our dataset, the proposed method
achieves a 10x computational efficiency gain over feature extraction methods,
maintains global positioning errors below 10 meters, and enables rapid
4-degree-of-freedom (4-DoF) pose estimation within seconds using accessible
low-resolution satellite maps. Code and dataset will be available at
https://github.com/YuanJiayuuu/SWA-PF.
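A minimal sketch of a semantic-weighted particle update is given below: each
4-DoF particle is scored by how well the semantic labels observed by the UAV
agree with a semantic satellite map at the particle's pose. The matching rule
and noise model are assumptions for illustration, not the released
implementation.

```python
# Sketch of a semantic-weighted particle update for 4-DoF (x, y, z, yaw)
# localization against a semantic satellite map. Illustrative only.
import numpy as np

def update_particles(particles, weights, obs_labels, obs_xy, sem_map, sigma=1.0):
    """particles: (N, 4) array of [x, y, z, yaw]; obs_xy: (M, 2) body-frame points."""
    new_w = np.empty(len(particles))
    for i, (x, y, _, yaw) in enumerate(particles):
        c, s = np.cos(yaw), np.sin(yaw)
        world = obs_xy @ np.array([[c, -s], [s, c]]).T + np.array([x, y])
        gx = np.clip(world[:, 0].astype(int), 0, sem_map.shape[1] - 1)
        gy = np.clip(world[:, 1].astype(int), 0, sem_map.shape[0] - 1)
        agreement = (sem_map[gy, gx] == obs_labels).mean()   # semantic consistency
        new_w[i] = weights[i] * np.exp(agreement / sigma)
    return new_w / new_w.sum()
```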
☆ Bridging the Synthetic-Real Gap: Supervised Domain Adaptation for Robust Spacecraft 6-DoF Pose Estimation
Spacecraft Pose Estimation (SPE) is a fundamental capability for autonomous
space operations such as rendezvous, docking, and in-orbit servicing. Hybrid
pipelines that combine object detection, keypoint regression, and
Perspective-n-Point (PnP) solvers have recently achieved strong results on
synthetic datasets, yet their performance deteriorates sharply on real or
lab-generated imagery due to the persistent synthetic-to-real domain gap.
Existing unsupervised domain adaptation approaches aim to mitigate this issue
but often underperform when a modest number of labeled target samples are
available. In this work, we propose the first Supervised Domain Adaptation
(SDA) framework tailored for SPE keypoint regression. Building on the Learning
Invariant Representation and Risk (LIRR) paradigm, our method jointly optimizes
domain-invariant representations and task-specific risk using both labeled
synthetic and limited labeled real data, thereby reducing generalization error
under domain shift. Extensive experiments on the SPEED+ benchmark demonstrate
that our approach consistently outperforms source-only, fine-tuning, and oracle
baselines. Notably, with only 5% labeled target data, our method matches or
surpasses oracle performance trained on larger fractions of labeled data. The
framework is lightweight, backbone-agnostic, and computationally efficient,
offering a practical pathway toward robust and deployable spacecraft pose
estimation in real-world space environments.
☆ BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
Recent advancements in Diffusion Transformers (DiTs) have established them as
the state-of-the-art method for video generation. However, their inherently
sequential denoising process results in inevitable latency, limiting real-world
applicability. Existing acceleration methods either compromise visual quality
due to architectural modifications or fail to reuse intermediate features at
proper granularity. Our analysis reveals that DiT blocks are the primary
contributors to inference latency. Across diffusion timesteps, the feature
variations of DiT blocks exhibit a U-shaped pattern with high similarity during
intermediate timesteps, which suggests substantial computational redundancy. In
this paper, we propose Block-Wise Caching (BWCache), a training-free method to
accelerate DiT-based video generation. BWCache dynamically caches and reuses
features from DiT blocks across diffusion timesteps. Furthermore, we introduce
a similarity indicator that triggers feature reuse only when the differences
between block features at adjacent timesteps fall below a threshold, thereby
minimizing redundant computations while maintaining visual fidelity. Extensive
experiments on several video diffusion models demonstrate that BWCache achieves
up to 2.24$\times$ speedup with comparable visual quality.
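A minimal sketch of the caching rule is shown below: a block's cached output
is reused whenever its input changes little between adjacent timesteps. The
threshold value, distance metric, and block interface are illustrative
assumptions rather than the paper's exact settings.

```python
# Sketch of block-wise caching: reuse a DiT block's output from the previous
# timestep when the change in its input falls below a threshold.
import torch

class BlockCache:
    def __init__(self, threshold: float = 0.05):
        self.threshold = threshold
        self.prev_input = None
        self.prev_output = None

    def __call__(self, block, x: torch.Tensor) -> torch.Tensor:
        if self.prev_input is not None:
            # Relative L1 change between inputs at adjacent timesteps.
            diff = (x - self.prev_input).abs().mean() / (self.prev_input.abs().mean() + 1e-8)
            if diff < self.threshold:
                return self.prev_output          # reuse cached features
        out = block(x)                           # otherwise recompute
        self.prev_input, self.prev_output = x.detach(), out.detach()
        return out
```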
☆ CETUS: Causal Event-Driven Temporal Modeling With Unified Variable-Rate Scheduling
Event cameras capture asynchronous pixel-level brightness changes with
microsecond temporal resolution, offering unique advantages for high-speed
vision tasks. Existing methods often convert event streams into intermediate
representations such as frames, voxel grids, or point clouds, which inevitably
require predefined time windows and thus introduce window latency. Meanwhile,
pointwise detection methods incur a high computational cost that prevents
real-time operation. To overcome these
limitations, we propose the Variable-Rate Spatial Event Mamba, a novel
architecture that directly processes raw event streams without intermediate
representations. Our method introduces a lightweight causal spatial
neighborhood encoder to efficiently capture local geometric relations, followed
by Mamba-based state space models for scalable temporal modeling with linear
complexity. During inference, a controller adaptively adjusts the processing
speed according to the event rate, achieving an optimal balance between window
latency and inference latency.
comment: 8 pages, 6 figures
☆ Morphology-optimized Multi-Scale Fusion: Combining Local Artifacts and Mesoscopic Semantics for Deepfake Detection and Localization IJCAI 2025
Chao Shuai, Gaojian Wang, Kun Pan, Tong Wu, Fanli Jin, Haohan Tan, Mengxiang Li, Zhenguang Liu, Feng Lin, Kui Ren
While the pursuit of higher accuracy in deepfake detection remains a central
goal, there is an increasing demand for precise localization of manipulated
regions. Despite the remarkable progress made in classification-based
detection, accurately localizing forged areas remains a significant challenge.
A common strategy is to incorporate forged region annotations during model
training alongside manipulated images. However, such approaches often neglect
the complementary nature of local detail and global semantic context, resulting
in suboptimal localization performance. Moreover, an often-overlooked aspect is
the fusion strategy between local and global predictions. Naively combining the
outputs from both branches can amplify noise and errors, thereby undermining
the effectiveness of the localization.
To address these issues, we propose a novel approach that independently
predicts manipulated regions using both local and global perspectives. We
employ morphological operations to fuse the outputs, effectively suppressing
noise while enhancing spatial coherence. Extensive experiments reveal the
effectiveness of each module in improving the accuracy and robustness of
forgery localization.
comment: The 3rd Place, IJCAI 2025 Workshop on Deepfake Detection,
Localization, and Interpretability
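To illustrate the kind of morphological fusion described above, the sketch
below cleans a local-artifact mask and a global (mesoscopic) mask with OpenCV
morphology and intersects them; the specific operations and kernel sizes are
illustrative choices, not the authors' exact pipeline.

```python
# Sketch of fusing a local-artifact mask and a mesoscopic (global) mask with
# morphological operations to suppress noise while keeping spatial coherence.
import cv2
import numpy as np

def fuse_masks(local_mask: np.ndarray, global_mask: np.ndarray) -> np.ndarray:
    """local_mask, global_mask: uint8 binary maps (0/255) of the same size."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    # Opening removes small spurious detections from the local branch.
    local_clean = cv2.morphologyEx(local_mask, cv2.MORPH_OPEN, kernel)
    # Closing fills small holes in the coarser global prediction.
    global_clean = cv2.morphologyEx(global_mask, cv2.MORPH_CLOSE, kernel)
    fused = cv2.bitwise_and(local_clean, global_clean)       # agreement of both views
    return cv2.morphologyEx(fused, cv2.MORPH_CLOSE, kernel)  # smooth the result
```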
☆ AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving
Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, Long Chen, Bing Wang, Zhi-xin Yang
Reasoning techniques such as Chain of Thought (CoT) have been widely adopted
in Vision-Language-Action (VLA) models and demonstrate promising capabilities
in end-to-end autonomous driving. However, recent efforts to
integrate CoT reasoning often fall short in simple scenarios, introducing
unnecessary computational overhead without improving decision quality. To
address this, we propose AdaThinkDrive, a novel VLA framework with a dual mode
reasoning mechanism inspired by fast and slow thinking. First, our framework is
pretrained on large scale autonomous driving (AD) scenarios using both question
answering (QA) and trajectory datasets to acquire world knowledge and driving
commonsense. During supervised fine-tuning (SFT), we introduce a two-mode
dataset, fast answering (w/o CoT) and slow thinking (with CoT), enabling the
model to distinguish scenarios that require reasoning from those that do not.
Furthermore, an
Adaptive Think Reward strategy is proposed in conjunction with the Group
Relative Policy Optimization (GRPO), which rewards the model for selectively
applying CoT by comparing trajectory quality across different reasoning modes.
Extensive experiments on the Navsim benchmark show that AdaThinkDrive achieves
a PDMS of 90.3, surpassing the best vision only baseline by 1.7 points.
Moreover, ablations show that AdaThinkDrive surpasses both the never-Think and
always-Think baselines, improving PDMS by 2.0 and 1.4, respectively. It also
reduces inference time by 14% compared to the always-Think baseline,
demonstrating its ability to balance accuracy and efficiency through adaptive
reasoning.
☆ Generative Image Coding with Diffusion Prior
As generative technologies advance, visual content has evolved into a complex
mix of natural and AI-generated images, driving the need for more efficient
coding techniques that prioritize perceptual quality. Traditional codecs and
learned methods struggle to maintain subjective quality at high compression
ratios, while existing generative approaches face challenges in visual fidelity
and generalization. To this end, we propose a novel generative coding framework
leveraging diffusion priors to enhance compression performance at low bitrates.
Our approach employs a pre-optimized encoder to generate generalized
compressed-domain representations, integrated with the pretrained model's
internal features via a lightweight adapter and an attentive fusion module.
This framework effectively leverages existing pretrained diffusion models and
enables efficient adaptation to different pretrained models for new
requirements with minimal retraining costs. We also introduce a distribution
renormalization method to further enhance reconstruction fidelity. Extensive
experiments show that our method (1) outperforms existing methods in visual
fidelity across low bitrates, (2) improves compression performance by up to 79%
over H.266/VVC, and (3) offers an efficient solution for AI-generated content
while being adaptable to broader content types.
☆ VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI ICASSP
Daiqi Liu, Tomás Arias-Vergara, Johannes Enk, Fangxu Xing, Maureen Stone, Jerry L. Prince, Jana Hutter, Andreas Maier, Jonghye Woo, Paula Andrea Pérez-Toro
Accurately segmenting articulatory structures in real-time magnetic resonance
imaging (rtMRI) remains challenging, as most existing methods rely almost
entirely on visual cues. Yet synchronized acoustic and phonological signals
provide complementary context that can enrich visual information and improve
precision. In this paper, we introduce VocSegMRI, a multimodal framework that
integrates video, audio, and phonological inputs through cross-attention fusion
for dynamic feature alignment. To further enhance cross-modal representation,
we incorporate a contrastive learning objective that improves segmentation
performance even when the audio modality is unavailable at inference. Evaluated
on a subset of the USC-75 rtMRI dataset, our approach achieves state-of-the-art
performance, with a Dice score of 0.95 and a 95th percentile Hausdorff Distance
(HD_95) of 4.20 mm, outperforming both unimodal and multimodal baselines.
Ablation studies confirm the contributions of cross-attention and contrastive
learning to segmentation precision and robustness. These results highlight the
value of integrative multimodal modeling for accurate vocal tract analysis.
comment: Preprint submitted to ICASSP
☆ NDLPNet: A Location-Aware Nighttime Deraining Network and a Real-World Benchmark Dataset
Visual degradation caused by rain streak artifacts in low-light conditions
significantly hampers the performance of nighttime surveillance and autonomous
navigation. Existing image deraining techniques are primarily designed for
daytime conditions and perform poorly under nighttime illumination due to the
spatial heterogeneity of rain distribution and the impact of light-dependent
stripe visibility. In this paper, we propose a novel Nighttime Deraining
Location-enhanced Perceptual Network (NDLPNet) that effectively captures the
spatial positional information and density distribution of rain streaks in
low-light environments. Specifically, we introduce a Position Perception Module
(PPM) to capture and leverage spatial contextual information from input data,
enhancing the model's capability to identify and recalibrate the importance of
different feature channels. The proposed nighttime deraining network can
effectively remove the rain streaks as well as preserve the crucial background
information. Furthermore, we construct a night scene rainy (NSR) dataset
comprising 900 image pairs, all based on real-world nighttime scenes, providing
a new benchmark for nighttime deraining research. Extensive qualitative and
quantitative evaluations on both existing datasets and the NSR dataset
consistently demonstrate that our method outperforms state-of-the-art (SOTA)
methods in nighttime deraining tasks. The source code and dataset are
available at https://github.com/Feecuin/NDLPNet.
☆ Task-Aware Image Signal Processor for Advanced Visual Perception
In recent years, there has been a growing trend in computer vision towards
exploiting RAW sensor data, which preserves richer information compared to
conventional low-bit RGB images. Early studies mainly focused on enhancing
visual quality, while more recent efforts aim to leverage the abundant
information in RAW data to improve the performance of visual perception tasks
such as object detection and segmentation. However, existing approaches still
face two key limitations: large-scale ISP networks impose heavy computational
overhead, while methods based on tuning traditional ISP pipelines are
restricted by limited representational capacity. To address these issues, we
propose Task-Aware Image Signal Processing (TA-ISP), a compact RAW-to-RGB
framework that produces task-oriented representations for pretrained vision
models. Instead of heavy dense convolutional pipelines, TA-ISP predicts a small
set of lightweight, multi-scale modulation operators that act at global,
regional, and pixel scales to reshape image statistics across different spatial
extents. This factorized control significantly expands the range of spatially
varying transforms that can be represented while keeping memory usage,
computation, and latency tightly constrained. Evaluated on several RAW-domain
detection and segmentation benchmarks under both daytime and nighttime
conditions, TA-ISP consistently improves downstream accuracy while markedly
reducing parameter count and inference time, making it well suited for
deployment on resource-constrained devices.
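A minimal sketch of multi-scale modulation in this spirit is shown below: a
global per-channel gain/offset, a coarse regional gain map upsampled to full
resolution, and a pixel-wise residual. The operator shapes and composition
order are assumptions for illustration, not the paper's exact design.

```python
# Sketch of lightweight multi-scale modulation of a RAW-derived image.
import torch
import torch.nn.functional as F

def modulate(img, g_gain, g_off, region_gain, pixel_res):
    """img: (B, C, H, W); g_gain/g_off: (B, C, 1, 1);
    region_gain: (B, C, h, w) coarse grid; pixel_res: (B, C, H, W)."""
    region = F.interpolate(region_gain, size=img.shape[-2:],
                           mode='bilinear', align_corners=False)
    return (img * g_gain + g_off) * region + pixel_res

img = torch.rand(1, 3, 256, 256)
out = modulate(img, torch.ones(1, 3, 1, 1), torch.zeros(1, 3, 1, 1),
               torch.ones(1, 3, 8, 8), torch.zeros(1, 3, 256, 256))
```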
☆ Iterative Prompt Refinement for Safer Text-to-Image Generation
Text-to-Image (T2I) models have made remarkable progress in generating images
from text prompts, but their output quality and safety still depend heavily on
how prompts are phrased. Existing safety methods typically refine prompts using
large language models (LLMs), but they overlook the images produced, which can
result in unsafe outputs or unnecessary changes to already safe prompts. To
address this, we propose an iterative prompt refinement algorithm that uses
Vision Language Models (VLMs) to analyze both the input prompts and the
generated images. By leveraging visual feedback, our method refines prompts
more effectively, improving safety while maintaining user intent and
reliability comparable to existing LLM-based approaches. Additionally, we
introduce a new dataset labeled with both textual and visual safety signals
using an off-the-shelf multi-modal LLM, enabling supervised fine-tuning.
Experimental results demonstrate that our approach produces safer outputs
without compromising alignment with user intent, offering a practical solution
for generating safer T2I content. Our code is available at
https://github.com/ku-dmlab/IPR. WARNING: This paper
contains examples of harmful or inappropriate images generated by models.
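The refinement loop itself is simple; a hedged sketch is shown below, where
generate_image, vlm_is_safe, and vlm_rewrite are hypothetical stand-ins for
the T2I model and VLM calls rather than functions from the released code.

```python
# Sketch of the iterative refinement loop: generate an image, let a VLM judge
# both prompt and image, and rewrite the prompt until it is judged safe or a
# budget is exhausted. The callables are hypothetical stand-ins.
def refine_prompt(prompt, generate_image, vlm_is_safe, vlm_rewrite, max_iters=3):
    for _ in range(max_iters):
        image = generate_image(prompt)
        if vlm_is_safe(prompt, image):
            return prompt, image             # safe prompt left unchanged
        prompt = vlm_rewrite(prompt, image)  # visual feedback guides the rewrite
    return prompt, generate_image(prompt)
```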
☆ Controllable-Continuous Color Editing in Diffusion Model via Color Mapping
In recent years, text-driven image editing has made significant progress.
However, due to the inherent ambiguity and discreteness of natural language,
color editing still faces challenges such as insufficient precision and
difficulty in achieving continuous control. Although linearly interpolating the
embedding vectors of different textual descriptions can guide the model to
generate a sequence of images with varying colors, this approach lacks precise
control over the range of color changes in the output images. Moreover, the
relationship between the interpolation coefficient and the resulting image
color is unknown and uncontrollable. To address these issues, we introduce a
color mapping module that explicitly models the correspondence between the text
embedding space and image RGB values. This module predicts the corresponding
embedding vector based on a given RGB value, enabling precise color control of
the generated images while maintaining semantic consistency. Users can specify
a target RGB range to generate images with continuous color variations within
the desired range, thereby achieving finer-grained, continuous, and
controllable color editing. Experimental results demonstrate that our method
performs well in terms of color continuity and controllability.
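A minimal sketch of such a color mapping module is given below: a small MLP
maps a target RGB value to a text-embedding-space vector, so sweeping the RGB
input yields a continuous sequence of conditioning vectors. Dimensions and
architecture are assumptions, not the paper's exact design.

```python
# Minimal sketch of a color mapping module: MLP from RGB to a text-embedding-
# space conditioning vector, enabling continuous color control.
import torch
import torch.nn as nn

class ColorMapper(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                 nn.Linear(128, embed_dim))

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # rgb in [0, 1], shape (B, 3) -> conditioning vector, shape (B, embed_dim)
        return self.mlp(rgb)

mapper = ColorMapper()
reds = torch.linspace(0.2, 1.0, 5).unsqueeze(1)            # vary the red channel
rgb_targets = torch.cat([reds, torch.zeros(5, 2)], dim=1)  # (5, 3)
conditions = mapper(rgb_targets)                           # fed to the diffusion model
```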
☆ Cross-modal Full-mode Fine-grained Alignment for Text-to-Image Person Retrieval
Text-to-Image Person Retrieval (TIPR) is a cross-modal matching task that
aims to retrieve the most relevant person images based on a given text query.
The key challenge in TIPR lies in achieving effective alignment between textual
and visual modalities within a common latent space. To address this challenge,
prior approaches incorporate attention mechanisms for implicit cross-modal
local alignment. However, they lack the ability to verify whether all local
features are correctly aligned. Moreover, existing methods primarily focus on
hard negative samples during model updates, with the goal of refining
distinctions between positive and negative pairs, often neglecting incorrectly
matched positive pairs. To alleviate these issues, we propose FMFA, a
cross-modal Full-Mode Fine-grained Alignment framework, which enhances global
matching through explicit fine-grained alignment and existing implicit
relational reasoning (hence the term "full-mode") without requiring
additional supervision. Specifically, we design an Adaptive Similarity
Distribution Matching (A-SDM) module to rectify unmatched positive sample
pairs. A-SDM adaptively pulls the unmatched positive pairs closer in the joint
embedding space, thereby achieving more precise global alignment. Additionally,
we introduce an Explicit Fine-grained Alignment (EFA) module, which makes up
for the lack of verification capability of implicit relational reasoning. EFA
strengthens explicit cross-modal fine-grained interactions by sparsifying the
similarity matrix and employs a hard coding method for local alignment. Our
proposed method is evaluated on three public datasets, achieving
state-of-the-art performance among all global matching methods. Our code is
available at https://github.com/yinhao1102/FMFA.
☆ Improving Generalized Visual Grounding with Instance-aware Joint Learning IEEE
Generalized visual grounding tasks, including Generalized Referring
Expression Comprehension (GREC) and Segmentation (GRES), extend the classical
visual grounding paradigm by accommodating multi-target and non-target
scenarios. Specifically, GREC focuses on accurately identifying all referential
objects at the coarse bounding box level, while GRES aims to achieve
fine-grained pixel-level perception. However, existing approaches typically
treat these tasks independently, overlooking the benefits of jointly training
GREC and GRES to ensure consistent multi-granularity predictions and streamline
the overall process. Moreover, current methods often treat GRES as a semantic
segmentation task, neglecting the crucial role of instance-aware capabilities
and the necessity of ensuring consistent predictions between instance-level
boxes and masks. To address these limitations, we propose InstanceVG, a
multi-task generalized visual grounding framework equipped with instance-aware
capabilities, which leverages instance queries to unify the joint and
consistency predictions of instance-level boxes and masks. To the best of our
knowledge, InstanceVG is the first framework to simultaneously tackle both GREC
and GRES while incorporating instance-aware capabilities into generalized
visual grounding. To instantiate the framework, we assign each instance query a
prior reference point, which also serves as an additional basis for target
matching. This design facilitates consistent predictions of points, boxes, and
masks for the same instance. Extensive experiments on ten datasets
across four tasks demonstrate that InstanceVG achieves state-of-the-art
performance, significantly surpassing the existing methods in various
evaluation metrics. The code and model will be publicly available at
https://github.com/Dmmm1997/InstanceVG.
comment: Accepted by IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI) in September 2025
☆ Mitigating Query Selection Bias in Referring Video Object Segmentation
Recently, query-based methods have achieved remarkable performance in
Referring Video Object Segmentation (RVOS) by using textual static object
queries to drive cross-modal alignment. However, these static queries are
easily misled by distractors with similar appearance or motion, resulting in
\emph{query selection bias}. To address this issue, we propose Triple Query
Former (TQF), which factorizes the referring query into three specialized
components: an appearance query for static attributes, an intra-frame
interaction query for spatial relations, and an inter-frame motion query for
temporal association. Instead of relying solely on textual embeddings, our
queries are dynamically constructed by integrating both linguistic cues and
visual guidance. Furthermore, we introduce two motion-aware aggregation modules
that enhance object token representations: Intra-frame Interaction Aggregation
incorporates position-aware interactions among objects within a single frame,
while Inter-frame Motion Aggregation leverages trajectory-guided alignment
across frames to ensure temporal coherence. Extensive experiments on multiple
RVOS benchmarks demonstrate the advantages of TQF and the effectiveness of our
structured query design and motion-aware aggregation modules.
☆ UM-Depth : Uncertainty Masked Self-Supervised Monocular Depth Estimation with Visual Odometry
Monocular depth estimation has been increasingly adopted in robotics and
autonomous driving for its ability to infer scene geometry from a single
camera. In self-supervised monocular depth estimation frameworks, the network
jointly generates and exploits depth and pose estimates during training,
thereby eliminating the need for depth labels. However, these methods remain
challenged by uncertainty in the input data, such as low-texture or dynamic
regions, which can cause reduced depth accuracy. To address this, we introduce
UM-Depth, a framework that combines motion- and uncertainty-aware refinement to
enhance depth accuracy at dynamic object boundaries and in textureless regions.
Specifically, we develop a teacher-student training strategy that embeds
uncertainty estimation into both the training pipeline and network
architecture, thereby strengthening supervision where photometric signals are
weak. Unlike prior motion-aware approaches that incur inference-time overhead
and rely on additional labels or auxiliary networks for real-time generation,
our method uses optical flow exclusively within the teacher network during
training, eliminating extra labeling demands and any runtime cost.
Extensive experiments on the KITTI and Cityscapes datasets demonstrate the
effectiveness of our uncertainty-aware refinement. Overall, UM-Depth achieves
state-of-the-art results in both self-supervised depth and pose estimation on
the KITTI dataset.
☆ StyleProtect: Safeguarding Artistic Identity in Fine-tuned Diffusion Models
The rapid advancement of generative models, particularly diffusion-based
approaches, has inadvertently facilitated their potential for misuse. Such
models enable malicious exploiters to replicate artistic styles that capture an
artist's creative labor, personal vision, and years of dedication in an
inexpensive manner. This has led to a rise in the need and exploration of
methods for protecting artworks against style mimicry. Although generic
diffusion models can easily mimic an artistic style, finetuning amplifies this
capability, enabling the model to internalize and reproduce the style with
higher fidelity and control. We hypothesize that certain cross-attention layers
exhibit heightened sensitivity to artistic styles. Sensitivity is measured
through activation strengths of attention layers in response to style and
content representations, and by assessing their correlations with features
extracted from external models. Based on our findings, we introduce an
efficient and lightweight protection strategy, StyleProtect, that achieves
effective style defense against fine-tuned diffusion models by updating only
selected cross-attention layers. Our experiments utilize a carefully curated
artwork dataset based on WikiArt, comprising representative works from 30
artists known for their distinctive and influential styles and cartoon
animations from the Anita dataset. The proposed method demonstrates promising
performance in safeguarding unique styles of artworks and anime from malicious
diffusion customization, while maintaining competitive imperceptibility.
☆ Taylor-Series Expanded Kolmogorov-Arnold Network for Medical Imaging Classification
Effective and interpretable classification of medical images is a challenge
in computer-aided diagnosis, especially in resource-limited clinical settings.
This study introduces spline-based Kolmogorov-Arnold Networks (KANs) for
accurate medical image classification with limited, diverse datasets. The
models include SBTAYLOR-KAN, integrating B-splines with Taylor series;
SBRBF-KAN, combining B-splines with Radial Basis Functions; and SBWAVELET-KAN,
embedding B-splines in Morlet wavelet transforms. These approaches leverage
spline-based function approximation to capture both local and global
nonlinearities. The models were evaluated on brain MRI, chest X-rays,
tuberculosis X-rays, and skin lesion images without preprocessing,
demonstrating the ability to learn directly from raw data. Extensive
experiments, including cross-dataset validation and data reduction analysis,
showed strong generalization and stability. SBTAYLOR-KAN achieved up to 98.93%
accuracy, with a balanced F1-score, maintaining over 86% accuracy using only
30% of the training data across three datasets. Despite class imbalance in the
skin cancer dataset, experiments on both imbalanced and balanced versions
showed SBTAYLOR-KAN outperforming other models, achieving 68.22% accuracy.
Unlike traditional CNNs, which require millions of parameters (e.g., ResNet50
with 24.18M), SBTAYLOR-KAN achieves comparable performance with just 2,872
trainable parameters, making it more suitable for constrained medical
environments. Gradient-weighted Class Activation Mapping (Grad-CAM) was used
for interpretability, highlighting relevant regions in medical images. This
framework provides a lightweight, interpretable, and generalizable solution for
medical image classification, addressing the challenges of limited datasets and
data-scarce scenarios in clinical AI applications.
☆ FishBEV: Distortion-Resilient Bird's Eye View Segmentation with Surround-View Fisheye Cameras
As a cornerstone technique for autonomous driving, Bird's Eye View (BEV)
segmentation has recently achieved remarkable progress with pinhole cameras.
However, it is non-trivial to extend the existing methods to fisheye cameras
with severe geometric distortion, ambiguous multi-view correspondences and
unstable temporal dynamics, all of which significantly degrade BEV performance.
To address these challenges, we propose FishBEV, a novel BEV segmentation
framework specifically tailored for fisheye cameras. This framework introduces
three complementary innovations, including a Distortion-Resilient Multi-scale
Extraction (DRME) backbone that learns robust features under distortion while
preserving scale consistency, an Uncertainty-aware Spatial Cross-Attention
(U-SCA) mechanism that leverages uncertainty estimation for reliable cross-view
alignment, and a Distance-aware Temporal Self-Attention (D-TSA) module that
adaptively balances near-field details and far-field context to ensure temporal
coherence. Extensive experiments on the Synwoodscapes dataset demonstrate that
FishBEV consistently outperforms SOTA baselines on surround-view fisheye BEV
segmentation tasks.
comment: 8 pages, 4 figures
☆ Re-purposing SAM into Efficient Visual Projectors for MLLM-Based Referring Image Segmentation
Recently, Referring Image Segmentation (RIS) frameworks that pair the
Multimodal Large Language Model (MLLM) with the Segment Anything Model (SAM)
have achieved impressive results. However, adapting MLLM to segmentation is
computationally intensive, primarily due to visual token redundancy. We observe
that traditional patch-wise visual projectors struggle to strike a balance
between reducing the number of visual tokens and preserving semantic clarity,
often retaining overly long token sequences to avoid performance drops.
Inspired by text tokenizers, we propose a novel semantic visual projector that
leverages semantic superpixels generated by SAM to identify "visual words" in
an image. By compressing and projecting semantic superpixels as visual tokens,
our approach adaptively shortens the token sequence according to scene
complexity while minimizing semantic loss in compression. To mitigate loss of
information, we propose a semantic superpixel positional embedding to
strengthen MLLM's awareness of superpixel geometry and position, alongside a
semantic superpixel aggregator to preserve both fine-grained details inside
superpixels and global context outside. Experiments show that our method cuts
visual tokens by 93% without compromising performance, notably speeding up MLLM
training and inference, and outperforming existing compressive visual
projectors on RIS.
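The core compression step can be pictured as pooling patch tokens by
superpixel membership; a minimal sketch under assumed shapes is given below
(the positional embedding and aggregator modules are omitted).

```python
# Sketch of compressing patch tokens into superpixel tokens: average all patch
# embeddings that fall inside the same SAM superpixel, so the token count
# follows scene complexity. Index handling is illustrative.
import torch

def pool_by_superpixel(tokens: torch.Tensor, sp_ids: torch.Tensor) -> torch.Tensor:
    """tokens: (N, D) patch embeddings; sp_ids: (N,) superpixel id per patch."""
    num_sp = int(sp_ids.max().item()) + 1
    sums = torch.zeros(num_sp, tokens.shape[1]).index_add_(0, sp_ids, tokens)
    counts = torch.zeros(num_sp).index_add_(0, sp_ids, torch.ones(len(sp_ids)))
    return sums / counts.clamp(min=1).unsqueeze(1)   # (num_superpixels, D)

tokens = torch.randn(1024, 256)                      # 32x32 patch grid
sp_ids = torch.randint(0, 70, (1024,))               # ~70 semantic superpixels
visual_words = pool_by_superpixel(tokens, sp_ids)    # far fewer tokens than 1024
```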
☆ Deep Lookup Network
Convolutional neural networks are built from massive numbers of operations of
different types and are highly computationally intensive. Among these
operations, multiplication has higher computational complexity and usually
requires more energy and longer inference time than other operations, which
hinders the deployment of convolutional neural networks
on mobile devices. In many resource-limited edge devices, complicated
operations can be calculated via lookup tables to reduce computational cost.
Motivated by this, in this paper, we introduce a generic and efficient lookup
operation which can be used as a basic operation for the construction of neural
networks. Instead of calculating the multiplication of weights and activation
values, simple yet efficient lookup operations are adopted to compute their
responses. To enable end-to-end optimization of the lookup operation, we
construct the lookup tables in a differentiable manner and propose several
training strategies to promote their convergence. By replacing computationally
expensive multiplication operations with our lookup operations, we develop
lookup networks for the image classification, image super-resolution, and point
cloud classification tasks. It is demonstrated that our lookup networks can
benefit from the lookup operations to achieve higher efficiency in terms of
energy consumption and inference speed while maintaining competitive
performance to vanilla convolutional networks. Extensive experiments show that
our lookup networks produce state-of-the-art performance on different tasks
(both classification and regression tasks) and different data types (both
images and point clouds).
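A toy NumPy sketch of the lookup idea is shown below: weights and activations
are quantized to a small set of levels and all pairwise products are
precomputed once, so a dot product reduces to table indexing and summation.
The quantization scheme and table size are illustrative, and the
differentiable training strategies from the paper are not shown.

```python
# Toy sketch of replacing multiply-accumulate with table lookups.
import numpy as np

levels = np.linspace(-1.0, 1.0, 16)             # 4-bit quantization levels
table = np.outer(levels, levels)                # 16 x 16 precomputed products

def quantize(x):
    return np.abs(x[..., None] - levels).argmin(axis=-1)   # nearest-level index

def lookup_dot(w, a):
    """Dot product of a weight vector and activation vector via table lookups."""
    wi, ai = quantize(w), quantize(a)
    return table[wi, ai].sum()

w, a = np.random.randn(64) * 0.5, np.random.rand(64)
print(lookup_dot(w, a), "approx of", float(w @ a))
```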
☆ Gaussian Alignment for Relative Camera Pose Estimation via Single-View Reconstruction
Estimating metric relative camera pose from a pair of images is of great
importance for 3D reconstruction and localisation. However, conventional
two-view pose estimation methods are not metric, with camera translation known
only up to a scale, and struggle with wide baselines and textureless or
reflective surfaces. This paper introduces GARPS, a training-free framework
that casts this problem as the direct alignment of two independently
reconstructed 3D scenes. GARPS leverages a metric monocular depth estimator and
a Gaussian scene reconstructor to obtain a metric 3D Gaussian Mixture Model
(GMM) for each image. It then refines an initial pose from a feed-forward
two-view pose estimator by optimising a differentiable GMM alignment objective.
This objective jointly considers geometric structure, view-independent colour,
anisotropic covariance, and semantic feature consistency, and is robust to
occlusions and texture-poor regions without requiring explicit 2D
correspondences. Extensive experiments on the RealEstate10K dataset
demonstrate that GARPS outperforms both classical and state-of-the-art
learning-based methods, including MASt3R. These results highlight the potential
of bridging single-view perception with multi-view geometry to achieve robust
and metric relative pose estimation.
comment: 12 pages, 4 figures, accepted by AJCAI 2025
☆ LLM-I: LLMs are Naturally Interleaved Multimodal Creators
We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that
reframes interleaved image-text generation as a tool-use problem. LLM-I is
designed to overcome the "one-tool" bottleneck of current unified models, which
are limited to synthetic imagery and struggle with tasks requiring factual
grounding or programmatic precision. Our framework empowers a central LLM or
MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual
tools, including online image search, diffusion-based generation, code
execution, and image editing. The agent is trained to select and apply these
tools proficiently via a Reinforcement Learning (RL) framework that features a
hybrid reward system combining rule-based logic with judgments from LLM and
MLLM evaluators. Trained on a diverse new dataset using four different model
backbones, LLM-I demonstrates state-of-the-art performance, outperforming
existing methods by a large margin across four benchmarks. We also introduce a
novel test-time scaling strategy that provides further performance gains.
Project Page: https://github.com/ByteDance-BandAI/LLM-I.
☆ Federated Learning for Deforestation Detection: A Distributed Approach with Satellite Imagery IEEE
Accurate identification of deforestation from satellite images is essential
in order to understand the geographical situation of an area. This paper
introduces a new distributed approach to identify as well as locate
deforestation across different clients using Federated Learning (FL). Federated
Learning enables distributed network clients to collaboratively train a model
while maintaining data privacy and security of the active users. In our
framework, a client corresponds to an edge satellite center responsible for
local data processing. Moreover, FL provides an advantage over centralized
training, which requires pooling data and thereby compromises the data
security of the clients. Our framework leverages the Flower framework together
with Ray to execute the distributed learning workload. Furthermore, Ray ensures
efficient client spawning, as it can instantiate a specified number of clients
to create an emulation environment. Our FL framework uses YOLOS-small (a Vision
Transformer variant), Faster R-CNN with a ResNet50 backbone, and Faster R-CNN
with a MobileNetV3 backbone, trained and tested on publicly available
datasets. Our approach offers a different perspective on image
segmentation-based tasks on satellite imagery.
comment: 6 pages, 7 figures, accepted at IEEE INDISCON 2025
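For orientation, a minimal Flower + Ray simulation sketch with a toy NumPy
"model" in place of the detection networks is shown below; the simulation API
follows the Flower 1.x NumPyClient interface and may differ in other releases.

```python
# Minimal Flower simulation sketch: each client stands in for an edge
# satellite center. Requires flwr[simulation] so that Ray backs the emulation.
import flwr as fl
import numpy as np

class EdgeCenterClient(fl.client.NumPyClient):
    def __init__(self, cid: str):
        self.cid = cid
        self.weights = [np.zeros(10, dtype=np.float32)]   # stand-in parameters

    def get_parameters(self, config):
        return self.weights

    def fit(self, parameters, config):
        # Pretend local training: nudge weights with client-specific "data".
        update = [p + np.random.normal(0, 0.1, p.shape).astype(np.float32)
                  for p in parameters]
        return update, 100, {}

    def evaluate(self, parameters, config):
        loss = float(np.mean([np.square(p).mean() for p in parameters]))
        return loss, 100, {"loss": loss}

def client_fn(cid: str):
    # Some Flower versions accept the NumPyClient directly instead of .to_client().
    return EdgeCenterClient(cid).to_client()

fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=3,                                   # Ray spawns emulated clients
    config=fl.server.ServerConfig(num_rounds=2),
    strategy=fl.server.strategy.FedAvg(),
)
```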
☆ SAMIR, an efficient registration framework via robust feature learning from SAM
Image registration is a fundamental task in medical image analysis.
Deformations are often closely related to the morphological characteristics of
tissues, making accurate feature extraction crucial. Recent weakly supervised
methods improve registration by incorporating anatomical priors such as
segmentation masks or landmarks, either as inputs or in the loss function.
However, such weak labels are often not readily available, limiting their
practical use. Motivated by the strong representation learning ability of
visual foundation models, this paper introduces SAMIR, an efficient medical
image registration framework that utilizes the Segment Anything Model (SAM) to
enhance feature extraction. SAM is pretrained on large-scale natural image
datasets and can learn robust, general-purpose visual representations. Rather
than using raw input images, we design a task-specific adaptation pipeline
using SAM's image encoder to extract structure-aware feature embeddings,
enabling more accurate modeling of anatomical consistency and deformation
patterns. We further design a lightweight 3D head to refine features within the
embedding space, adapting to local deformations in medical images.
Additionally, we introduce a Hierarchical Feature Consistency Loss to guide
coarse-to-fine feature matching and improve anatomical alignment. Extensive
experiments demonstrate that SAMIR significantly outperforms state-of-the-art
methods on benchmark datasets for both intra-subject cardiac image registration
and inter-subject abdomen CT image registration, achieving performance
improvements of 2.68% on ACDC and 6.44% on the abdomen dataset. The source code
will be publicly available on GitHub following the acceptance of this paper.
☆ Rest2Visual: Predicting Visually Evoked fMRI from Resting-State Scans
Understanding how spontaneous brain activity relates to stimulus-driven
neural responses is a fundamental challenge in cognitive neuroscience. While
task-based functional magnetic resonance imaging (fMRI) captures localized
stimulus-evoked brain activation, its acquisition is costly, time-consuming,
and difficult to scale across populations. In contrast, resting-state fMRI
(rs-fMRI) is task-free and abundant, but lacks direct interpretability. We
introduce Rest2Visual, a conditional generative model that predicts visually
evoked fMRI (ve-fMRI) from resting-state input and 2D visual stimuli. It
follows a volumetric encoder--decoder design, where multiscale 3D features from
rs-fMRI are modulated by image embeddings via adaptive normalization, enabling
spatially accurate, stimulus-specific activation synthesis. To enable model
training, we construct a large-scale triplet dataset from the Natural Scenes
Dataset (NSD), aligning each rs-fMRI volume with stimulus images and their
corresponding ve-fMRI activation maps. Quantitative evaluation shows that the
predicted activations closely match ground truth across standard similarity and
representational metrics, and support successful image reconstruction in
downstream decoding. Notably, the predicted maps preserve subject-specific
structure, demonstrating the model's capacity to generate individualized
functional surrogates. Our results provide compelling evidence that
individualized spontaneous neural activity can be transformed into
stimulus-aligned representations, opening new avenues for scalable, task-free
functional brain modeling.
☆ A Generalization of CLAP from 3D Localization to Image Processing, A Connection With RANSAC & Hough Transforms
In previous work, we introduced a 2D localization algorithm called CLAP,
Clustering to Localize Across $n$ Possibilities, which was used during our
championship win in RoboCup 2024, an international autonomous humanoid soccer
competition. CLAP is particularly recognized for its robustness against
outliers, where clustering is employed to suppress noise and mitigate
erroneous feature matches. This clustering-based strategy provides an
alternative to traditional outlier rejection schemes such as RANSAC, in which
candidates are validated by reprojection error across all data points. In this
paper, CLAP is extended to a more general framework beyond 2D localization,
specifically to 3D localization and image stitching. We also show how CLAP,
RANSAC, and Hough transforms are related. The generalization of CLAP is widely
applicable to many different fields and can be a useful tool to deal with noise
and uncertainty.
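A minimal sketch of the clustering-for-consensus idea (as opposed to RANSAC's
hypothesize-and-verify) is given below: each feature match votes for a
candidate pose, the votes are clustered, and the densest cluster is averaged.
DBSCAN and its parameters are illustrative choices, not the CLAP
implementation.

```python
# Sketch of clustering-based outlier rejection over per-match pose hypotheses.
import numpy as np
from sklearn.cluster import DBSCAN

def consensus_from_candidates(candidates: np.ndarray) -> np.ndarray:
    """candidates: (N, d) pose hypotheses (e.g., [x, y, theta]), one per match."""
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(candidates)
    valid = labels[labels >= 0]
    if valid.size == 0:
        return candidates.mean(axis=0)                 # fallback: no dense cluster
    best = np.bincount(valid).argmax()                 # largest cluster wins
    return candidates[labels == best].mean(axis=0)     # average the inliers

inliers = np.random.normal([2.0, 1.0, 0.3], 0.05, size=(40, 3))
outliers = np.random.uniform(-5, 5, size=(20, 3))
pose = consensus_from_candidates(np.vstack([inliers, outliers]))
```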
♻ ☆ Effort-Optimized, Accuracy-Driven Labelling and Validation of Test Inputs for DL Systems: A Mixed-Integer Linear Programming Approach
Software systems increasingly include AI components based on deep learning
(DL). Reliable testing of such systems requires near-perfect test-input
validity and label accuracy, with minimal human effort. Yet, the DL community
has largely overlooked the need to build highly accurate datasets with minimal
effort, since DL training is generally tolerant of labelling errors. This
challenge, instead, reflects concerns more familiar to software engineering,
where a central goal is to construct high-accuracy test inputs, with accuracy
as close to 100% as possible, while keeping associated costs in check. In this
article we introduce OPAL, a human-assisted labelling method that can be
configured to target a desired accuracy level while minimizing the manual
effort required for labelling. The main contribution of OPAL is a mixed-integer
linear programming (MILP) formulation that minimizes labelling effort subject
to a specified accuracy target. To evaluate OPAL we instantiate it for two
tasks in the context of testing vision systems: automatic labelling of test
inputs and automated validation of test inputs. Our evaluation, based on more
than 2500 experiments performed on seven datasets and comparing OPAL with eight
baseline methods, shows that OPAL, relying on its MILP formulation, achieves an
average accuracy of 98.8%, while cutting manual labelling by more than half.
OPAL significantly outperforms automated labelling baselines in labelling
accuracy across all seven datasets, when all methods are provided with the same
manual-labelling budget. For automated test-input validation, on average, OPAL
reduces manual effort by 28.8% while achieving 4.5% higher accuracy than the
SOTA test-input validation baselines. Finally, we show that augmenting OPAL
with an active-learning loop leads to an additional 4.5% reduction in required
manual labelling, without compromising accuracy.
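For intuition only, the sketch below sets up an OPAL-like MILP in PuLP: binary
variables select which inputs are labelled manually, the objective minimizes
manual effort, and a linear constraint keeps the expected dataset accuracy
above a target. This is an illustrative formulation, not the paper's actual
MILP.

```python
# Illustrative MILP: choose which inputs to label manually (binary x_i) so
# that expected accuracy meets a target with minimal manual effort. c_i is
# the estimated probability that the automated label of input i is correct.
import numpy as np
import pulp

rng = np.random.default_rng(0)
c = rng.uniform(0.7, 1.0, size=200)            # automated-label confidences
target = 0.99                                  # desired dataset accuracy

prob = pulp.LpProblem("opal_like_labelling", pulp.LpMinimize)
x = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(len(c))]

prob += pulp.lpSum(x)                          # minimize manual labelling effort
# Expected accuracy: manual labels count as correct, others as c_i.
prob += pulp.lpSum(x[i] + (1 - x[i]) * c[i] for i in range(len(c))) >= target * len(c)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("manual labels needed:", int(sum(pulp.value(v) for v in x)))
```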
♻ ☆ Well-Conditioned Polynomial Representations for Mathematical Handwriting Recognition
Previous work has made use of a parameterized plane curve polynomial
representation for mathematical handwriting, with the polynomials represented
in a Legendre or Legendre-Sobolev graded basis. This provides a compact
geometric representation for the digital ink. Preliminary results have also
been shown for Chebyshev and Chebyshev-Sobolev bases. This article explores the
trade-offs between basis choice and polynomial degree to achieve accurate
modeling with a low computational cost. To do this, we consider the condition
number for polynomial evaluation in these bases and bound how the various inner
products give norms for the variations between symbols.
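A quick numerical illustration of why basis choice matters for conditioning is
sketched below: it compares condition numbers of least-squares design matrices
in the monomial, Legendre, and Chebyshev bases at a modest degree
(Sobolev-weighted variants are not shown).

```python
# Condition numbers of polynomial design matrices in different bases for a
# normalized arc-length parameter, as a rough proxy for numerical stability.
import numpy as np

t = np.linspace(-1.0, 1.0, 200)     # normalized parameter along the ink trace
deg = 12                            # typical modest polynomial degree

V_mono = np.vander(t, deg + 1, increasing=True)
V_leg = np.polynomial.legendre.legvander(t, deg)
V_cheb = np.polynomial.chebyshev.chebvander(t, deg)

for name, V in [("monomial", V_mono), ("Legendre", V_leg), ("Chebyshev", V_cheb)]:
    print(f"{name:9s} condition number: {np.linalg.cond(V):.2e}")
```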
♻ ☆ Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, Hao Li
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large
Language Models (LLMs) by decomposing complex tasks into simpler, sequential
subtasks. However, extending CoT to vision-language reasoning tasks remains
challenging, as it often requires interpreting transitions of visual states to
support reasoning. Existing methods often struggle with this due to limited
capacity of modeling visual state transitions or incoherent visual trajectories
caused by fragmented architectures.
To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought
framework that enables coherent and grounded multimodal reasoning within a
single unified model. The key idea is to leverage a model capable of both image
understanding and generation to reason over visual content and model evolving
visual states. However, empowering a unified model to achieve that is
non-trivial, given the high computational cost and the burden of training. To
address this, Uni-CoT introduces a novel two-level reasoning paradigm: A
Macro-Level CoT for high-level task planning and A Micro-Level CoT for subtask
execution. This design significantly reduces the computational overhead.
Furthermore, we introduce a structured training paradigm that combines
interleaved image-text supervision for macro-level CoT with multi-task
objectives for micro-level CoT. Together, these innovations allow Uni-CoT to
perform scalable and coherent multi-modal reasoning. Furthermore, thanks to our
design, all experiments can be efficiently completed using only 8 A100 GPUs
with 80GB VRAM each. Experimental results on the reasoning-driven image
generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicate that Uni-CoT
demonstrates SOTA performance and strong generalization, establishing Uni-CoT
as a promising solution for multi-modal reasoning. Project Page and Code:
https://sais-fuxi.github.io/projects/uni-cot/
comment: Project Page: https://sais-fuxi.github.io/projects/uni-cot/
♻ ☆ Synthesis and Perceptual Scaling of High Resolution Naturalistic Images Using Stable Diffusion
Naturalistic scenes are of key interest for visual perception, but
controlling their perceptual and semantic properties is challenging. Previous
work on naturalistic scenes has frequently focused on collections of discrete
images with considerable physical differences between stimuli. However, it is
often desirable to assess representations of naturalistic images that vary
along a continuum. Traditionally, perceptually continuous variations of
naturalistic stimuli have been obtained by morphing a source image into a
target image. This produces transitions driven mainly by low-level physical
features and can result in semantically ambiguous outcomes. More recently,
generative adversarial networks (GANs) have been used to generate continuous
perceptual variations within a stimulus category. Here we extend and generalize
this approach using a different machine learning approach, a text-to-image
diffusion model (Stable Diffusion XL), to generate a freely customizable
stimulus set of photorealistic images that are characterized by gradual
transitions, with each image representing a unique exemplar within a prompted
category. We demonstrate the approach by generating a set of 108 object scenes
from 6 categories. For each object scene, we generate 10 variants that are
ordered along a perceptual continuum. This ordering was first estimated using a
machine learning model of perceptual similarity (LPIPS) and then subsequently
validated with a large online sample of human participants. In a subsequent
experiment we show that this ordering is also predictive of confusability of
stimuli in a working memory experiment. Our image set is suited for studies
investigating the graded encoding of naturalistic stimuli in visual perception,
attention, and memory.
comment: 80 pages, 26 Figures, 6 tables
♻ ☆ A Deep Learning Pipeline for Solid Waste Detection in Remote Sensing Images
Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Thomas Martinoli, Andrea Diecidue, Simona Malegori
Improper solid waste management represents both a serious threat to ecosystem
health and a significant source of revenues for criminal organizations
perpetrating environmental crimes. This issue can be mitigated thanks to the
increasing availability of Very-High-Resolution Remote Sensing (VHR RS) images.
Modern image-analysis tools support automated photo-interpretation and large
territory scanning in search of illegal waste disposal sites. This paper
illustrates a semi-automatic waste detection pipeline, developed in
collaboration with a regional environmental protection agency, for detecting
candidate illegal dumping sites in VHR RS images. To optimize the effectiveness
of the waste detector at the core of the pipeline, extensive experiments
evaluate such design choices as the network architecture, the ground resolution
and geographic span of the input images, as well as the pretraining procedures.
The best model attains remarkable performance, achieving 92.02 % F1-Score and
94.56 % Accuracy. A generalization study assesses the performance variation
when the detector processes images from various territories substantially
different from the one used during training, incurring only a moderate
performance loss, namely an average 5.1 % decrease in the F1-Score. Finally, an
exercise in which expert photo-interpreters compare the effort required to scan
large territories with and without support from the waste detector assesses the
practical benefit of introducing a computer-aided image analysis tool in a
professional environmental protection agency. Results show that a reduction of
up to 30 % of the time spent for waste site detection can be attained.
♻ ☆ StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance SIGGRAPH
Creating 3D assets that follow the texture and geometry style of existing
ones is often desirable or even inevitable in practical applications like video
gaming and virtual reality. While impressive progress has been made in
generating 3D objects from text or images, creating style-controllable 3D
assets remains a complex and challenging problem. In this work, we propose
StyleSculptor, a novel training-free approach for generating style-guided 3D
assets from a content image and one or more style images. Unlike previous
works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner,
enabling fine-grained 3D style control that captures the texture, geometry, or
both styles of user-provided style images. At the core of StyleSculptor is a
novel Style Disentangled Attention (SD-Attn) module, which establishes a
dynamic interaction between the input content image and style image for
style-guided 3D asset generation via a cross-3D attention mechanism, enabling
stable feature fusion and effective style-guided generation. To alleviate
semantic content leakage, we also introduce a style-disentangled feature
selection strategy within the SD-Attn module, which leverages the variance of
3D feature patches to disentangle style- and content-significant channels,
allowing selective feature injection within the attention framework. With
SD-Attn, the network can dynamically compute texture-, geometry-, or
both-guided features to steer the 3D generation process. Built upon this, we
further propose the Style Guided Control (SGC) mechanism, which enables
exclusive geometry- or texture-only stylization, as well as adjustable style
intensity control. Extensive experiments demonstrate that StyleSculptor
outperforms existing baseline methods in producing high-fidelity 3D assets.
comment: SIGGRAPH Asia 2025, Project page:https://stylesculptor.github.io
♻ ☆ SCALP: Superpixels with Contour Adherence using Linear Path ICPR
Superpixel decomposition methods are generally used as a pre-processing step
to speed up image processing tasks. They group the pixels of an image into
homogeneous regions while trying to respect existing contours. For all
state-of-the-art superpixel decomposition methods, a trade-off is made between
1) computational time, 2) adherence to image contours and 3) regularity and
compactness of the decomposition. In this paper, we propose a fast method to
compute Superpixels with Contour Adherence using Linear Path (SCALP) in an
iterative clustering framework. The distance computed when trying to associate
a pixel to a superpixel during the clustering is enhanced by considering the
linear path to the superpixel barycenter. The proposed framework produces
regular and compact superpixels that adhere to the image contours. We provide a
detailed evaluation of SCALP on the standard Berkeley Segmentation Dataset. The
obtained results outperform state-of-the-art methods in terms of standard
superpixel and contour detection metrics.
comment: International Conference on Pattern Recognition (ICPR) 2016
♻ ☆ Noise2Ghost: Self-supervised deep convolutional reconstruction for ghost imaging
We present a new self-supervised deep-learning-based Ghost Imaging (GI)
reconstruction method, which provides unparalleled reconstruction performance
for noisy acquisitions among unsupervised methods. We present the supporting
mathematical framework and results from theoretical and real data use cases.
Self-supervision removes the need for clean reference data while offering
strong noise reduction. This provides the necessary tools for addressing
signal-to-noise ratio concerns for GI acquisitions in emerging and cutting-edge
low-light GI scenarios. Notable examples include micro- and nano-scale x-ray
emission imaging, e.g., x-ray fluorescence imaging of dose-sensitive samples.
Their applications include in-vivo and in-operando case studies for biological
samples and batteries.
♻ ☆ Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients WACV-2026
Deep learning models achieve high predictive performance but lack intrinsic
interpretability, hindering our understanding of the learned prediction
behavior. Existing local explainability methods focus on associations,
neglecting the causal drivers of model predictions. Other approaches adopt a
causal perspective but primarily provide global, model-level explanations.
However, for specific inputs, it's unclear whether globally identified factors
apply locally. To address this limitation, we introduce a novel framework for
local interventional explanations by leveraging recent advances in
image-to-image editing models. Our approach performs gradual interventions on
semantic properties to quantify the corresponding impact on a model's
predictions using a novel score, the expected property gradient magnitude. We
demonstrate the effectiveness of our approach through an extensive empirical
evaluation on a wide range of architectures and tasks. First, we validate it in
a synthetic scenario and demonstrate its ability to locally identify biases.
Afterward, we apply our approach to investigate medical skin lesion
classifiers, analyze network training dynamics, and study a pre-trained CLIP
model with real-life interventional data. Our results highlight the potential
of interventional explanations on the property level to reveal new insights
into the behavior of deep models.
comment: Accepted at WACV-2026, 45 pages, 39 figures, 15 tables
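A hedged sketch of how the expected property gradient magnitude could be computed, assuming one already has a series of edits of the same input at increasing intervention strengths for one semantic property; the finite-difference form and the probability-valued model output are assumptions for illustration, not the paper's exact definition.

import numpy as np

def expected_property_gradient_magnitude(model_prob, edited_images, strengths):
    """Finite-difference sketch of an "expected property gradient magnitude".

    edited_images[i] is the input edited with intervention strength
    strengths[i]; model_prob maps an image to the probability of the class
    under study. The score averages the magnitude of the probability change
    per unit of intervention strength.
    """
    probs = np.array([model_prob(x) for x in edited_images], dtype=np.float64)
    grads = np.diff(probs) / np.diff(np.asarray(strengths, dtype=np.float64))
    return np.mean(np.abs(grads))

# Toy usage with a stand-in "model" that reacts to mean brightness.
rng = np.random.default_rng(0)
base = rng.random((32, 32, 3))
strengths = np.linspace(0.0, 1.0, 6)
edited = [np.clip(base + s * 0.2, 0, 1) for s in strengths]  # gradual edits
fake_model = lambda x: 1.0 / (1.0 + np.exp(-(x.mean() - 0.5) * 10))
print(expected_property_gradient_magnitude(fake_model, edited, strengths))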
♻ ☆ Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound
Foley sound synthesis is crucial for multimedia production, enhancing user
experience by synchronizing audio and video both temporally and semantically.
Recent studies on automating this labor-intensive process through
video-to-sound generation face significant challenges. Systems lacking explicit
temporal features suffer from poor alignment and controllability, while
timestamp-based models require costly and subjective human annotation. We
propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as an
intuitive condition with semantic timbre prompts (audio or text). RMS, a
frame-level intensity envelope closely related to audio semantics, acts as a
temporal event feature to guide audio generation from video. The
annotation-free self-supervised learning framework consists of two stages,
Video2RMS and RMS2Sound, incorporating novel ideas including RMS discretization
and RMS-ControlNet with a pretrained text-to-audio model. Our extensive
evaluation shows that Video-Foley achieves state-of-the-art performance in
audio-visual alignment and controllability for sound timing, intensity, timbre,
and nuance. Source code, model weights and demos are available on our companion
website. (https://jnwnlee.github.io/video-foley-demo)
comment: Accepted at IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP)
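A small numpy sketch of the RMS temporal condition, assuming a mono waveform: a frame-level RMS envelope is computed and then quantized into discrete bins, in the spirit of the paper's RMS discretization; frame, hop, and bin sizes here are placeholders.

import numpy as np

def rms_envelope(waveform, frame_length=1024, hop_length=512):
    """Frame-level RMS intensity envelope of a mono waveform (numpy sketch)."""
    n_frames = 1 + max(0, (len(waveform) - frame_length) // hop_length)
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = waveform[i * hop_length: i * hop_length + frame_length]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms

def discretize_rms(rms, n_bins=32):
    """Map the continuous envelope to integer bins, a stand-in for the paper's
    RMS discretization step (the bin count here is hypothetical)."""
    norm = rms / (rms.max() + 1e-8)
    return np.clip((norm * n_bins).astype(int), 0, n_bins - 1)

# Toy usage: a decaying 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t) * np.exp(-3 * t)
env = rms_envelope(audio)
print(discretize_rms(env)[:10])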
♻ ☆ Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Han Feng, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Identity-preserving text-to-video (IPT2V) generation, which aims to create
high-fidelity videos with consistent human identity, has become crucial for
downstream applications. However, current end-to-end frameworks suffer a
critical spatial-temporal trade-off: optimizing for spatially coherent layouts
of key elements (e.g., character identity preservation) often compromises
instruction-compliant temporal smoothness, while prioritizing dynamic realism
risks disrupting the spatial coherence of visual structures. To tackle this
issue, we propose a simple yet effective spatial-temporal decoupled framework
that decomposes representations into spatial features for layouts and temporal
features for motion dynamics. Specifically, our paper proposes a semantic
prompt optimization mechanism and stage-wise decoupled generation paradigm. The
former module decouples the prompt into spatial and temporal components.
Aligned with the subsequent stage-wise decoupled approach, the spatial prompts
guide the text-to-image (T2I) stage to generate coherent spatial features,
while the temporal prompts direct the sequential image-to-video (I2V) stage to
ensure motion consistency. Experimental results validate that our approach
achieves excellent spatiotemporal consistency, demonstrating outstanding
performance in identity preservation, text relevance, and video quality. By
leveraging this simple yet robust mechanism, our algorithm secures the
runner-up position in 2025 ACM MultiMedia Challenge.
♻ ☆ UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision
Yuru Wang, Pei Liu, Songtao Wang, Zehan Zhang, Xinyan Lu, Changwei Cai, Hao Li, Fu Liu, Peng Jia, Xianpeng Lang
Open-world 3D scene understanding is a critical challenge that involves
recognizing and distinguishing diverse objects and categories from 3D data,
such as point clouds, without relying on manual annotations. Traditional
methods struggle with this open-world task, especially due to the limitations
of constructing extensive point cloud-text pairs and handling multimodal data
effectively. In response to these challenges, we present UniPLV, a robust
framework that unifies point clouds, images, and text within a single learning
paradigm for comprehensive 3D scene understanding. UniPLV leverages images as a
bridge to co-embed 3D points with pre-aligned images and text in a shared
feature space, eliminating the need for labor-intensive point cloud-text pair
crafting. Our framework achieves precise multimodal alignment through two
innovative strategies: (i) Logit and feature distillation modules between
images and point clouds to enhance feature coherence; (ii) A vision-point
matching module that implicitly corrects 3D semantic predictions affected by
projection inaccuracies from points to pixels. To further boost performance, we
implement four task-specific losses alongside a two-stage training strategy.
Extensive experiments demonstrate that UniPLV significantly surpasses
state-of-the-art methods, with average improvements of 15.6% and 14.8% in
semantic segmentation for Base-Annotated and Annotation-Free tasks,
respectively. These results underscore UniPLV's efficacy in pushing the
boundaries of open-world 3D scene understanding. We will release the code to
support future research and development.
♻ ☆ Robust Shape Regularity Criteria for Superpixel Evaluation
Regular decompositions are necessary for most superpixel-based object
recognition or tracking applications. So far in the literature, the regularity
or compactness of a superpixel shape is mainly measured by its circularity. In
this work, we first demonstrate that such a measure is ill-suited to superpixel
evaluation, since it captures circular appearance rather than regularity
itself. Then, we propose a new metric that considers several shape
regularity aspects: convexity, balanced repartition, and contour smoothness.
Finally, we demonstrate that our measure is robust to scale and noise and
enables a more relevant comparison of superpixel methods.
comment: International Conference on Image Processing 2017
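A short sketch illustrating the argument, under crude pixel-grid approximations: circularity penalizes a perfectly regular square, whereas a convexity-style term (one of the aspects the proposed metric considers) does not. This is not the paper's full metric.

import numpy as np
from scipy.spatial import ConvexHull

def circularity(mask):
    """4*pi*Area / Perimeter^2; equals 1 only for a disc, so regular but
    non-circular shapes (e.g. squares) are penalized. Perimeter is the count
    of 4-connected boundary pixels, a crude approximation."""
    area = mask.sum()
    pad = np.pad(mask, 1)
    interior = (pad[1:-1, 1:-1] & pad[:-2, 1:-1] & pad[2:, 1:-1]
                & pad[1:-1, :-2] & pad[1:-1, 2:])
    perimeter = (mask & ~interior).sum()
    return 4 * np.pi * area / (perimeter ** 2 + 1e-8)

def convexity(mask):
    """Shape area divided by convex-hull area (roughly 1 for convex shapes
    on a pixel grid): one regularity aspect mentioned in the abstract."""
    ys, xs = np.nonzero(mask)
    hull = ConvexHull(np.column_stack([xs, ys]))
    return mask.sum() / hull.volume  # .volume is the 2D hull area

square = np.zeros((64, 64), dtype=bool)
square[16:48, 16:48] = True
# A square is regular but not circular: circularity < 1, convexity close to 1.
print(f"circularity={circularity(square):.2f}, convexity={convexity(square):.2f}")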
♻ ☆ Reconstruction and Reenactment Separated Method for Realistic Gaussian Head
In this paper, we explore a reconstruction and reenactment separated
framework for 3D Gaussian heads, which requires only a single portrait image as
input to generate a controllable avatar. Specifically, we develop a large-scale
one-shot Gaussian head generator built upon WebSSL and employ a two-stage
training approach that significantly enhances generalization and high-frequency
texture reconstruction. During inference, an ultra-lightweight Gaussian avatar
driven by control signals enables high frame-rate rendering, achieving 90 FPS
at a resolution of 512x512. We further
demonstrate that the proposed framework follows the scaling law, whereby
increasing the parameter scale of the reconstruction module leads to improved
performance. Moreover, thanks to the separation design, driving efficiency
remains unaffected. Finally, extensive quantitative and qualitative experiments
validate that our approach outperforms current state-of-the-art methods.
♻ ☆ MEGANet-W: A Wavelet-Driven Edge-Guided Attention Framework for Weak Boundary Polyp Detection IEEE
Colorectal polyp segmentation is critical for early detection of colorectal
cancer, yet weak and low contrast boundaries significantly limit automated
accuracy. Existing deep models either blur fine edge details or rely on
handcrafted filters that perform poorly under variable imaging conditions. We
propose MEGANet-W, a Wavelet-Driven Edge-Guided Attention Network that injects
directional, parameter-free Haar wavelet edge maps into each decoder stage to
recalibrate semantic features. The key novelties of MEGANet-W include a
two-level Haar wavelet head for multi-orientation edge extraction; and Wavelet
Edge Guided Attention (W-EGA) modules that fuse wavelet cues with boundary and
input branches. On five public polyp datasets, MEGANet-W consistently
outperforms existing methods, improving mIoU by up to 2.3% and mDice by 1.2%,
while introducing no additional learnable parameters. This approach improves
reliability in difficult cases and offers a robust solution for medical image
segmentation tasks requiring precise boundary detection.
comment: This work has been submitted to the IEEE for possible publication
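A parameter-free one-level Haar decomposition sketch showing how directional edge maps can be obtained from 2x2 blocks; MEGANet-W uses a two-level head, so this is only the generic building block, not the paper's module.

import numpy as np

def haar_edge_maps(gray):
    """One-level 2D Haar transform on 2x2 blocks; the three detail bands act
    as parameter-free directional edge maps (horizontal, vertical, diagonal).
    """
    h, w = gray.shape[0] // 2 * 2, gray.shape[1] // 2 * 2
    g = gray[:h, :w].astype(np.float32)
    a, b = g[0::2, 0::2], g[0::2, 1::2]   # top-left, top-right of each block
    c, d = g[1::2, 0::2], g[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 4.0            # approximation band
    lh = (a + b - c - d) / 4.0            # horizontal edges
    hl = (a - b + c - d) / 4.0            # vertical edges
    hh = (a - b - c + d) / 4.0            # diagonal edges
    return ll, lh, hl, hh

img = np.zeros((64, 64), dtype=np.float32)
img[:, 31:] = 1.0                          # vertical step edge
_, lh, hl, hh = haar_edge_maps(img)
print(np.abs(lh).max(), np.abs(hl).max(), np.abs(hh).max())  # HL band responds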
♻ ☆ Superpixel-based Color Transfer
In this work, we propose a fast superpixel-based color transfer method (SCT)
between two images. Superpixels reduce the image dimensionality and allow the
extraction of a reduced set of color candidates. We propose to use a fast approximate
nearest neighbor matching algorithm in which we enforce the match diversity by
limiting the selection of the same superpixels. A fusion framework is designed
to transfer the matched colors, and we demonstrate the improvement obtained
over exact matching results. Finally, we show that SCT is visually competitive
compared to state-of-the-art methods.
♻ ☆ Lightweight Gradient-Aware Upscaling of 3D Gaussian Splatting Images
We introduce an image upscaling technique tailored for 3D Gaussian Splatting
(3DGS) on lightweight GPUs. Compared to 3DGS, it achieves significantly higher
rendering speeds and reduces artifacts commonly observed in 3DGS
reconstructions. Our technique upscales low-resolution 3DGS renderings with a
marginal increase in cost by directly leveraging the analytical image gradients
of Gaussians for gradient-based bicubic spline interpolation. The technique is
agnostic to the specific 3DGS implementation, achieving novel view synthesis at
rates 3x-4x higher than the baseline implementation. Through extensive
experiments on multiple datasets, we showcase the performance improvements and
high reconstruction fidelity attainable with gradient-aware upscaling of 3DGS
images. We further demonstrate the integration of gradient-aware upscaling into
the gradient-based optimization of a 3DGS model and analyze its effects on
reconstruction quality and performance.
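A 1D sketch of gradient-aware spline interpolation, assuming the derivative at each low-resolution sample is known analytically (as the image gradients of Gaussians are in 3DGS); the paper's 2D bicubic formulation is analogous but not reproduced here.

import numpy as np

def cubic_hermite_upsample(values, derivs, factor):
    """1D cubic Hermite upsampling that uses known derivatives at the
    low-resolution samples instead of estimating them from neighbors.
    derivs must be expressed per unit of the sample index (dy/dx * spacing).
    """
    n = len(values)
    xs = np.linspace(0, n - 1, (n - 1) * factor + 1)
    i = np.clip(np.floor(xs).astype(int), 0, n - 2)
    t = xs - i
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return (h00 * values[i] + h10 * derivs[i]
            + h01 * values[i + 1] + h11 * derivs[i + 1])

# Toy usage: a coarsely sampled sine with its analytical derivative.
x = np.linspace(0, 2 * np.pi, 9)
hi = cubic_hermite_upsample(np.sin(x), np.cos(x) * (x[1] - x[0]), factor=8)
print(np.abs(hi - np.sin(np.linspace(0, 2 * np.pi, len(hi)))).max())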
♻ ☆ Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems
Writing is a universal cultural technology that reuses vision for symbolic
communication. Humans display striking resilience: we readily recognize words
even when characters are fragmented, fused, or partially occluded. This paper
investigates whether advanced vision language models (VLMs) share this
resilience. We construct two psychophysics inspired benchmarks across distinct
writing systems, Chinese logographs and English alphabetic words, by splicing,
recombining, and overlaying glyphs to yield "visible but unreadable" stimuli
for models while remaining legible to humans. Despite strong performance on
clean text, contemporary VLMs show a severe drop under these perturbations,
frequently producing unrelated or incoherent outputs. The pattern suggests a
structural limitation: models heavily leverage generic visual invariances but
under-rely on compositional priors needed for robust literacy. We release
stimuli generation code, prompts, and evaluation protocols to facilitate
transparent replication and follow-up work. Our findings motivate architectures
and training strategies that encode symbol segmentation, composition, and
binding across scripts, and they delineate concrete challenges for deploying
multimodal systems in education, accessibility, cultural heritage, and
security.
♻ ☆ A Novel Compression Framework for YOLOv8: Achieving Real-Time Aerial Object Detection on Edge Devices via Structured Pruning and Channel-Wise Distillation
Efficient deployment of deep learning models for aerial object detection on
resource-constrained devices requires significant compression without
compromising performance. In this study, we propose a novel three-stage
compression pipeline for the YOLOv8 object detection model, integrating
sparsity-aware training, structured channel pruning, and Channel-Wise Knowledge
Distillation (CWD). First, sparsity-aware training introduces dynamic sparsity
during model optimization, effectively balancing parameter reduction and
detection accuracy. Second, we apply structured channel pruning by leveraging
batch normalization scaling factors to eliminate redundant channels,
significantly reducing model size and computational complexity. Finally, to
mitigate the accuracy drop caused by pruning, we employ CWD to transfer
knowledge from the original model, using an adjustable temperature and loss
weighting scheme tailored for small and medium object detection. Extensive
experiments on the VisDrone dataset demonstrate the effectiveness of our
approach across multiple YOLOv8 variants. For YOLOv8m, our method reduces model
parameters from 25.85M to 6.85M (a 73.51% reduction), FLOPs from 49.6G to
13.3G, and MACs from 101G to 34.5G, while reducing AP50 by only 2.7%. The
resulting compressed model achieves 47.9 AP50 and boosts inference speed from
26 FPS (YOLOv8m baseline) to 45 FPS, enabling real-time deployment on edge
devices. We further apply TensorRT as a lightweight optimization step. While
this introduces a minor drop in AP50 (from 47.9 to 47.6), it significantly
improves inference speed from 45 to 68 FPS, demonstrating the practicality of
our approach for high-throughput, resource-constrained scenarios.
comment: 28 pages, 11 figures
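A minimal PyTorch sketch of the structured pruning step, assuming a Conv-BN pair: output channels are ranked by the magnitude of their BatchNorm scaling factor (gamma) and the smallest are removed. The keep ratio and the handling of downstream layers are simplified assumptions, not the paper's schedule.

import torch
import torch.nn as nn

def prune_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d, keep_ratio=0.5):
    """Keep the top fraction of channels by |gamma|; follow-up layers would
    also need their input channels sliced accordingly (omitted here)."""
    gamma = bn.weight.detach().abs()
    k = max(1, int(keep_ratio * gamma.numel()))
    keep = torch.topk(gamma, k).indices.sort().values

    new_conv = nn.Conv2d(conv.in_channels, k, conv.kernel_size,
                         conv.stride, conv.padding, bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()

    new_bn = nn.BatchNorm2d(k)
    for name in ("weight", "bias", "running_mean", "running_var"):
        getattr(new_bn, name).data = getattr(bn, name).data[keep].clone()
    return new_conv, new_bn, keep

conv, bn = nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64)
pconv, pbn, kept = prune_conv_bn(conv, bn, keep_ratio=0.25)
print(pconv.weight.shape, pbn.weight.shape)   # 16 output channels remain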
♻ ☆ Structured Preference Optimization for Vision-Language Long-Horizon Task Planning
Xiwen Liang, Min Lin, Weiqi Ruan, Rongtao Xu, Yuecheng Liu, Jiaqi Chen, Bingqian Lin, Yuzheng Zhuang, Xiaodan Liang
Existing methods for vision-language task planning excel in short-horizon
tasks but often fall short in complex, long-horizon planning within dynamic
environments. These challenges primarily arise from the difficulty of
effectively training models to produce high-quality reasoning processes for
long-horizon tasks. To address this, we propose Structured Preference
Optimization (SPO), which aims to enhance reasoning and action selection in
long-horizon task planning through structured preference evaluation and
optimized training strategies. Specifically, SPO introduces: 1)
Preference-Based Scoring and Optimization, which systematically evaluates
reasoning chains based on task relevance, visual grounding, and historical
consistency; and 2) Curriculum-Guided Training, where the model progressively
adapts from simple to complex tasks, improving its generalization ability in
long-horizon scenarios and enhancing reasoning robustness. To advance research
in vision-language long-horizon task planning, we introduce ExtendaBench, a
comprehensive benchmark covering 1,509 tasks across VirtualHome and Habitat
2.0, categorized into ultra-short, short, medium, and long tasks. Experimental
results demonstrate that SPO significantly improves reasoning quality and final
decision accuracy, outperforming prior methods on long-horizon tasks and
underscoring the effectiveness of preference-driven optimization in
vision-language task planning. Specifically, SPO achieves a +5.98% GCR and
+4.68% SR improvement in VirtualHome and a +3.30% GCR and +2.11% SR improvement
in Habitat over the best-performing baselines.
comment: 18 pages
♻ ☆ Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy
Yangfan He, Jianhui Wang, Yijin Wang, Yan Zhong, Xinyuan Song, Junjiang Lin, Xinhang Yuan, Jingqun Tang, Yi Xin, Hao Zhang, Yuchen Li, Zijian Zhang, Hongyang He, Tianxiang Xu, Miao Zhang, Kuan Lu, Menghao Huo, Keqin Li, Jiaqi Chen, Tianyu Shi, Jianyuan Ni
Current image generation systems produce high-quality images but struggle
with ambiguous user prompts, making interpretation of actual user intentions
difficult. Many users must modify their prompts several times to ensure the
generated images meet their expectations. While some methods focus on enhancing
prompts so that the generated images fit user needs, models still struggle to
understand users' real needs, especially for non-expert users. In this
research, we aim to enhance the visual parameter-tuning process, making the
model user-friendly for individuals without specialized knowledge while better
capturing user needs. We propose a human-machine co-adaption strategy that uses
the mutual information between the user's prompts and the pictures under
modification as the optimization target, so that the system better adapts to
user needs. We find that the improved model reduces the need for multiple
rounds of adjustment. We also collect multi-round dialogue datasets containing
prompt-image pairs and user intent annotations. Various experiments demonstrate
the effectiveness of the proposed method on our dataset. Our dataset and
annotation tools will be available.
♻ ☆ Stereo Anything: Unifying Zero-shot Stereo Matching with Large-Scale Mixed Data
Xianda Guo, Chenming Zhang, Youmin Zhang, Ruilin Wang, Dujun Nie, Wenzhao Zheng, Matteo Poggi, Hao Zhao, Mang Ye, Qin Zou, Long Chen
Stereo matching serves as a cornerstone in 3D vision, aiming to establish
pixel-wise correspondences between stereo image pairs for depth recovery.
Despite remarkable progress driven by deep neural architectures, current models
often exhibit severe performance degradation when deployed in unseen domains,
primarily due to the limited diversity of training data. In this work, we
introduce StereoAnything, a data-centric framework that substantially enhances
the zero-shot generalization capability of existing stereo models. Rather than
devising yet another specialized architecture, we scale stereo training to an
unprecedented level by systematically unifying heterogeneous stereo sources:
(1) curated labeled datasets covering diverse environments, and (2) large-scale
synthetic stereo pairs generated from unlabeled monocular images. Our
mixed-data strategy delivers consistent and robust learning signals across
domains, effectively mitigating dataset bias. Extensive zero-shot evaluations
on four public benchmarks demonstrate that Stereo Anything achieves
state-of-the-art generalization. This work paves the way towards truly
universal stereo matching, offering a scalable data paradigm applicable to any
stereo image pair. Code is available at
https://github.com/XiandaGuo/OpenStereo.
comment: Code will be available at https://github.com/XiandaGuo/OpenStereo
♻ ☆ Leveraging Perceptual Scores for Dataset Pruning in Computer Vision Tasks CVPR 2024
In this paper we propose a score of an image to use for coreset selection in
image classification and semantic segmentation tasks. The score is the entropy
of an image as approximated by the bits-per-pixel of its compressed version.
Thus the score is intrinsic to an image and does not require supervision or
training. It is very simple to compute and readily available as all images are
stored in a compressed format. The motivation behind our choice of score is
that most other scores proposed in literature are expensive to compute. More
importantly, we want a score that captures the perceptual complexity of an
image. Entropy is one such measure: images with clutter tend to have higher
entropy. However, sampling only low-entropy iconic images, for example, leads
to biased learning and an overall decrease in test performance with current
deep learning models. To mitigate this bias, we use a graph-based method that
increases the spatial diversity of the selected samples. We show that this
simple score yields good results, particularly for semantic segmentation tasks.
comment: Non-archival presentation at the 1st Workshop on Dataset Distillation, CVPR 2024
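A sketch of the proposed score, assuming zlib compression as a stand-in for the stored compressed format the paper reads: the bits-per-pixel of the compressed image approximates its entropy, so cluttered images score higher than flat ones.

import zlib
import numpy as np

def perceptual_score(image_uint8):
    """Bits-per-pixel of a losslessly compressed image, used as a cheap proxy
    for perceptual complexity/entropy; no supervision or training needed."""
    h, w = image_uint8.shape[:2]
    compressed = zlib.compress(image_uint8.tobytes(), level=9)
    return 8.0 * len(compressed) / (h * w)

rng = np.random.default_rng(0)
flat = np.full((128, 128, 3), 128, dtype=np.uint8)             # low entropy
clutter = rng.integers(0, 256, (128, 128, 3), dtype=np.uint8)  # high entropy
print(perceptual_score(flat), perceptual_score(clutter))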
♻ ☆ CROP: Contextual Region-Oriented Visual Token Pruning EMNLP2025
Current VLM-based VQA methods often process entire images, leading to
excessive visual tokens that include redundant information irrelevant to the
posed question. This abundance of unnecessary image details creates numerous
visual tokens, drastically increasing memory and computational requirements in
VLMs. To address this, we propose Contextual Region-Oriented Visual Token
Pruning (CROP), a novel framework to compress visual tokens through a two-step
process: Localization and Pruning. Specifically, CROP first employs an
efficient model to identify the contextual region relevant to the input query.
Subsequently, two distinct strategies are introduced for pruning: (1) Pre-LLM
Compression (PLC), which adaptively compresses different image regions with
varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that
prunes tokens within early LLM layers guided by the identified contextual
region. Extensive experiments on a wide range of VQA tasks demonstrate that
CROP significantly outperforms existing visual token pruning methods and
achieves state-of-the-art performance.
comment: EMNLP2025 Main
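A training-free sketch of the pruning idea, assuming the localization step returns a normalized bounding box for the contextual region: only visual tokens whose patch centers fall inside the box, plus a few leading global tokens, are kept. The token layout and the number of retained global tokens are assumptions for illustration.

import torch

def prune_tokens_to_region(vis_tokens, grid_hw, box_xyxy, keep_global=4):
    """Keep visual tokens inside the contextual region (normalized x1,y1,x2,y2),
    plus keep_global leading tokens."""
    H, W = grid_hw
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    cx, cy = (xs.flatten() + 0.5) / W, (ys.flatten() + 0.5) / H
    x1, y1, x2, y2 = box_xyxy
    inside = (cx >= x1) & (cx <= x2) & (cy >= y1) & (cy <= y2)
    keep = torch.cat([torch.arange(keep_global),
                      keep_global + torch.nonzero(inside).flatten()])
    return vis_tokens[:, keep]

tokens = torch.randn(1, 4 + 24 * 24, 768)          # [global | 24x24 patches]
pruned = prune_tokens_to_region(tokens, (24, 24), (0.25, 0.25, 0.75, 0.75))
print(tokens.shape[1], "->", pruned.shape[1])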
♻ ☆ Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease
Medical vision-language models (Med-VLMs) have shown impressive results in
tasks such as report generation and visual question answering, but they still
face several limitations. Most notably, they underutilize patient metadata and
lack integration of clinical diagnostic knowledge. Moreover, most existing
models are typically trained from scratch or fine-tuned on large-scale 2D
image-text pairs, requiring extensive computational resources, and their
effectiveness on 3D medical imaging is often limited due to the absence of
structural information. To address these gaps, we propose a data-efficient
fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate
its application in Alzheimer's disease (AD) diagnosis. Our system introduces
two key innovations. First, we convert structured metadata into synthetic
reports, enriching textual input for improved image-text alignment. Second, we
add an auxiliary token trained to predict the Mini-Mental State Examination
(MMSE) score, a widely used clinical measure of cognitive function that
correlates with AD severity. This provides additional supervision for
fine-tuning. Applying lightweight prompt tuning to both image and text
modalities, our approach achieves state-of-the-art performance on two AD
datasets using 1,500 training images, outperforming existing methods fine-tuned
on 10,000 images. Code will be released upon publication.
♻ ☆ Direct Video-Based Spatiotemporal Deep Learning for Cattle Lameness Detection
Cattle lameness is a prevalent health problem in livestock farming, often
resulting from hoof injuries or infections, and severely impacts animal welfare
and productivity. Early and accurate detection is critical for minimizing
economic losses and ensuring proper treatment. This study proposes a
spatiotemporal deep learning framework for automated cattle lameness detection
using publicly available video data. We curate and publicly release a balanced
set of 50 online video clips featuring 42 individual cattle, recorded from
multiple viewpoints in both indoor and outdoor environments. The videos were
categorized into lame and non-lame classes based on visual gait characteristics
and metadata descriptions. After applying data augmentation techniques to
enhance generalization, two deep learning architectures were trained and
evaluated: 3D Convolutional Neural Networks (3D CNN) and Convolutional
Long Short-Term Memory (ConvLSTM2D). The 3D CNN achieved a video-level
classification accuracy of 90%, with a precision, recall, and F1 score of 90.9%
each, outperforming the ConvLSTM2D model, which achieved 85% accuracy. Unlike
conventional approaches that rely on multistage pipelines involving object
detection and pose estimation, this study demonstrates the effectiveness of a
direct end-to-end video classification approach. Compared with the best
end-to-end prior method (C3D-ConvLSTM, 90.3%), our model achieves comparable
accuracy while eliminating pose estimation pre-processing. The results indicate
that deep learning models can successfully extract and learn spatio-temporal
features from various video sources, enabling scalable and efficient cattle
lameness detection in real-world farm settings.
♻ ☆ Improving Generalizability of Kolmogorov-Arnold Networks via Error-Correcting Output Codes IEEE
Kolmogorov-Arnold Networks (KAN) offer universal function approximation using
univariate spline compositions without nonlinear activations. In this work, we
integrate Error-Correcting Output Codes (ECOC) into the KAN framework to
transform multi-class classification into multiple binary tasks, improving
robustness via Hamming distance decoding. Our proposed KAN with ECOC framework
outperforms vanilla KAN on a challenging blood cell classification dataset,
achieving higher accuracy across diverse hyperparameter settings. Ablation
studies further confirm that ECOC consistently enhances performance across
FastKAN and FasterKAN variants. These results demonstrate that ECOC integration
significantly boosts KAN generalizability in critical healthcare AI
applications. To the best of our knowledge, this is the first work of ECOC with
KAN for enhancing multi-class medical image classification performance.
comment: Accepted to IEEE BioCAS 2025
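A small numpy sketch of ECOC with Hamming-distance decoding, using a random codebook and simulated noisy binary heads; the codebook design and bit budget in the paper may differ.

import numpy as np

def ecoc_decode(binary_scores, codebook):
    """Decode multi-class labels from per-bit binary predictions by choosing
    the class codeword with the smallest Hamming distance."""
    bits = (binary_scores > 0.5).astype(int)                    # (N, n_bits)
    dists = (bits[:, None, :] != codebook[None, :, :]).sum(-1)  # (N, C)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
n_classes, n_bits = 8, 15
codebook = rng.integers(0, 2, (n_classes, n_bits))

# Simulate noisy binary heads: true codewords with 10% of the bits flipped.
true_labels = rng.integers(0, n_classes, 100)
noisy = codebook[true_labels] ^ (rng.random((100, n_bits)) < 0.10)
pred = ecoc_decode(noisy.astype(float), codebook)
print("accuracy:", (pred == true_labels).mean())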
♻ ☆ Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics EMNLP 2025
Understanding humor is a core aspect of social intelligence, yet it remains a
significant challenge for Large Multimodal Models (LMMs). We introduce
PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed
to evaluate LMMs' ability to interpret multimodal humor and recognize narrative
sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for
instance, top models achieve only 61% accuracy in panel sequencing, far below
human performance. This underscores critical limitations in current models'
integration of visual and textual cues for coherent narrative and humor
understanding. By providing a rigorous framework for evaluating multimodal
contextual and narrative reasoning, PixelHumor aims to drive the development of
LMMs that better engage in natural, socially aware interactions.
comment: 27 pages, 8 figures, EMNLP 2025 Findings
♻ ☆ Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan
Recent advances in audio-driven avatar video generation have significantly
enhanced audio-visual realism. However, existing methods treat instruction
conditioning merely as low-level tracking driven by acoustic or visual cues,
without modeling the communicative purpose conveyed by the instructions. This
limitation compromises their narrative coherence and character expressiveness.
To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that
unifies multimodal instruction understanding with photorealistic portrait
generation. Our approach adopts a two-stage pipeline. In the first stage, we
design a multimodal large language model (MLLM) director that produces a
blueprint video conditioned on diverse instruction signals, thereby governing
high-level semantics such as character motion and emotions. In the second
stage, guided by blueprint keyframes, we generate multiple sub-clips in
parallel using a first-last frame strategy. This global-to-local framework
preserves fine-grained details while faithfully encoding the high-level intent
behind multimodal instructions. Our parallel architecture also enables fast and
stable generation of long-duration videos, making it suitable for real-world
applications such as digital human livestreaming and vlogging. To
comprehensively evaluate our method, we construct a benchmark of 375 curated
samples covering diverse instructions and challenging scenarios. Extensive
experiments demonstrate that Kling-Avatar is capable of generating vivid,
fluent, long-duration videos at up to 1080p and 48 fps, achieving superior
performance in lip synchronization accuracy, emotion and dynamic
expressiveness, instruction controllability, identity preservation, and
cross-domain generalization. These results establish Kling-Avatar as a new
benchmark for semantically grounded, high-fidelity audio-driven avatar
synthesis.
comment: Technical Report. Project Page: https://klingavatar.github.io/
♻ ☆ PlaneRecTR++: Unified Query Learning for Joint 3D Planar Reconstruction and Pose Estimation IEEE
The challenging task of 3D planar reconstruction from images involves several
sub-tasks including frame-wise plane detection, segmentation, parameter
regression and possibly depth prediction, along with cross-frame plane
correspondence and relative camera pose estimation. Previous works adopt a
divide and conquer strategy, addressing above sub-tasks with distinct network
modules in a two-stage paradigm. Specifically, given an initial camera pose and
per-frame plane predictions from the first stage, further exclusively designed
modules relying on external plane correspondence labeling are applied to merge
multi-view plane entities and produce refined camera pose. Notably, existing
work fails to integrate these closely related sub-tasks into a unified
framework, and instead addresses them separately and sequentially, which we
identify as a primary source of performance limitations. Motivated by this
finding and the success of query-based learning in enriching reasoning among
semantic entities, in this paper, we propose PlaneRecTR++, a Transformer-based
architecture, which for the first time unifies all tasks of multi-view planar
reconstruction and pose estimation within a compact single-stage framework,
eliminating the need for the initial pose estimation and supervision of plane
correspondence. Extensive quantitative and qualitative experiments demonstrate
that our proposed unified learning achieves mutual benefits across sub-tasks,
achieving a new state-of-the-art performance on the public ScanNetv1,
ScanNetv2, NYUv2-Plane, and MatterPort3D datasets. Codes are available at
https://github.com/SJingjia/PlaneRecTR-PP.
comment: To be published in IEEE T-PAMI 2025. This is the journal extension of
our ICCV 2023 paper "PlaneRecTR", which expands from single view
reconstruction to simultaneous multi-view reconstruction and camera pose
estimation. Note that the ICCV 2023 PlaneRecTR paper can be found in the
previous arXiv version v2 (arXiv:2307.13756v2)
♻ ☆ DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing
Fashion image editing is a crucial tool for designers to convey their
creative ideas by visualizing design concepts interactively. Current fashion
image editing techniques, though advanced with multimodal prompts and powerful
diffusion models, often struggle to accurately identify editing regions and
preserve the desired garment texture detail. To address these challenges, we
introduce a new multimodal fashion image editing architecture based on latent
diffusion models, called Detail-Preserved Diffusion Models (DPDEdit). DPDEdit
guides the fashion image generation of diffusion models by integrating text
prompts, region masks, human pose images, and garment texture images. To
precisely locate the editing region, we first introduce Grounded-SAM to predict
the editing region based on the user's textual description, and then combine it
with other conditions to perform local editing. To transfer the detail of the
given garment texture into the target fashion image, we propose a texture
injection and refinement mechanism. Specifically, this mechanism employs a
decoupled cross-attention layer to integrate textual descriptions and texture
images, and incorporates an auxiliary U-Net to preserve the high-frequency
details of generated garment texture. Additionally, we extend the VITON-HD
dataset using a multimodal large language model to generate paired samples with
texture images and textual descriptions. Extensive experiments show that our
DPDEdit outperforms state-of-the-art methods in terms of image fidelity and
coherence with the given multimodal inputs.
comment: 13 pages, 12 figures
♻ ☆ TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning
Jiacheng Liu, Pengxiang Ding, Qihang Zhou, Yuxuan Wu, Da Huang, Zimian Peng, Wei Xiao, Weinan Zhang, Lixin Yang, Cewu Lu, Donglin Wang
Recent Vision-Language-Action models show potential to generalize across
embodiments but struggle to quickly align with a new robot's action space when
high-quality demonstrations are scarce, especially for bipedal humanoids. We
present TrajBooster, a cross-embodiment framework that leverages abundant
wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector
trajectories as a morphology-agnostic interface. TrajBooster (i) extracts 6D
dual-arm end-effector trajectories from real-world wheeled humanoids, (ii)
retargets them in simulation to Unitree G1 with a whole-body controller trained
via a heuristic-enhanced harmonized online DAgger to lift low-dimensional
trajectory references into feasible high-dimensional whole-body actions, and
(iii) forms heterogeneous triplets that couple source vision/language with
target humanoid-compatible actions to post-pre-train a VLA, followed by only 10
minutes of teleoperation data collection on the target humanoid domain.
Deployed on Unitree G1, our policy achieves beyond-tabletop household tasks,
enabling squatting, cross-height manipulation, and coordinated whole-body
motion with markedly improved robustness and generalization. Results show that
TrajBooster allows existing wheeled-humanoid data to efficiently strengthen
bipedal humanoid VLA performance, reducing reliance on costly same-embodiment
data while enhancing action space understanding and zero-shot skill transfer
capabilities. For more details, please refer to our project page:
https://jiachengliu3.github.io/TrajBooster/.
♻ ☆ Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations
Vision-language-action (VLA) models finetuned from vision-language models
(VLMs) hold the promise of leveraging rich pretrained representations to build
generalist robots across diverse tasks and environments. However, direct
fine-tuning on robot data often disrupts these representations and limits
generalization. We present a framework that better preserves pretrained
features while adapting them for robot manipulation. Our approach introduces
three components: (i) a dual-encoder design with one frozen vision encoder to
retain pretrained features and another trainable for task adaptation, (ii) a
string-based action tokenizer that casts continuous actions into character
sequences aligned with the model's pretraining domain, and (iii) a co-training
strategy that combines robot demonstrations with vision-language datasets
emphasizing spatial reasoning and affordances. Evaluations in simulation and on
real robots show that our method improves robustness to visual perturbations,
generalization to novel instructions and environments, and overall task success
compared to baselines.
comment: Project Page: https://gen-vla.github.io/
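A hedged sketch of a string-based action tokenizer: continuous action vectors are serialized as fixed-precision text so a language-model tokenizer can consume them, and parsed back at decoding time. The exact formatting scheme below is an assumption, not the paper's.

import numpy as np

def actions_to_string(action, decimals=3):
    """Serialize a continuous action vector as plain text (space-separated
    fixed-point values), keeping it inside the model's pretraining domain."""
    return " ".join(f"{a:+.{decimals}f}" for a in action)

def string_to_actions(text):
    """Inverse mapping used when decoding the model's generated text."""
    return np.array([float(tok) for tok in text.split()], dtype=np.float32)

action = np.array([0.12345, -0.5, 0.0, 0.987, -0.25, 0.333, 1.0])
encoded = actions_to_string(action)
decoded = string_to_actions(encoded)
print(encoded)
print(np.abs(decoded - action).max())  # rounding error only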
♻ ☆ A Culturally-diverse Multilingual Multimodal Video Benchmark & Model
Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More, Sanoojan Baliah, Hasindri Watawana, Yuhao Li, Fabian Farestam, Leon Schaller, Roman Tymtsiv, Simon Weber, Hisham Cholakkal, Ivan Laptev, Shin'ichi Satoh, Michael Felsberg, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan
Large multimodal models (LMMs) have recently gained attention due to their
effectiveness to understand and generate descriptions of visual content. Most
existing LMMs are in English language. While few recent works explore
multilingual image LMMs, to the best of our knowledge, moving beyond the
English language for cultural and linguistic inclusivity is yet to be
investigated in the context of video LMMs. In pursuit of more inclusive video
LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to
evaluate Video LMMs across 14 languages, including both low- and high-resource
languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian,
Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is
designed to rigorously test video LMMs across 15 categories including eight
culturally diverse categories, ranging from lifestyles and festivals to foods
and rituals and from local landmarks to prominent cultural personalities.
ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice
questions spanning various video durations (short, medium, and long) with 8k
samples that are manually verified by native language speakers. In addition, we
also introduce a machine translated multilingual video training set comprising
1.2 million samples and develop a simple multilingual video LMM, named ViMUL,
that is shown to provide a better tradeoff between high-and low-resource
languages for video understanding. We hope our ViMUL-Bench and multilingual
video LMM along with a large-scale multilingual video training set will help
ease future research in developing cultural and linguistic inclusive
multilingual video LMMs. Our proposed benchmark, video LMM and training data
will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.
♻ ☆ GWM: Towards Scalable Gaussian World Models for Robotic Manipulation ICCV 2025
Training robot policies within a learned world model is trending due to the
inefficiency of real-world interactions. The established image-based world
models and policies have shown prior success, but lack robust geometric
information that requires consistent spatial and physical understanding of the
three-dimensional world, even pre-trained on internet-scale video sources. To
this end, we propose a novel branch of world model named Gaussian World Model
(GWM) for robotic manipulation, which reconstructs the future state by
inferring the propagation of Gaussian primitives under the effect of robot
actions. At its core is a latent Diffusion Transformer (DiT) combined with a 3D
variational autoencoder, enabling fine-grained scene-level future state
reconstruction with Gaussian Splatting. GWM can not only enhance the visual
representation for imitation learning agent by self-supervised future
prediction training, but can serve as a neural simulator that supports
model-based reinforcement learning. Both simulated and real-world experiments
depict that GWM can precisely predict future scenes conditioned on diverse
robot actions, and can be further utilized to train policies that outperform
the state-of-the-art by impressive margins, showcasing the initial data scaling
potential of 3D world model.
comment: Published at ICCV 2025. Project page:
https://gaussian-world-model.github.io/
♻ ☆ Skyshield: Event-Driven Submillimetre Thin Obstacle Detection for Drone Flight Safety
Drones operating in complex environments face a significant threat from thin
obstacles, such as steel wires and kite strings at the submillimeter level,
which are notoriously difficult for conventional sensors like RGB cameras,
LiDAR, and depth cameras to detect. This paper introduces SkyShield, an
event-driven, end-to-end framework designed for the perception of submillimeter
scale obstacles. Drawing upon the unique features that thin obstacles present
in the event stream, our method employs a lightweight U-Net architecture and an
innovative Dice-Contour Regularization Loss to ensure precise detection.
Experimental results demonstrate that our event-based approach achieves a mean
F1 score of 0.7088 with a low latency of 21.2 ms, making it ideal for deployment
on edge and mobile platforms.
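For context, a standard soft Dice loss in PyTorch, a common starting point for segmenting sparse thin structures; the paper's Dice-Contour Regularization Loss adds a contour term not reproduced here.

import torch

def dice_loss(pred_logits, target, eps=1.0):
    """Soft Dice loss for binary segmentation; well suited to thin structures
    where foreground pixels are scarce. Standard Dice term only."""
    probs = torch.sigmoid(pred_logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

# Toy usage: a 1-pixel-wide wire-like target on a 64x64 event frame.
target = torch.zeros(2, 1, 64, 64)
target[:, :, 30, :] = 1.0
logits = torch.randn(2, 1, 64, 64)
print(dice_loss(logits, target).item())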
♻ ☆ Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets
Crack detection plays a crucial role in civil infrastructures, including
inspection of pavements, buildings, etc., and deep learning has significantly
advanced this field in recent years. While numerous technical and review papers
exist in this domain, emerging trends are reshaping the landscape. These shifts
include transitions in learning paradigms (from fully supervised learning to
semi-supervised, weakly-supervised, unsupervised, few-shot, domain adaptation
and fine-tuning foundation models), improvements in generalizability (from
single-dataset performance to cross-dataset evaluation), and diversification in
dataset acquisition (from RGB images to specialized sensor-based data). In this
review, we systematically analyze these trends and highlight representative
works. Additionally, we introduce a new annotated dataset collected with 3D
laser scans, 3DCrack, to support future research and conduct extensive
benchmarking experiments to establish baselines for commonly used deep learning
methodologies, including recent foundation models. Our findings provide
insights into the evolving methodologies and future directions in deep
learning-based crack detection. Project page:
https://github.com/nantonzhang/Awesome-Crack-Detection
comment: under review
♻ ☆ DiffGAN: A Test Generation Approach for Differential Testing of Deep Neural Networks for Image Analysis IEEE
Deep Neural Networks (DNNs) are increasingly deployed across applications.
However, ensuring their reliability remains a challenge, and in many
situations, alternative models with similar functionality and accuracy are
available. Traditional accuracy-based evaluations often fail to capture
behavioral differences between models, especially with limited test datasets,
making it difficult to select or combine models effectively. Differential
testing addresses this by generating test inputs that expose discrepancies in
DNN model behavior. However, existing approaches face significant limitations:
many rely on model internals or are constrained by available seed inputs. To
address these challenges, we propose DiffGAN, a black-box test image generation
approach for differential testing of DNN models. DiffGAN leverages a Generative
Adversarial Network (GAN) and the Non-dominated Sorting Genetic Algorithm II to
generate diverse and valid triggering inputs that reveal behavioral
discrepancies between models. DiffGAN employs two custom fitness functions,
focusing on diversity and divergence, to guide the exploration of the GAN input
space and identify discrepancies between models' outputs. By strategically
searching this space, DiffGAN generates inputs with specific features that
trigger differences in model behavior. DiffGAN is black-box, making it
applicable in more situations. We evaluate DiffGAN on eight DNN model pairs
trained on widely used image datasets. Our results show DiffGAN significantly
outperforms a SOTA baseline, generating four times more triggering inputs, with
greater diversity and validity, within the same budget. Additionally, the
generated inputs improve the accuracy of a machine learning-based model
selection mechanism, which selects the best-performing model based on input
characteristics and can serve as a smart output voting mechanism when using
alternative models.
comment: Accepted into IEEE Transactions on Software Engineering
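An illustrative divergence-style fitness, assuming two image classifiers with softmax outputs: candidate inputs are scored by how strongly the models disagree. The paper's actual fitness functions (and its diversity term) may be defined differently; this is a stand-in.

import torch
import torch.nn.functional as F

def divergence_fitness(model_a, model_b, images):
    """Score candidate images by the L1 gap between the two models' softmax
    outputs, plus a bonus when the predicted labels differ."""
    with torch.no_grad():
        pa, pb = F.softmax(model_a(images), -1), F.softmax(model_b(images), -1)
    gap = (pa - pb).abs().sum(dim=-1)
    label_flip = (pa.argmax(-1) != pb.argmax(-1)).float()
    return gap + label_flip          # higher = more behaviorally divergent

# Toy usage with two randomly initialized classifiers.
net_a = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
net_b = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
print(divergence_fitness(net_a, net_b, torch.randn(8, 3, 32, 32)))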
♻ ☆ Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, Kai Yuan
Natural language is often the easiest and most convenient modality for humans
to specify tasks for robots. However, learning to ground language to behavior
typically requires impractical amounts of diverse, language-annotated
demonstrations collected on each target robot. In this work, we aim to separate
the problem of what to accomplish from how to accomplish it, as the former can
benefit from substantial amounts of external observation-only data, and only
the latter depends on a specific robot embodiment. To this end, we propose
Video-Language Critic, a reward model that can be trained on readily available
cross-embodiment data using contrastive learning and a temporal ranking
objective, and use it to score behavior traces from a separate actor. When
trained on Open X-Embodiment data, our reward model enables 2x more
sample-efficient policy training on Meta-World tasks than a sparse reward only,
despite a significant domain gap. Using in-domain data but in a challenging
task generalization setting on Meta-World, we further demonstrate more
sample-efficient training than is possible with prior language-conditioned
reward models that are either trained with binary classification, use static
images, or do not leverage the temporal information present in video data.
comment: 14 pages in the main text, 22 pages including references and
supplementary materials. 3 figures and 3 tables in the main text, 6 figures
and 3 tables in supplementary materials
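A sketch of one possible temporal ranking objective, assuming per-frame critic scores for videos of successfully completed tasks: later frames should not score lower than earlier ones, enforced with a pairwise hinge. This is a generic reading of "temporal ranking objective", not the paper's exact loss.

import torch

def temporal_ranking_loss(frame_scores, margin=0.0):
    """Pairwise temporal ranking: every (earlier, later) frame pair contributes
    a hinge penalty when the later score falls below the earlier one."""
    # frame_scores: (B, T) critic scores for T ordered frames per video.
    s_i = frame_scores.unsqueeze(2)   # earlier frame scores (B, T, 1)
    s_j = frame_scores.unsqueeze(1)   # later frame scores   (B, 1, T)
    t = torch.arange(frame_scores.shape[1])
    later = (t.unsqueeze(0) > t.unsqueeze(1)).float()   # mask where j > i
    violations = torch.relu(margin + s_i - s_j) * later
    return violations.sum() / (later.sum() * frame_scores.shape[0])

scores = torch.randn(4, 16, requires_grad=True)   # toy critic outputs
loss = temporal_ranking_loss(scores, margin=0.1)
loss.backward()
print(loss.item())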