Computer Vision and Pattern Recognition 118
☆ Carousel: A High-Resolution Dataset for Multi-Target Automatic Image Cropping
Automatic image cropping is a method for maximizing the human-perceived
quality of cropped regions in photographs. Although several works have proposed
techniques for producing a single crop, little work has addressed the problem
of producing multiple, distinct crops with aesthetic appeal. In this paper, we
motivate the problem with a discussion on modern social media applications,
introduce a dataset of 277 relevant images and human labels, and evaluate the
efficacy of several single-crop models with an image partitioning algorithm as
a pre-processing step. The dataset is available at
https://github.com/RafeLoya/carousel.
comment: Accepted to the Datasets track of VCIP 2025
☆ GentleHumanoid: Learning Upper-body Compliance for Contact-rich Human and Object Interaction
Humanoid robots are expected to operate in human-centered environments where
safe and natural physical interaction is essential. However, most recent
reinforcement learning (RL) policies emphasize rigid tracking and suppress
external forces. Existing impedance-augmented approaches are typically
restricted to base or end-effector control and focus on resisting extreme
forces rather than enabling compliance. We introduce GentleHumanoid, a
framework that integrates impedance control into a whole-body motion tracking
policy to achieve upper-body compliance. At its core is a unified spring-based
formulation that models both resistive contacts (restoring forces when pressing
against surfaces) and guiding contacts (pushes or pulls sampled from human
motion data). This formulation ensures kinematically consistent forces across
the shoulder, elbow, and wrist, while exposing the policy to diverse
interaction scenarios. Safety is further supported through task-adjustable
force thresholds. We evaluate our approach in both simulation and on the
Unitree G1 humanoid across tasks requiring different levels of compliance,
including gentle hugging, sit-to-stand assistance, and safe object
manipulation. Compared to baselines, our policy consistently reduces peak
contact forces while maintaining task success, resulting in smoother and more
natural interactions. These results highlight a step toward humanoid robots
that can safely and effectively collaborate with humans and handle objects in
real-world environments.
comment: Home page: https://gentle-humanoid.axell.top
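To make the unified spring-based contact formulation concrete, here is a minimal numerical sketch of a capped spring-damper contact force. The gains, damping, and force cap are illustrative assumptions rather than the paper's values, and the policy-side integration is omitted.

```python
import numpy as np

def spring_contact_force(pos, rest_pos, vel, k=80.0, d=5.0, f_max=30.0):
    """Unified spring-damper contact force with a task-adjustable magnitude cap,
    in the spirit of the paper's formulation (gains and cap are assumed values).
    The same form covers resistive and guiding contacts: the rest position is
    either a surface point or a target sampled from human motion data."""
    f = -k * (pos - rest_pos) - d * vel   # spring-damper restoring force
    norm = np.linalg.norm(f)
    if norm > f_max:                      # safety: clip to the force threshold
        f *= f_max / norm
    return f
```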
☆ Tracking and Understanding Object Transformations NeurIPS 2025
Real-world objects frequently undergo state transformations. From an apple
being cut into pieces to a butterfly emerging from its cocoon, tracking through
these changes is important for understanding real-world objects and dynamics.
However, existing methods often lose track of the target object after
transformation, due to significant changes in object appearance. To address
this limitation, we introduce the task of Track Any State: tracking objects
through transformations while detecting and describing state changes,
accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we
present TubeletGraph, a zero-shot system that recovers missing objects after
transformation and maps out how object states evolve over time.
TubeletGraph first identifies potentially overlooked tracks, and determines
whether they should be integrated based on semantic and proximity priors. Then,
it reasons about the added tracks and generates a state graph describing each
observed transformation. TubeletGraph achieves state-of-the-art tracking
performance under transformations, while demonstrating deeper understanding of
object transformations and promising capabilities in temporal grounding and
semantic reasoning for complex object transformations. Code, additional
results, and the benchmark dataset are available at
https://tubelet-graph.github.io.
comment: NeurIPS 2025
☆ InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation NeurIPS 2025
We introduce InfinityStar, a unified spacetime autoregressive framework for
high-resolution image and dynamic video synthesis. Building on the recent
success of autoregressive modeling in both vision and language, our purely
discrete approach jointly captures spatial and temporal dependencies within a
single architecture. This unified design naturally supports a variety of
generation tasks such as text-to-image, text-to-video, image-to-video, and long
interactive video synthesis via straightforward temporal autoregression.
Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench,
outperforming all autoregressive models by large margins, even surpassing some
diffusion competitors like HunyuanVideo. Without extra optimizations, our model
generates a 5s, 720p video approximately 10x faster than leading
diffusion-based methods. To our knowledge, InfinityStar is the first discrete
autoregressive video generator capable of producing industrial-level 720p
videos. We release all code and models to foster further research in efficient,
high-quality video generation.
comment: NeurIPS 2025 Oral
☆ X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du, Edward W. Duan, Wei-Chiu Ma, Kushal Kedia
Human videos can be recorded quickly and at scale, making them an appealing
source of training data for robot learning. However, humans and robots differ
fundamentally in embodiment, resulting in mismatched action execution. Direct
kinematic retargeting of human hand motion can therefore produce actions that
are physically infeasible for robots. Despite these low-level differences,
human demonstrations provide valuable motion cues about how to manipulate and
interact with objects. Our key idea is to exploit the forward diffusion
process: as noise is added to actions, low-level execution differences fade
while high-level task guidance is preserved. We present X-Diffusion, a
principled framework for training diffusion policies that maximally leverages
human data without learning dynamically infeasible motions. X-Diffusion first
trains a classifier to predict whether a noisy action is executed by a human or
robot. Then, a human action is incorporated into policy training only after
adding sufficient noise such that the classifier cannot discern its embodiment.
Actions consistent with robot execution supervise fine-grained denoising at low
noise levels, while mismatched human actions provide only coarse guidance at
higher noise levels. Our experiments show that naive co-training under
execution mismatches degrades policy performance, while X-Diffusion
consistently improves it. Across five manipulation tasks, X-Diffusion achieves
a 16% higher average success rate than the best baseline. The project website
is available at https://portal-cornell.github.io/X-Diffusion/.
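A rough sketch of the noise-gating idea described above, assuming a hypothetical `classifier(noisy_action, sigma)` that returns the probability the action came from a human; the threshold and noise schedule are illustrative, not the paper's settings.

```python
import numpy as np

def min_indistinguishable_noise(action, classifier, noise_levels, thresh=0.55):
    """Find the smallest noise level at which an embodiment classifier can no
    longer tell this human action apart from a robot one (hypothetical helper,
    not the authors' code). `classifier(noisy_action, sigma)` returns P(human)."""
    for sigma in sorted(noise_levels):
        noisy = action + sigma * np.random.randn(*action.shape)
        if classifier(noisy, sigma) < thresh:
            return sigma   # human action supervises denoising only at >= sigma
    return None            # never embodiment-ambiguous: exclude from training
```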
★ Cambrian-S: Towards Spatial Supersensing in Video
Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie
We argue that progress in true multimodal intelligence calls for a shift from
reactive, task-driven systems and brute-force long context towards a broader
paradigm of supersensing. We frame spatial supersensing as four stages beyond
linguistic-only understanding: semantic perception (naming what is seen),
streaming event cognition (maintaining memory across continuous experiences),
implicit 3D spatial cognition (inferring the world behind pixels), and
predictive world modeling (creating internal models that filter and organize
information). Current benchmarks largely test only the early stages, offering
narrow coverage of spatial cognition and rarely challenging models in ways that
require true world modeling. To drive progress in spatial supersensing, we
present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial
recall) and VSC (continual visual spatial counting). These tasks require
arbitrarily long video inputs yet are resistant to brute-force context
expansion. We then test data scaling limits by curating VSI-590K and training
Cambrian-S, achieving +30% absolute improvement on VSI-Bench without
sacrificing general capabilities. Yet performance on VSI-SUPER remains limited,
indicating that scale alone is insufficient for spatial supersensing. We
propose predictive sensing as a path forward, presenting a proof-of-concept in
which a self-supervised next-latent-frame predictor leverages surprise
(prediction error) to drive memory and event segmentation. On VSI-SUPER, this
approach substantially outperforms leading proprietary baselines, showing that
spatial supersensing requires models that not only see but also anticipate,
select, and organize experience.
comment: Website: https://cambrian-mllm.github.io/
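As a toy illustration of the surprise-driven idea (our own sketch, not the paper's memory system), a latent stream can be segmented wherever the next-latent-frame prediction error exceeds a threshold:

```python
def segment_by_surprise(pred_errors, thresh):
    """Split a stream into events wherever the prediction error ('surprise')
    of a next-latent-frame predictor exceeds a threshold; a minimal sketch
    with an assumed scalar error per timestep."""
    boundaries = [0]                       # index of the first event start
    for t, err in enumerate(pred_errors):
        if err > thresh:
            boundaries.append(t)           # high surprise opens a new event
    return boundaries
```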
☆ SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Despite impressive high-level video comprehension, multimodal language models
struggle with spatial reasoning across time and space. While current spatial
training approaches rely on real-world video data, obtaining diverse footage
with precise spatial annotations remains a bottleneck. To alleviate this
bottleneck, we present SIMS-V -- a systematic data-generation framework that
leverages the privileged information of 3D simulators to create spatially-rich
video training data for multimodal language models. Using this framework, we
investigate which properties of simulated data drive effective real-world
transfer through systematic ablations of question types, mixes, and scales. We
identify a minimal set of three question categories (metric measurement,
perspective-dependent reasoning, and temporal tracking) that prove most
effective for developing transferable spatial intelligence, outperforming
comprehensive coverage despite using fewer question types. These insights
enable highly efficient training: our 7B-parameter video LLM fine-tuned on just
25K simulated examples outperforms the larger 72B baseline and achieves
competitive performance with proprietary models on rigorous real-world spatial
reasoning benchmarks. Our approach demonstrates robust generalization,
maintaining performance on general video understanding while showing
substantial improvements on embodied and real-world spatial tasks.
comment: Project page: https://ellisbrown.github.io/sims-v
☆ Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions
Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyunjong Song, Guangyan Cai, Zhuo Xu, Xiaochen Hu, Changxi Zheng, Yunzhu Li
Robotic manipulation policies are advancing rapidly, but their direct
evaluation in the real world remains costly, time-consuming, and difficult to
reproduce, particularly for tasks involving deformable objects. Simulation
provides a scalable and systematic alternative, yet existing simulators often
fail to capture the coupled visual and physical complexity of soft-body
interactions. We present a real-to-sim policy evaluation framework that
constructs soft-body digital twins from real-world videos and renders robots,
objects, and environments with photorealistic fidelity using 3D Gaussian
Splatting. We validate our approach on representative deformable manipulation
tasks, including plush toy packing, rope routing, and T-block pushing,
demonstrating that simulated rollouts correlate strongly with real-world
execution performance and reveal key behavioral patterns of learned policies.
Our results suggest that combining physics-informed reconstruction with
high-quality rendering enables reproducible, scalable, and accurate evaluation
of robotic manipulation policies. Website: https://real2sim-eval.github.io/
comment: Website: https://real2sim-eval.github.io/
☆ Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
Robust benchmarks are crucial for evaluating Multimodal Large Language Models
(MLLMs). Yet we find that models can ace many multimodal benchmarks without
strong visual understanding, instead exploiting biases, linguistic priors, and
superficial patterns. This is especially problematic for vision-centric
benchmarks that are meant to require visual inputs. We adopt a diagnostic
principle for benchmark design: if a benchmark can be gamed, it will be.
Designers should therefore try to ``game'' their own benchmarks first, using
diagnostic and debiasing procedures to systematically identify and mitigate
non-visual biases. Effective diagnosis requires directly ``training on the test
set'' -- probing the released test set for its intrinsic, exploitable patterns.
We operationalize this standard with two components. First, we diagnose
benchmark susceptibility using a ``Test-set Stress-Test'' (TsT) methodology.
Our primary diagnostic tool involves fine-tuning a powerful Large Language
Model via $k$-fold cross-validation on exclusively the non-visual, textual
inputs of the test set to reveal shortcut performance and assign each sample a
bias score $s(x)$. We complement this with a lightweight Random Forest-based
diagnostic operating on hand-crafted features for fast, interpretable auditing.
Second, we debias benchmarks by filtering high-bias samples using an
``Iterative Bias Pruning'' (IBP) procedure. Applying this framework to four
benchmarks -- VSI-Bench, CV-Bench, MMMU, and VideoMME -- we uncover pervasive
non-visual biases. As a case study, we apply our full framework to create
VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider
vision-blind performance gap than the original.
comment: Project page: https://cambrian-mllm.github.io
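A minimal sketch of the lightweight Random Forest variant of the test-set stress test, assuming hand-crafted text-only features `X` and ground-truth answers `y`; the feature design and the exact definition of the bias score s(x) are assumptions on our part.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict

def text_only_bias_scores(X, y, n_splits=5):
    """Out-of-fold probability of the correct answer from non-visual features.
    High values flag samples that are solvable without looking at the image."""
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")
    classes = np.unique(y)                     # column order used by sklearn
    cols = np.searchsorted(classes, y)
    return proba[np.arange(len(y)), cols]      # bias score s(x) per sample
```

Samples whose score exceeds a chosen cutoff would then be candidates for a pruning pass in the spirit of Iterative Bias Pruning.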
☆ Polarization-resolved imaging improves eye tracking
Mantas Žurauskas, Tom Bu, Sanaz Alali, Beyza Kalkanli, Derek Shi, Fernando Alamos, Gauresh Pandit, Christopher Mei, Ali Behrooz, Ramin Mirjalili, Dave Stronks, Alexander Fix, Dmitri Model
Polarization-resolved near-infrared imaging adds a useful optical contrast
mechanism to eye tracking by measuring the polarization state of light
reflected by ocular tissues in addition to its intensity. In this paper we
demonstrate how this contrast can be used to enable eye tracking. Specifically,
we demonstrate that a polarization-enabled eye tracking (PET) system composed
of a polarization-filter-array camera paired with a linearly polarized
near-infrared illuminator can reveal trackable features across the sclera and
gaze-informative patterns on the cornea, largely absent in intensity-only
images. Across a cohort of 346 participants, convolutional neural network based
machine learning models trained on data from PET reduced the median
95th-percentile absolute gaze error by 10-16% relative to capacity-matched
intensity baselines under nominal conditions and in the presence of eyelid
occlusions, eye-relief changes, and pupil-size variation. These results link
light-tissue polarization effects to practical gains in human-computer
interaction and position PET as a simple, robust sensing modality for future
wearable devices.
☆ NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment
Video quality assessment (VQA) is vital for computer vision tasks, but
existing approaches face major limitations: full-reference (FR) metrics require
clean reference videos, and most no-reference (NR) models depend on training on
costly human opinion labels. Moreover, most opinion-unaware NR methods are
image-based, ignoring temporal context critical for video object detection. In
this work, we present a scalable, streaming-based VQA model that is both
no-reference and opinion-unaware. Our model leverages synthetic degradations of
the DAVIS dataset, training a temporal-aware convolutional architecture to
predict FR metrics (LPIPS, PSNR, SSIM) directly from degraded video, without
references at inference. We show that our streaming approach outperforms our
own image-based baseline by generalizing across diverse degradations,
underscoring the value of temporal modeling for scalable VQA in real-world
vision systems. Additionally, we demonstrate that our model achieves higher
correlation with full-reference metrics compared to BRISQUE, a widely-used
opinion-aware image quality assessment baseline, validating the effectiveness
of our temporal, opinion-unaware approach.
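For reference, the full-reference targets such a model regresses can be computed per frame pair with standard tooling; the sketch below covers PSNR and SSIM (LPIPS additionally requires a learned network such as the `lpips` package) and assumes float frames in [0, 1].

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def fr_targets(clean_frame, degraded_frame):
    """Full-reference training targets for one frame pair; frames are float
    arrays in [0, 1] with shape (H, W, 3). LPIPS is omitted here because it
    needs a pretrained perceptual network."""
    psnr = peak_signal_noise_ratio(clean_frame, degraded_frame, data_range=1.0)
    ssim = structural_similarity(clean_frame, degraded_frame,
                                 channel_axis=-1, data_range=1.0)
    return {"psnr": psnr, "ssim": ssim}
```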
☆ Building Trust in Virtual Immunohistochemistry: Automated Assessment of Image Quality
Tushar Kataria, Shikha Dubey, Mary Bronner, Jolanta Jedrzkiewicz, Ben J. Brintz, Shireen Y. Elhabian, Beatrice S. Knudsen
Deep learning models can generate virtual immunohistochemistry (IHC) stains
from hematoxylin and eosin (H&E) images, offering a scalable and low-cost
alternative to laboratory IHC. However, reliable evaluation of image quality
remains a challenge as current texture- and distribution-based metrics quantify
image fidelity rather than the accuracy of IHC staining. Here, we introduce an
automated and accuracy-grounded framework to determine image quality across
sixteen paired or unpaired image translation models. Using color deconvolution,
we generate masks of pixels stained brown (i.e., IHC-positive) as predicted by
each virtual IHC model. We use the segmented masks of real and virtual IHC to
compute stain accuracy metrics (Dice, IoU, Hausdorff distance) that directly
quantify correct pixel-level labeling without needing expert manual
annotations. Our results demonstrate that conventional image fidelity metrics,
including Frechet Inception Distance (FID), peak signal-to-noise ratio (PSNR),
and structural similarity (SSIM), correlate poorly with stain accuracy and
pathologist assessment. Paired models such as PyramidPix2Pix and AdaptiveNCE
achieve the highest stain accuracy, whereas unpaired diffusion- and GAN-based
models are less reliable in providing accurate IHC-positive pixel labels.
Moreover, whole-slide images (WSI) reveal performance declines that are
invisible in patch-based evaluations, emphasizing the need for WSI-level
benchmarks. Together, this framework defines a reproducible approach for
assessing the quality of virtual IHC models, a critical step to accelerate
translation towards routine use by pathologists.
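A compact sketch of the mask-based stain-accuracy idea using standard color deconvolution from scikit-image; the DAB threshold is an assumed example value, and the paper's exact deconvolution and metric settings may differ.

```python
import numpy as np
from skimage.color import rgb2hed

def ihc_positive_mask(rgb_image, dab_thresh=0.02):
    """Segment brown (DAB, IHC-positive) pixels via color deconvolution.
    `rgb_image` is an RGB patch (uint8 or float); threshold is illustrative."""
    dab = rgb2hed(rgb_image)[..., 2]   # channels: hematoxylin, eosin, DAB
    return dab > dab_thresh

def dice(mask_a, mask_b, eps=1e-8):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum() + eps)

# Stain-accuracy score between a real and a virtual IHC patch:
# score = dice(ihc_positive_mask(real_ihc), ihc_positive_mask(virtual_ihc))
```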
☆ PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning
While the Contrastive Language-Image Pretraining (CLIP) model has achieved
remarkable success in a variety of downstream vision-language understanding
tasks, enhancing its capability for fine-grained image-text alignment remains
an active research focus. To this end, most existing works adopt the strategy
of explicitly increasing the granularity of visual information processing,
e.g., incorporating visual prompts to guide the model to focus on specific
local regions within the image. Meanwhile, research on Multimodal Large
Language Models (MLLMs) has demonstrated that training with long and detailed
textual descriptions can effectively improve a model's fine-grained
vision-language alignment. However, the inherent token-length limitation of
CLIP's text encoder fundamentally restricts its ability to process the more
granular textual information embedded in long text sequences. To
synergistically leverage the advantages of enhancing
both visual and textual content processing granularity, we propose PixCLIP, a
novel framework designed to concurrently accommodate visual prompt inputs and
process lengthy textual descriptions. Specifically, we first establish an
automated annotation pipeline capable of generating pixel-level localized,
long-form textual descriptions for images. Utilizing this pipeline, we
construct LongGRIT, a high-quality dataset comprising nearly 1.5 million
samples. Second, we replace CLIP's original text encoder with an LLM and
propose a three-branch pixel-text alignment learning framework, facilitating
fine-grained alignment between image regions and corresponding textual
descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP
showcases breakthroughs in pixel-level interaction and handling long-form
texts, achieving state-of-the-art performance.
☆ UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction
Feed-forward 3D reconstruction for autonomous driving has advanced rapidly,
yet existing methods struggle with the joint challenges of sparse,
non-overlapping camera views and complex scene dynamics. We present UniSplat, a
general feed-forward framework that learns robust dynamic scene reconstruction
through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent
scaffold, a structured representation that captures geometric and semantic
scene context by leveraging pretrained foundation models. To effectively
integrate information across spatial views and temporal frames, we introduce an
efficient fusion mechanism that operates directly within the 3D scaffold,
enabling consistent spatio-temporal alignment. To ensure complete and detailed
reconstructions, we design a dual-branch decoder that generates dynamic-aware
Gaussians from the fused scaffold by combining point-anchored refinement with
voxel-based generation, and maintain a persistent memory of static Gaussians to
enable streaming scene completion beyond current camera coverage. Extensive
experiments on real-world datasets demonstrate that UniSplat achieves
state-of-the-art performance in novel view synthesis, while providing robust
and high-quality renderings even for viewpoints outside the original camera
coverage.
☆ Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper
Understanding the current capabilities and risks of AI Scientist systems is
essential for ensuring trustworthy and sustainable AI-driven scientific
progress while preserving the integrity of the academic ecosystem. To this end,
we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system
that mimics the core research workflow of a novice student researcher: Given
the baseline paper from the human mentor, it analyzes the paper's limitations,
formulates novel hypotheses for improvement, validates them through rigorous
experimentation, and writes a paper with the results. Unlike previous
approaches that assume full automation or operate on small-scale code, Jr. AI
Scientist follows a well-defined research workflow and leverages modern coding
agents to handle complex, multi-file implementations, leading to scientifically
valuable contributions. For evaluation, we conducted automated assessments
using AI Reviewers, author-led evaluations, and submissions to Agents4Science,
a venue dedicated to AI-driven scientific contributions. The findings
demonstrate that Jr. AI Scientist generates papers receiving higher review
scores than existing fully automated systems. Nevertheless, we identify
important limitations from both the author evaluation and the Agents4Science
reviews, indicating the potential risks of directly applying current AI
Scientist systems and key challenges for future research. Finally, we
comprehensively report various risks identified during development. We hope
these insights will deepen understanding of current progress and risks in AI
Scientist development.
comment: Issues, comments, and questions are all welcome in
https://github.com/Agent4Science-UTokyo/Jr.AI-Scientist
☆ Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
"Thinking with Text" and "Thinking with Images" paradigm significantly
improve the reasoning ability of large language models (LLMs) and Vision
Language Models (VLMs). However, these paradigms have inherent limitations. (1)
Images capture only single moments and fail to represent dynamic processes or
continuous changes, and (2) The separation of text and vision as distinct
modalities, hindering unified multimodal understanding and generation. To
overcome these limitations, we introduce "Thinking with Video", a new paradigm
that leverages video generation models, such as Sora-2, to bridge visual and
textual reasoning in a unified temporal framework. To support this exploration,
we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench
encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing
Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our
evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks,
Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even
surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric
tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU.
Furthermore, we systematically analyse the source of these abilities. We also
find that self-consistency and in-context learning can improve Sora-2's
performance. In summary, our findings demonstrate that video generation models
have the potential to serve as unified multimodal understanding and generation
models, positioning "thinking with video" as a unified multimodal reasoning
paradigm.
comment: 36 pages, 14 figures
☆ Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, Lixing Zou, Zhaoye Zhou, Gen Li, Bo Zhao
Vision-Language-Action (VLA) models have emerged as a powerful framework that
unifies perception, language, and control, enabling robots to perform diverse
tasks through multimodal understanding. However, current VLA models typically
contain massive numbers of parameters and rely heavily on large-scale robot
data pretraining, leading to high computational costs during training, as well
as limited deployability for real-time inference. Moreover, common training
paradigms often degrade the perceptual representations of the vision-language
backbone, resulting in overfitting and poor generalization to downstream tasks.
In this work, we present Evo-1, a lightweight VLA model that reduces
computation and improves deployment efficiency, while maintaining strong
performance without pretraining on robot data. Evo-1 builds on a native
multimodal Vision-Language model (VLM), incorporating a novel cross-modulated
diffusion transformer along with an optimized integration module, together
forming an effective architecture. We further introduce a two-stage training
paradigm that progressively aligns action with perception, preserving the
representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1
achieves state-of-the-art results on the Meta-World and RoboTwin suite,
surpassing the previous best models by 12.4% and 6.9%, respectively, and also
attains a competitive result of 94.8% on LIBERO. In real-world evaluations,
Evo-1 attains a 78% success rate with high inference frequency and low memory
overhead, outperforming all baseline methods. We release code, data, and model
weights to facilitate future research on lightweight and efficient VLA models.
comment: Github: https://github.com/MINT-SJTU/Evo-1
☆ Learning from Single Timestamps: Complexity Estimation in Laparoscopic Cholecystectomy
Dimitrios Anastasiou, Santiago Barbarisi, Lucy Culshaw, Jayna Patel, Evangelos B. Mazomenos, Imanol Luengo, Danail Stoyanov
Purpose: Accurate assessment of surgical complexity is essential in
Laparoscopic Cholecystectomy (LC), where severe inflammation is associated with
longer operative times and increased risk of postoperative complications. The
Parkland Grading Scale (PGS) provides a clinically validated framework for
stratifying inflammation severity; however, its automation in surgical videos
remains largely unexplored, particularly in realistic scenarios where complete
videos must be analyzed without prior manual curation. Methods: In this work,
we introduce STC-Net, a novel framework for Single-Timestamp-based Complexity
estimation in LC via the PGS, designed to operate under weak temporal
supervision. Unlike prior methods limited to static images or manually trimmed
clips, STC-Net operates directly on full videos. It jointly performs temporal
localization and grading through a localization, window proposal, and grading
module. We introduce a novel loss formulation combining hard and soft
localization objectives and background-aware grading supervision. Results:
Evaluated on a private dataset of 1,859 LC videos, STC-Net achieves an accuracy
of 62.11% and an F1-score of 61.42%, outperforming non-localized baselines by
over 10% in both metrics and highlighting the effectiveness of weak supervision
for surgical complexity assessment. Conclusion: STC-Net demonstrates a scalable
and effective approach for automated PGS-based surgical complexity estimation
from full LC videos, making it promising for post-operative analysis and
surgical training.
☆ THEval. Evaluation Framework for Talking Head Video Generation
Video generation has achieved remarkable progress, with generated videos
increasingly resembling real ones. However, the rapid advance in generation has
outpaced the development of adequate evaluation metrics. Currently, the
assessment of talking head generation primarily relies on a limited set of
metrics that evaluate general video quality and lip synchronization, and on
user studies. Motivated by this, we propose a new evaluation framework
comprising 8 metrics related to three dimensions: (i) quality, (ii)
naturalness, and (iii) synchronization. In selecting the metrics, we place
emphasis on efficiency as well as on alignment with human preferences. Based on
these considerations, we analyze the fine-grained dynamics of the head, mouth,
and eyebrows, as well as face quality. Our extensive experiments on 85,000
videos generated by
17 state-of-the-art models suggest that while many algorithms excel in lip
synchronization, they face challenges with generating expressiveness and
artifact-free details. These videos were generated from a novel real dataset
that we curated in order to mitigate training-data bias. Our
proposed benchmark framework is aimed at evaluating the improvement of
generative methods. Original code, dataset and leaderboards will be publicly
released and regularly updated with new methods, in order to reflect progress
in the field.
☆ $μ$NeuFMT: Optical-Property-Adaptive Fluorescence Molecular Tomography via Implicit Neural Representation
Shihan Zhao, Jianru Zhang, Yanan Wu, Linlin Li, Siyuan Shen, Xingjun Zhu, Guoyan Zheng, Jiahua Jiang, Wuwei Ren
Fluorescence Molecular Tomography (FMT) is a promising technique for
non-invasive 3D visualization of fluorescent probes, but its reconstruction
remains challenging due to the inherent ill-posedness and reliance on
inaccurate or often-unknown tissue optical properties. While deep learning
methods have shown promise, their supervised nature limits generalization
beyond training data. To address these problems, we propose $\mu$NeuFMT, a
self-supervised FMT reconstruction framework that integrates implicit
neural-based scene representation with explicit physical modeling of photon
propagation. Its key innovation lies in jointly optimizing both the fluorescence
distribution and the optical properties ($\mu$) during reconstruction,
eliminating the need for precise prior knowledge of tissue optics or
pre-conditioned training data. We demonstrate that $\mu$NeuFMT robustly
recovers accurate fluorophore distributions and optical coefficients even with
severely erroneous initial values (0.5$\times$ to 2$\times$ of ground truth).
Extensive numerical, phantom, and in vivo validations show that $\mu$NeuFMT
outperforms conventional and supervised deep learning approaches across diverse
heterogeneous scenarios. Our work establishes a new paradigm for robust and
accurate FMT reconstruction, paving the way for more reliable molecular imaging
in complex, clinically relevant scenarios such as fluorescence-guided surgery.
☆ Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks
Neural networks are widely used for image-related tasks but typically demand
considerable computing power. Once a network has been trained, however, its
memory- and compute-footprint can be reduced by compression. In this work, we
focus on compression through tensorization and low-rank representations.
Whereas classical approaches search for a low-rank approximation by minimizing
an isotropic norm such as the Frobenius norm in weight-space, we use
data-informed norms that measure the error in function space. Concretely, we
minimize the change in the layer's output distribution, which can be expressed
as $\lVert (W - \widetilde{W}) \Sigma^{1/2}\rVert_F$ where $\Sigma^{1/2}$ is
the square root of the covariance matrix of the layer's input and $W$,
$\widetilde{W}$ are the original and compressed weights. We propose new
alternating least square algorithms for the two most common tensor
decompositions (Tucker-2 and CPD) that directly optimize the new norm. Unlike
conventional compression pipelines, which almost always require
post-compression fine-tuning, our data-informed approach often achieves
competitive accuracy without any fine-tuning. We further show that the same
covariance-based norm can be transferred from one dataset to another with only
a minor accuracy drop, enabling compression even when the original training
dataset is unavailable. Experiments on several CNN architectures (ResNet-18/50,
and GoogLeNet) and datasets (ImageNet, FGVC-Aircraft, CIFAR-10, and CIFAR-100)
confirm the advantages of the proposed method.
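For the simplest (linear-layer) case, the data-informed objective above has a closed-form solution: whiten the weights with $\Sigma^{1/2}$, truncate with an SVD, and un-whiten. The sketch below only illustrates the norm; the paper's contribution is ALS algorithms for Tucker-2 and CPD factorizations of convolutional layers, which this does not reproduce.

```python
import numpy as np

def data_informed_lowrank(W, Sigma, rank, eps=1e-6):
    """Rank-r W_hat minimizing ||(W - W_hat) Sigma^{1/2}||_F for a linear layer."""
    # Symmetric square root of the input covariance (regularized for stability).
    evals, evecs = np.linalg.eigh(Sigma)
    S = evecs @ np.diag(np.sqrt(np.maximum(evals, eps))) @ evecs.T
    S_inv = np.linalg.inv(S)
    # Whiten, truncate with an ordinary SVD, then un-whiten.
    U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
    W_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank] @ S_inv
    return W_hat
```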
☆ Landslide Hazard Mapping with Geospatial Foundation Models: Geographical Generalizability, Data Scarcity, and Band Adaptability
Landslides cause severe damage to lives, infrastructure, and the environment,
making accurate and timely mapping essential for disaster preparedness and
response. However, conventional deep learning models often struggle when
applied across different sensors, regions, or under conditions of limited
training data. To address these challenges, we present a three-axis analytical
framework of sensor, label, and domain for adapting geospatial foundation
models (GeoFMs), focusing on Prithvi-EO-2.0 for landslide mapping. Through a
series of experiments, we show that it consistently outperforms task-specific
CNNs (U-Net, U-Net++), vision transformers (Segformer, SwinV2-B), and other
GeoFMs (TerraMind, SatMAE). The model, built on global pretraining,
self-supervision, and adaptable fine-tuning, proved resilient to spectral
variation, maintained accuracy under label scarcity, and generalized more
reliably across diverse datasets and geographic settings. Alongside these
strengths, we also highlight remaining challenges such as computational cost
and the limited availability of reusable AI-ready training data for landslide
research. Overall, our study positions GeoFMs as a step toward more robust and
scalable approaches for landslide risk reduction and environmental monitoring.
☆ V-Thinker: Interactive Thinking with Images
Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Honggang Zhang
Empowering Large Multimodal Models (LMMs) to deeply integrate image
interaction with long-horizon reasoning capabilities remains a long-standing
challenge in this field. Recent advances in vision-centric reasoning explore a
promising "Thinking with Images" paradigm for LMMs, marking a shift from
image-assisted reasoning to image-interactive thinking. While this milestone
enables models to focus on fine-grained image regions, progress remains
constrained by limited visual tool spaces and task-specific workflow designs.
To bridge this gap, we present V-Thinker, a general-purpose multimodal
reasoning assistant that enables interactive, vision-centric thinking through
end-to-end reinforcement learning. V-Thinker comprises two key components: (1)
a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies
interactive reasoning datasets across three dimensions-diversity, quality, and
difficulty; and (2) a Visual Progressive Training Curriculum that first aligns
perception via point-level supervision, then integrates interactive reasoning
through a two-stage reinforcement learning framework. Furthermore, we introduce
VTBench, an expert-verified benchmark targeting vision-centric interactive
reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently
outperforms strong LMM-based baselines in both general and interactive
reasoning scenarios, providing valuable insights for advancing
image-interactive reasoning applications.
comment: Work in progress
☆ Solving Convex Partition Visual Jigsaw Puzzles
Jigsaw puzzle solving requires the rearrangement of unordered pieces into
their original pose in order to reconstruct a coherent whole, often an image,
and is known to be an intractable problem. While the possible impact of
automatic puzzle solvers can be disruptive in various application domains, most
of the literature has focused on developing solvers for square jigsaw puzzles,
severely limiting their practical use. In this work, we significantly expand
the types of puzzles handled computationally, focusing on what is known as
Convex Partitions, a major subset of polygonal puzzles whose pieces are convex.
We utilize both geometrical and pictorial compatibilities, introduce a greedy
solver, and report several performance measures alongside the first benchmark
dataset of such puzzles.
☆ HideAndSeg: an AI-based tool with automated prompting for octopus segmentation in natural habitats
Analyzing octopuses in their natural habitats is challenging due to their
camouflage capability, rapid changes in skin texture and color, non-rigid body
deformations, and frequent occlusions, all of which are compounded by variable
underwater lighting and turbidity. Addressing the lack of large-scale annotated
datasets, this paper introduces HideAndSeg, a novel, minimally supervised
AI-based tool for segmenting videos of octopuses. It establishes a quantitative
baseline for this task. HideAndSeg integrates SAM2 with a custom-trained
YOLOv11 object detector. First, the user provides point coordinates to generate
the initial segmentation masks with SAM2. These masks serve as training data
for the YOLO model. After that, our approach fully automates the pipeline by
providing a bounding box prompt to SAM2, eliminating the need for further
manual intervention. We introduce two unsupervised metrics - temporal
consistency $DICE_t$ and new component count $NC_t$ - to quantitatively
evaluate segmentation quality and guide mask refinement in the absence of
ground-truth data, i.e., real-world information that serves to train, validate,
and test AI models. Results show that HideAndSeg achieves satisfactory
performance, reducing segmentation noise compared to the manually prompted
approach. Our method can re-identify and segment the octopus even after periods
of complete occlusion in natural environments, a scenario in which the manually
prompted model fails. By reducing the need for manual analysis in real-world
scenarios, this work provides a practical tool that paves the way for more
efficient behavioral studies of wild cephalopods.
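Minimal sketches of the two unsupervised metrics, assuming binary masks for consecutive frames; the paper's exact definitions of $DICE_t$ and $NC_t$ may differ from these interpretations.

```python
import numpy as np
from scipy import ndimage

def dice_t(mask_prev, mask_curr, eps=1e-8):
    """Temporal-consistency Dice between masks of consecutive frames."""
    inter = np.logical_and(mask_prev, mask_curr).sum()
    return 2.0 * inter / (mask_prev.sum() + mask_curr.sum() + eps)

def nc_t(mask_prev, mask_curr):
    """Count connected components in the current mask with no overlap in the
    previous mask (an illustrative reading of the new-component count)."""
    labels, n = ndimage.label(mask_curr)
    return sum(1 for i in range(1, n + 1)
               if not np.logical_and(labels == i, mask_prev).any())
```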
☆ On the Equivalence of Regression and Classification
A formal link between regression and classification has been tenuous. Even
though the margin maximization term $\|w\|$ is used in support vector
regression, it has at best been justified as a regularizer. We show that a
regression problem with $M$ samples lying on a hyperplane has a one-to-one
equivalence with a linearly separable classification task with $2M$ samples. We
show that margin maximization on the equivalent classification task leads to a
different regression formulation than traditionally used. Using the
equivalence, we demonstrate a ``regressability'' measure that can be used to
estimate the difficulty of regressing a dataset, without needing to first learn
a model for it. We use the equivalence to train neural networks to learn a
linearizing map, that transforms input variables into a space where a linear
regressor is adequate.
comment: 19 pages
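The abstract does not spell out the mapping; one standard construction that realizes a regression-to-classification lift (included here purely as an illustrative assumption, not necessarily the paper's construction) doubles each sample into a pair of oppositely labeled points in the joint input-target space:

```latex
% Each regression sample is split into two labeled points offset by eps > 0
% along the target axis in the joint (x, y) space:
(x_i, y_i) \;\longmapsto\;
  \bigl((x_i,\, y_i + \epsilon),\, +1\bigr)
  \quad\text{and}\quad
  \bigl((x_i,\, y_i - \epsilon),\, -1\bigr),
  \qquad i = 1, \dots, M .
% If y_i = w^\top x_i + b for every i, the hyperplane
%   \{(x, y) : y - w^\top x - b = 0\}
% separates the resulting 2M points, so the lifted task is linearly separable.
```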
☆ DORAEMON: A Unified Library for Visual Object Modeling and Representation Learning at Scale
DORAEMON is an open-source PyTorch library that unifies visual object
modeling and representation learning across diverse scales. A single
YAML-driven workflow covers classification, retrieval and metric learning; more
than 1000 pretrained backbones are exposed through a timm-compatible interface,
together with modular losses, augmentations and distributed-training utilities.
Reproducible recipes match or exceed reference results on ImageNet-1K,
MS-Celeb-1M, and Stanford Online Products, while one-command export to ONNX or
HuggingFace bridges research and deployment. By consolidating datasets, models,
and training techniques into one platform, DORAEMON offers a scalable
foundation for rapid experimentation in visual recognition and representation
learning, enabling efficient transfer of research advances to real-world
applications. The repository is available at https://github.com/wuji3/DORAEMON.
comment: code: https://github.com/wuji3/DORAEMON
☆ BoRe-Depth: Self-supervised Monocular Depth Estimation with Boundary Refinement for Embedded Systems IROS 2025
Depth estimation is one of the key technologies for realizing 3D perception
in unmanned systems. Monocular depth estimation has been widely researched
because of its low-cost advantage, but the existing methods face the challenges
of poor depth estimation performance and blurred object boundaries on embedded
systems. In this paper, we propose a novel monocular depth estimation model,
BoRe-Depth, which contains only 8.7M parameters. It can accurately estimate
depth maps on embedded systems and significantly improves boundary quality.
Firstly, we design an Enhanced Feature Adaptive Fusion Module (EFAF) which
adaptively fuses depth features to enhance boundary detail representation.
Secondly, we integrate semantic knowledge into the encoder to improve the
object recognition and boundary perception capabilities. Finally, BoRe-Depth is
deployed on NVIDIA Jetson Orin, and runs efficiently at 50.7 FPS. We
demonstrate that the proposed model significantly outperforms previous
lightweight models on multiple challenging datasets, and we provide detailed
ablation studies for the proposed methods. The code is available at
https://github.com/liangxiansheng093/BoRe-Depth.
comment: 8 pages, 5 figures, published to IROS 2025
☆ Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA
We present a multi-task framework for the MediaEval Medico 2025 challenge,
leveraging a LoRA-tuned Florence-2 model for simultaneous visual question
answering (VQA), explanation generation, and visual grounding. The proposed
system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer
learning, (2) a synthetically enriched explanation dataset offering structured
medical reasoning, and (3) text-to-region pairs linking visual features with
segmentation masks. This multi-task setup enables the model to jointly learn
visual grounding, reasoning, and interpretation, producing responses that are
both accurate and interpretable. Extensive evaluation demonstrates that our
approach substantially improves over single-task baselines in both answer
accuracy and visual localization, highlighting the effectiveness of grounded
multi-task learning for medical VQA applications.
comment: This is a working paper submitted for Medico 2025: Visual Question
Answering (with multimodal explanations) for Gastrointestinal Imaging at
MediaEval 2025. 5 pages, 3 figures and 1 table
☆ GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies
Deploying autonomous robots that can learn new skills from demonstrations is
an important challenge of modern robotics. Existing solutions often apply
end-to-end imitation learning with Vision-Language Action (VLA) models or
symbolic approaches with Action Model Learning (AML). On the one hand, current
VLA models are limited by the lack of high-level symbolic planning, which
hinders their abilities in long-horizon tasks. On the other hand, symbolic
approaches in AML lack generalization and scalability perspectives. In this
paper we present a new neuro-symbolic approach, GraSP-VLA, a framework that
uses a Continuous Scene Graph representation to generate a symbolic
representation of human demonstrations. This representation is used to generate
new planning domains during inference and serves as an orchestrator for
low-level VLA policies, scaling up the number of actions that can be reproduced
in a row. Our results show that GraSP-VLA is effective for modeling symbolic
representations on the task of automatic planning domain generation from
observations. In addition, results on real-world experiments show the potential
of our Continuous Scene Graph representation to orchestrate low-level VLA
policies in long-horizon tasks.
☆ A MATLAB tutorial on deep feature extraction combined with chemometrics for analytical applications
Background In analytical chemistry, spatial information about materials is
commonly captured through imaging techniques, such as traditional color cameras
or with advanced hyperspectral cameras and microscopes. However, efficiently
extracting and analyzing this spatial information for exploratory and
predictive purposes remains a challenge, especially when using traditional
chemometric methods. Recent advances in deep learning and artificial
intelligence have significantly enhanced image processing capabilities,
enabling the extraction of multiscale deep features that are otherwise
challenging to capture with conventional image processing techniques. Despite
the wide availability of open-source deep learning models, adoption in
analytical chemistry remains limited because of the absence of structured,
step-by-step guidance for implementing these models.
Results This tutorial aims to bridge this gap by providing a step-by-step
guide for applying deep learning approaches to extract spatial information from
imaging data and integrating it with other data sources, such as spectral
information. Importantly, the focus of this work is not on training deep
learning models for image processing but on using existing open source models
to extract deep features from imaging data.
Significance The tutorial provides MATLAB code demonstrations showcasing the
processing of imaging data from various imaging modalities commonly encountered
in analytical chemistry. Readers can run the tutorial steps on their own
datasets using the code presented in this tutorial.
☆ Evaluating the Impact of Weather-Induced Sensor Occlusion on BEVFusion for 3D Object Detection
Accurate 3D object detection is essential for automated vehicles to navigate
safely in complex real-world environments. Bird's Eye View (BEV)
representations, which project multi-sensor data into a top-down spatial
format, have emerged as a powerful approach for robust perception. Although
BEV-based fusion architectures have demonstrated strong performance through
multimodal integration, the effects of sensor occlusions, caused by
environmental conditions such as fog, haze, or physical obstructions, on 3D
detection accuracy remain underexplored. In this work, we investigate the
impact of occlusions on both camera and Light Detection and Ranging (LiDAR)
outputs using the BEVFusion architecture, evaluated on the nuScenes dataset.
Detection performance is measured using mean Average Precision (mAP) and the
nuScenes Detection Score (NDS). Our results show that moderate camera
occlusions lead to a 41.3% drop in mAP (from 35.6% to 20.9%) when detection is
based only on the camera. On the other hand, LiDAR sharply drops in performance
only under heavy occlusion, with mAP falling by 47.3% (from 64.7% to 34.1%),
with a severe impact on long-range detection. In fused settings, the effect
depends on which sensor is occluded: occluding the camera leads to a minor 4.1%
drop (from 68.5% to 65.7%), while occluding LiDAR results in a larger 26.8%
drop (to 50.1%), revealing the model's stronger reliance on LiDAR for the task
of 3D object detection. Our results highlight the need for future research into
occlusion-aware evaluation methods and improved sensor fusion techniques that
can maintain detection accuracy in the presence of partial sensor failure or
degradation due to adverse environmental conditions.
☆ Comparative Study of CNN Architectures for Binary Classification of Horses and Motorcycles in the VOC 2008 Dataset
This paper presents a comprehensive evaluation of nine convolutional neural
network architectures for binary classification of horses and motorcycles in
the VOC 2008 dataset. We address the significant class imbalance problem by
implementing minority-class augmentation techniques. Our experiments compare
modern architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and
Vision Transformer across multiple performance metrics. Results demonstrate
substantial performance variations, with ConvNeXt-Tiny achieving the highest
Average Precision (AP) of 95.53% for horse detection and 89.12% for motorcycle
detection. We observe that data augmentation significantly improves minority
class detection, particularly benefiting deeper architectures. This study
provides insights into architecture selection for imbalanced binary
classification tasks and quantifies the impact of data augmentation strategies
in mitigating class imbalance issues in object detection.
☆ Submanifold Sparse Convolutional Networks for Automated 3D Segmentation of Kidneys and Kidney Tumours in Computed Tomography
The accurate delineation of tumours in radiological images like Computed
Tomography is a very specialised and time-consuming task, and currently a
bottleneck preventing quantitative analyses to be performed routinely in the
clinical setting. For this reason, developing methods for the automated
segmentation of tumours in medical imaging is of the utmost importance and has
driven significant efforts in recent years. However, working with full 3D scans
is often impractical given the large number of voxels to be analysed, usually
requiring such images to be downsampled or processed in patches when applying
traditional convolutional neural networks. To overcome this problem, in this
paper we propose a new two-stage methodology that combines voxel sparsification
with submanifold sparse convolutional networks. This method
allows segmentations to be performed with high-resolution inputs and a native
3D model architecture, obtaining state-of-the-art accuracies while
significantly reducing the computational resources needed in terms of GPU
memory and time. We studied the deployment of this methodology in the context
of Computed Tomography images of renal cancer patients from the KiTS23
challenge, and our method achieved results competitive with the challenge
winners, with Dice similarity coefficients of 95.8% for kidneys + masses, 85.7%
for tumours + cysts, and 80.3% for tumours alone. Crucially, our method also
offers significant computational improvements, achieving up to a 60% reduction
in inference time and up to a 75% reduction in VRAM usage compared to an
equivalent dense architecture, across both CPU and various GPU cards tested.
comment: 12 pages, 5 figures
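A small sketch of the first stage (voxel sparsification) under assumed HU-window parameters; the coordinate/feature output format is what submanifold sparse convolution libraries typically consume, but this is not the authors' code.

```python
import numpy as np

def sparsify_ct_volume(volume_hu, hu_min=-200, hu_max=500):
    """Keep only voxels inside a soft-tissue HU window and return them as
    (coords, features), the sparse-tensor input format used by submanifold
    sparse convolution libraries. The HU window is an assumed example."""
    keep = (volume_hu >= hu_min) & (volume_hu <= hu_max)
    coords = np.argwhere(keep)                           # (N, 3) voxel indices
    feats = volume_hu[keep].astype(np.float32)[:, None]  # (N, 1) intensities
    return coords, feats
```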
☆ RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation
Xiangjun Zhang, Litong Gong, Yinglin Zheng, Yansong Liu, Wentao Jiang, Mingyi Xu, Biao Wang, Tiezheng Ge, Ming Zeng
Most text-to-video(T2V) diffusion models depend on pre-trained text encoders
for semantic alignment, yet they often fail to maintain video quality when
provided with concise prompts rather than well-designed ones. The primary issue
lies in their limited understanding of textual semantics. Moreover, these text
encoders cannot rephrase prompts online to better align with user intentions,
which limits both the scalability and usability of the models. To address these
challenges, we introduce RISE-T2V, which uniquely integrates the processes of
prompt rephrasing and semantic feature extraction into a single and seamless
step instead of two separate steps. RISE-T2V is universal and can be applied to
various pre-trained LLMs and video diffusion models(VDMs), significantly
enhancing their capabilities for T2V tasks. We propose an innovative module
called the Rephrasing Adapter, which enables diffusion models to utilize the
text hidden states produced during the LLM's next-token prediction as a
condition for video
generation. By employing a Rephrasing Adapter, the video generation model can
implicitly rephrase basic prompts into more comprehensive representations that
better match the user's intent. Furthermore, we leverage the powerful
capabilities of LLMs to enable video generation models to accomplish a broader
range of T2V tasks. Extensive experiments demonstrate that RISE-T2V is a
versatile framework applicable to different video diffusion model
architectures, significantly enhancing the ability of T2V models to generate
high-quality videos that align with user intent. Visual results are available
on the webpage at https://rise-t2v.github.io.
comment: 17 pages, 16 figures
☆ Deep learning-based object detection of offshore platforms on Sentinel-1 Imagery and the impact of synthetic training data
The recent and ongoing expansion of marine infrastructure, including offshore
wind farms, oil and gas platforms, artificial islands, and aquaculture
facilities, highlights the need for effective monitoring systems. The
development of robust models for offshore infrastructure detection relies on
comprehensive, balanced datasets, but falls short when samples are scarce,
particularly for underrepresented object classes, shapes, and sizes. By
training deep learning-based YOLOv10 object detection models with a combination
of synthetic and real Sentinel-1 satellite imagery acquired in the fourth
quarter of 2023 from four regions (Caspian Sea, South China Sea, Gulf of
Guinea, and Coast of Brazil), this study investigates the use of synthetic
training data to enhance model performance. We evaluated this approach by
applying the model to detect offshore platforms in three unseen regions (Gulf
of Mexico, North Sea, Persian Gulf) and thereby assess geographic
transferability. This region-holdout evaluation demonstrated that the model
generalises beyond the training areas. In total, 3,529 offshore platforms were
detected, including 411 in the North Sea, 1,519 in the Gulf of Mexico, and
1,593 in the Persian Gulf. The model achieved an F1 score of 0.85, which
improved to 0.90 upon incorporating synthetic data. We analysed how synthetic
data enhances the representation of unbalanced classes and overall model
performance, taking a first step toward globally transferable detection of
offshore infrastructure. This study underscores the importance of balanced
datasets and highlights synthetic data generation as an effective strategy to
address common challenges in remote sensing, demonstrating the potential of
deep learning for scalable, global offshore infrastructure monitoring.
comment: 14 pages, 9 figures
☆ Vision Foundation Models in Agriculture: Toward Domain-Specific Adaptation for Weed Herbicide Trials Assessment
Leire Benito-Del-Valle, Artzai Picón, Daniel Mugica, Manuel Ramos, Eva Portillo, Javier Romero, Carlos Javier Jimenez, Ramón Navarra-Mestre
Herbicide field trials require accurate identification of plant species and
assessment of herbicide-induced damage across diverse environments. While
general-purpose vision foundation models have shown promising results in
complex visual domains, their performance can be limited in agriculture, where
fine-grained distinctions between species and damage types are critical.
In this work, we adapt a general-purpose vision foundation model to herbicide
trial characterization. Trained using a self-supervised learning approach on a
large, curated agricultural dataset, the model learns rich and transferable
representations optimized for herbicide trial images.
Our domain-specific model significantly outperforms the best general-purpose
foundation model in both species identification (F1 score improvement from 0.91
to 0.94) and damage classification (from 0.26 to 0.33). Under unseen conditions
(new locations and different time periods), it achieves even greater gains (species
identification from 0.56 to 0.66; damage classification from 0.17 to 0.27). In
domain-shift scenarios, such as drone imagery, it maintains strong performance
(species classification from 0.49 to 0.60).
Additionally, we show that domain-specific pretraining enhances segmentation
accuracy, particularly in low-annotation regimes. An annotation-efficiency
analysis reveals that, under unseen conditions, the domain-specific model
achieves 5.4% higher F1 score than the general-purpose model, while using 80%
fewer labeled samples.
These results demonstrate the generalization capabilities of domain-specific
foundation models and their potential to significantly reduce manual annotation
efforts, offering a scalable and automated solution for herbicide trial
analysis.
☆ FastGS: Training 3D Gaussian Splatting in 100 Seconds
The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to
properly regulate the number of Gaussians during training, causing redundant
computational time overhead. In this paper, we propose FastGS, a novel, simple,
and general acceleration framework that fully considers the importance of each
Gaussian based on multi-view consistency, efficiently solving the trade-off
between training time and rendering quality. We innovatively design a
densification and pruning strategy based on multi-view consistency, dispensing
with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks &
Temples, and Deep Blending datasets demonstrate that our method significantly
outperforms the state-of-the-art methods in training speed, achieving a
3.32$\times$ training acceleration and comparable rendering quality compared
with DashGaussian on the Mip-NeRF 360 dataset and a 15.45$\times$ acceleration
compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that
FastGS exhibits strong generality, delivering 2-7$\times$ training acceleration
across various tasks, including dynamic scene reconstruction, surface
reconstruction, sparse-view reconstruction, large-scale reconstruction, and
simultaneous localization and mapping. The project page is available at
https://fastgs.github.io/
comment: Project page: https://fastgs.github.io/
☆ DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification
Video-based Visible-Infrared person re-identification (VVI-ReID) aims to
retrieve the same pedestrian across visible and infrared modalities from video
sequences. Existing methods tend to exploit modality-invariant visual features
but largely overlook gait features, which are not only modality-invariant but
also rich in temporal dynamics, thus limiting their ability to model the
spatiotemporal consistency essential for cross-modal video matching. To address
these challenges, we propose a DINOv2-Driven Gait Representation Learning
(DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn
gait features complementary to appearance cues, facilitating robust
sequence-level representations for cross-modal retrieval. Specifically, we
introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which
generates and enhances silhouette representations with general-purpose semantic
priors from DINOv2 and jointly optimizes them with the ReID objective to
achieve semantically enriched and task-adaptive gait feature learning.
Furthermore, we develop a Progressive Bidirectional Multi-Granularity
Enhancement (PBMGE) module, which progressively refines feature representations
by enabling bidirectional interactions between gait and appearance streams
across multiple spatial granularities, fully leveraging their complementarity
to enhance global representations with rich local details and produce highly
discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets
demonstrate the superiority of our approach, significantly outperforming
existing state-of-the-art methods.
☆ Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery
The growing sophistication of synthetic image and deepfake generation models
has turned source attribution and authenticity verification into a critical
challenge for modern computer vision systems. Recent studies suggest that
diffusion pipelines unintentionally imprint persistent statistical traces,
known as signal leaks, within their outputs, particularly in latent
representations. Building on this observation, we propose Proto-LeakNet, a
signal-leak-aware and interpretable attribution framework that integrates
closed-set classification with a density-based open-set evaluation on the
learned embeddings, enabling analysis of unseen generators without retraining.
Operating in the latent domain of diffusion models, our method re-simulates
partial forward diffusion to expose residual generator-specific cues. A
temporal attention encoder aggregates multi-step latent features, while a
feature-weighted prototype head structures the embedding space and enables
transparent attribution. Trained solely on closed data and achieving a Macro
AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under
post-processing, surpassing state-of-the-art methods, and achieves strong
separability between known and unseen generators. These results demonstrate
that modeling signal-leak bias in latent space enables reliable and
interpretable AI-image and deepfake forensics. The code for the whole work will
be available upon submission.
comment: 13 pages, 6 figures, 5 tables
☆ MedSapiens: Taking a Pose to Rethink Medical Imaging Landmark Detection
Marawan Elbatel, Anbang Wang, Keyuan Liu, Kaouther Mouheb, Enrique Almar-Munoz, Lizhuo Lin, Yanqi Yang, Karim Lekadir, Xiaomeng Li
This paper does not introduce a novel architecture; instead, it revisits a
fundamental yet overlooked baseline: adapting human-centric foundation models
for anatomical landmark detection in medical imaging. While landmark detection
has traditionally relied on domain-specific models, the emergence of
large-scale pre-trained vision models presents new opportunities. In this
study, we investigate the adaptation of Sapiens, a human-centric foundation
model designed for pose estimation, to medical imaging through multi-dataset
pretraining, establishing a new state of the art across multiple datasets. Our
proposed model, MedSapiens, demonstrates that human-centric foundation models,
inherently optimized for spatial pose localization, provide strong priors for
anatomical landmark detection, yet this potential has remained largely
untapped. We benchmark MedSapiens against existing state-of-the-art models,
achieving up to 5.26% improvement over generalist models and up to 21.81%
improvement over specialist models in the average success detection rate (SDR).
To further assess MedSapiens' adaptability to novel downstream tasks with few
annotations, we evaluate its performance in limited-data settings, achieving
2.69% improvement over the few-shot state of the art in SDR. Code and model
weights are available at https://github.com/xmed-lab/MedSapiens .
☆ AStF: Motion Style Transfer via Adaptive Statistics Fusor
Human motion style transfer allows characters to appear less rigid and more
realistic by adopting a specific style. Traditional arbitrary image style
transfer typically processes the mean and variance of features, which has
proven effective, and similar methods have been adapted for motion style
transfer. However, due to the fundamental differences between images and
motion, relying on mean and variance alone is insufficient to fully capture the
complex dynamic patterns and spatiotemporal coherence properties of motion
data. Building upon this, our key insight is to bring two additional
statistics, skewness and kurtosis, into the analysis of motion style.
Specifically, we propose a novel Adaptive Statistics
Fusor (AStF) which consists of Style Disentanglement Module (SDM) and
High-Order Multi-Statistics Attention (HOS-Attn). We trained our AStF in
conjunction with a Motion Consistency Regularization (MCR) discriminator.
Experimental results show that, by providing a more comprehensive model of the
spatiotemporal statistical patterns inherent in dynamic styles, our proposed
AStF outperforms state-of-the-art methods in motion style transfer. Our code
and model are available at
https://github.com/CHMimilanlan/AStF.
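For readers who want to see what the two extra statistics add, the sketch below computes per-channel mean, standard deviation, skewness, and kurtosis of a motion feature tensor; the tensor shapes are assumed and this is not the authors' AStF code, which fuses these moments through learned attention.

```python
# Minimal sketch of the higher-order motion statistics referenced above.
import torch

def motion_statistics(x, eps=1e-6):
    """x: (batch, channels, time) -> per-channel mean, std, skewness, excess kurtosis."""
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True) + eps
    z = (x - mean) / std
    skew = (z ** 3).mean(dim=-1, keepdim=True)
    kurt = (z ** 4).mean(dim=-1, keepdim=True) - 3.0
    return mean, std, skew, kurt

content = torch.randn(2, 64, 120)   # hypothetical content motion features
style = torch.randn(2, 64, 120)     # hypothetical style motion features
c_stats = motion_statistics(content)
s_stats = motion_statistics(style)
# An AdaIN-style transfer would re-normalize `content` with the style statistics;
# AStF instead fuses all four moments via its HOS-Attn module.
```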
☆ Covariance Descriptors Meet General Vision Encoders: Riemannian Deep Learning for Medical Image Classification IEEE
Covariance descriptors capture second-order statistics of image features.
They have shown strong performance in general computer vision tasks, but remain
underexplored in medical imaging. We investigate their effectiveness for both
conventional and learning-based medical image classification, with a particular
focus on SPDNet, a classification network specifically designed for symmetric
positive definite (SPD) matrices. We propose constructing covariance
descriptors from features extracted by pre-trained general vision encoders
(GVEs) and comparing them with handcrafted descriptors. Two GVEs - DINOv2 and
MedSAM - are evaluated across eleven binary and multi-class datasets from the
MedMNIST benchmark. Our results show that covariance descriptors derived from
GVE features consistently outperform those derived from handcrafted features.
Moreover, SPDNet yields superior performance to state-of-the-art methods when
combined with DINOv2 features. Our findings highlight the potential of
combining covariance descriptors with powerful pretrained vision encoders for
medical image analysis.
comment: Preprint. Submitted to the IEEE International Symposium on Biomedical
Imaging (ISBI) 2026
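For context, a covariance descriptor is the feature covariance matrix of a set of local feature vectors, which becomes symmetric positive definite after light regularization. The sketch below builds one from a stand-in grid of encoder patch tokens; the feature source (e.g. DINOv2 or MedSAM) and the regularization constant are assumptions.

```python
# Minimal sketch of building an SPD covariance descriptor from patch features.
import torch

def covariance_descriptor(tokens, eps=1e-5):
    """tokens: (num_patches, dim) -> (dim, dim) SPD matrix."""
    centered = tokens - tokens.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (tokens.shape[0] - 1)
    return cov + eps * torch.eye(cov.shape[0])   # keep it strictly positive definite

tokens = torch.randn(196, 64)   # stand-in for frozen-encoder patch features
spd = covariance_descriptor(tokens)
print(spd.shape, torch.linalg.eigvalsh(spd).min() > 0)   # SPD input for SPDNet
```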
☆ Systematic Evaluation of Preprocessing Techniques for Accurate Image Registration in Digital Pathology
Image registration refers to the process of spatially aligning two or more
images by mapping them into a common coordinate system, so that corresponding
anatomical or tissue structures are matched across images. In digital
pathology, registration enables direct comparison and integration of
information from different stains or imaging modalities, supporting
applications such as biomarker analysis and tissue reconstruction. Accurate
registration of images from different modalities is an essential step in
digital pathology. In this study, we investigated how various color
transformation techniques affect image registration between hematoxylin and
eosin (H&E) stained images and non-linear multimodal images. We used a dataset
of 20 tissue sample pairs, with each pair undergoing several preprocessing
steps, including different color transformations (CycleGAN, Macenko, Reinhard,
Vahadane), inversion, contrast adjustment, intensity normalization, and
denoising. All images were registered using the VALIS registration method,
which first applies rigid registration and then performs non-rigid registration
in two steps on both low and high-resolution images. Registration performance
was evaluated using the relative Target Registration Error (rTRE). We reported
the median of median rTRE values (MMrTRE) and the average of median rTRE values
(AMrTRE) for each method. In addition, we performed a custom point-based
evaluation using ten manually selected key points. Registration was done
separately for two scenarios, using either the original or inverted multimodal
images. In both scenarios, CycleGAN color transformation achieved the lowest
registration errors, while the other methods showed higher errors. These
findings show that applying color transformation before registration improves
alignment between images from different modalities and supports more reliable
analysis in digital pathology.
comment: 14 pages, 7 Figures
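The rTRE aggregation used above can be reproduced in a few lines: compute per-landmark errors normalized by the image diagonal, take the median per image pair, then the median (MMrTRE) and mean (AMrTRE) of those medians. The landmark arrays below are synthetic placeholders.

```python
# Minimal sketch of rTRE, MMrTRE, and AMrTRE computation with placeholder landmarks.
import numpy as np

def rtre(fixed_pts, warped_pts, image_diagonal):
    """Relative TRE: Euclidean landmark error normalized by the image diagonal."""
    return np.linalg.norm(fixed_pts - warped_pts, axis=1) / image_diagonal

rng = np.random.default_rng(0)
pairs = [(rng.uniform(0, 1000, (10, 2)),    # fixed landmarks
          rng.uniform(0, 1000, (10, 2)),    # registered (warped) landmarks
          np.hypot(1000, 1000))             # image diagonal in pixels
         for _ in range(20)]

pair_medians = [np.median(rtre(f, w, d)) for f, w, d in pairs]
mmrtre = np.median(pair_medians)   # median of median rTRE values
amrtre = np.mean(pair_medians)     # average of median rTRE values
```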
☆ Seeing Straight: Document Orientation Detection for Efficient OCR
Suranjan Goswami, Abhinav Ravi, Raja Kolla, Ali Faraz, Shaharukh Khan, Akash, Chandra Khatri, Shubham Agarwal
Despite significant advances in document understanding, determining the
correct orientation of scanned or photographed documents remains a critical
pre-processing step in real-world settings. Accurate rotation correction is
essential for enhancing the performance of downstream tasks such as Optical
Character Recognition (OCR) where misalignment commonly arises due to user
errors, particularly incorrect base orientations of the camera during capture.
In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for
evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from
rotation-transformed structured and free-form English OCR datasets, and (ii)
ORB-Indic, a novel multilingual set spanning 11 mid- to low-resource Indic
languages. We also present a fast, robust and lightweight rotation
classification pipeline built on the vision encoder of Phi-3.5-Vision model
with dynamic image cropping, fine-tuned specifically for the 4-class rotation
task in a standalone fashion. Our method achieves near-perfect accuracy of 96%
and 92% in identifying rotations on the two datasets, respectively. Beyond
classification, we demonstrate the critical role of our module in boosting OCR
performance: closed-source (up to 14%) and open-weights models (up to 4x) in
the simulated real-world setting.
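The 4-class rotation setup is easy to emulate: rotate each page by 0/90/180/270 degrees, label the applied rotation, and train a classifier to predict the rotation to undo. The sketch below uses a tiny CNN stand-in rather than the Phi-3.5-Vision encoder described above.

```python
# Minimal sketch of 4-class rotation data generation and classification (stand-in model).
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

ANGLES = [0, 90, 180, 270]

def make_rotation_batch(img):
    """img: (3, H, W) tensor -> batch of 4 rotated copies and their class labels."""
    xs = [TF.resize(TF.rotate(img, a, expand=True), [224, 224]) for a in ANGLES]
    return torch.stack(xs), torch.tensor([0, 1, 2, 3])

classifier = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4))

x, y = make_rotation_batch(torch.rand(3, 300, 200))
loss = nn.functional.cross_entropy(classifier(x), y)
```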
☆ Learning from Online Videos at Inference Time for Computer-Use Agents
Yujian Liu, Ze Wang, Hao Chen, Ximeng Sun, Xiaodong Yu, Jialian Wu, Jiang Liu, Emad Barsoum, Zicheng Liu, Shiyu Chang
Computer-use agents can operate computers and automate laborious tasks, but
despite recent rapid progress, they still lag behind human users, especially
when tasks require domain-specific procedural knowledge about particular
applications, platforms, and multi-step workflows. Humans can bridge this gap
by watching video tutorials: we search, skim, and selectively imitate short
segments that match our current subgoal. In this paper, we study how to enable
computer-use agents to learn from online videos at inference time effectively.
We propose a framework that retrieves and filters tutorial videos, converts
them into structured demonstration trajectories, and dynamically selects
trajectories as in-context guidance during execution. Particularly, using a
VLM, we infer UI actions, segment videos into short subsequences of actions,
and assign each subsequence a textual objective. At inference time, a two-stage
selection mechanism dynamically chooses a single trajectory to add in context
at each step, focusing the agent on the most helpful local guidance for its
next decision. Experiments on two widely used benchmarks show that our
framework consistently outperforms strong base agents and variants that use
only textual tutorials or transcripts. Analyses highlight the importance of
trajectory segmentation and selection, action filtering, and visual
information, suggesting that abundant online videos can be systematically
distilled into actionable guidance that improves computer-use agents at
inference time. Our code is available at
https://github.com/UCSB-NLP-Chang/video_demo.
☆ DMSORT: An efficient parallel maritime multi-object tracking architecture for unmanned vessel platforms
Accurate perception of the marine environment through robust multi-object
tracking (MOT) is essential for ensuring safe vessel navigation and effective
maritime surveillance. However, the complicated maritime environment often
causes camera motion and subsequent visual degradation, posing significant
challenges to MOT. To address this challenge, we propose an efficient
Dual-branch Maritime SORT (DMSORT) method for maritime MOT. The core of the
framework is a parallel tracker with affine compensation, which incorporates an
object detection and re-identification (ReID) branch, along with a dedicated
branch for dynamic camera motion estimation. Specifically, a Reversible
Columnar Detection Network (RCDN) is integrated into the detection module to
leverage multi-level visual features for robust object detection. Furthermore,
a lightweight Transformer-based appearance extractor (Li-TAE) is designed to
capture global contextual information and generate robust appearance features.
Another branch decouples platform-induced and target-intrinsic motion by
constructing a projective transformation, applying platform-motion compensation
within the Kalman filter, and thereby stabilizing true object trajectories.
Finally, a clustering-optimized feature fusion module effectively combines
motion and appearance cues to ensure identity consistency under noise,
occlusion, and drift. Extensive evaluations on the Singapore Maritime Dataset
demonstrate that DMSORT achieves state-of-the-art performance. Notably, DMSORT
attains the fastest runtime among existing ReID-based MOT frameworks while
maintaining high identity consistency and robustness to jitter and occlusion.
Code is available at:
https://github.com/BiscuitsLzy/DMSORT-An-efficient-parallel-maritime-multi-object-tracking-architecture-.
comment: Updated version of the Ocean Engineering (Elsevier, 2025) paper with
minor corrections
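The camera-motion branch boils down to estimating a frame-to-frame transform from the background and applying it to predicted track states before association, so platform motion is not mistaken for target motion. The sketch below uses a homography and OpenCV calls; the keypoints are placeholders and this is not DMSORT's estimator.

```python
# Minimal sketch of platform-motion compensation applied to predicted track centers.
import cv2
import numpy as np

prev_pts = (np.random.rand(50, 1, 2) * 640).astype(np.float32)   # background points
curr_pts = prev_pts + np.float32([2.0, -1.0])                    # apparent global shift

H, _ = cv2.findHomography(prev_pts, curr_pts, cv2.RANSAC)

track_centers = np.array([[[100.0, 200.0]], [[320.0, 240.0]]], dtype=np.float32)
compensated = cv2.perspectiveTransform(track_centers, H)
# `compensated` would replace the Kalman-predicted positions prior to data association.
```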
☆ Automated Tennis Player and Ball Tracking with Court Keypoints Detection (Hawk Eye System)
This study presents a complete pipeline for automated tennis match analysis.
Our framework integrates multiple deep learning models to detect and track
players and the tennis ball in real time, while also identifying court
keypoints for spatial reference. Using YOLOv8 for player detection, a
custom-trained YOLOv5 model for ball tracking, and a ResNet50-based
architecture for court keypoint detection, our system provides detailed
analytics including player movement patterns, ball speed, shot accuracy, and
player reaction times. The experimental results demonstrate robust performance
in varying court conditions and match scenarios. The model outputs an annotated
video along with detailed performance metrics, enabling coaches, broadcasters,
and players to gain actionable insights into the dynamics of the game.
comment: 14 pages, 11 figures, planning to submit to a conference
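Of the three models listed, the court-keypoint branch is the simplest to sketch: a ResNet-50 whose final layer regresses (x, y) coordinates for K court keypoints. The keypoint count and head design below are assumptions, and the YOLO-based player and ball detectors are separate models not reproduced here.

```python
# Minimal sketch of a ResNet-50 court-keypoint regressor (stand-in for the paper's model).
import torch
import torch.nn as nn
import torchvision

K = 14  # assumed number of court keypoints
backbone = torchvision.models.resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 2 * K)

frame = torch.rand(1, 3, 224, 224)
keypoints = backbone(frame).view(1, K, 2)   # predicted (x, y) per keypoint
```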
☆ Text to Sketch Generation with Multi-Styles NeurIPS 2025
Recent advances in vision-language models have facilitated progress in sketch
generation. However, existing specialized methods primarily focus on generic
synthesis and lack mechanisms for precise control over sketch styles. In this
work, we propose a training-free framework based on diffusion models that
enables explicit style guidance via textual prompts and referenced style
sketches. Unlike previous style transfer methods that overwrite key and value
matrices in self-attention, we incorporate the reference features as auxiliary
information with linear smoothing and leverage a style-content guidance
mechanism. This design effectively reduces content leakage from reference
sketches and enhances synthesis quality, especially in cases with low
structural similarity between reference and target sketches. Furthermore, we
extend our framework to support controllable multi-style generation by
integrating features from multiple reference sketches, coordinated via a joint
AdaIN module. Extensive experiments demonstrate that our approach achieves
high-quality sketch generation with accurate style alignment and improved
flexibility in style control. The official implementation of M3S is available
at https://github.com/CMACH508/M3S.
comment: Accepted by NeurIPS 2025
☆ Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration NeurIPS 2025
In this paper, we propose Tortoise and Hare Guidance (THG), a training-free
strategy that accelerates diffusion sampling while maintaining high-fidelity
generation. We demonstrate that the noise estimate and the additional guidance
term exhibit markedly different sensitivity to numerical error by reformulating
the classifier-free guidance (CFG) ODE as a multirate system of ODEs. Our
error-bound analysis shows that the additional guidance branch is more robust
to approximation, revealing substantial redundancy that conventional solvers
fail to exploit. Building on this insight, THG significantly reduces the
computation of the additional guidance: the noise estimate is integrated with
the tortoise equation on the original, fine-grained timestep grid, while the
additional guidance is integrated with the hare equation only on a coarse grid.
We also introduce (i) an error-bound-aware timestep sampler that adaptively
selects step sizes and (ii) a guidance-scale scheduler that stabilizes large
extrapolation spans. THG reduces the number of function evaluations (NFE) by up
to 30% with virtually no loss in generation fidelity ($\Delta$ImageReward
$\leq$ 0.032) and outperforms state-of-the-art CFG-based training-free
accelerators under identical computation budgets. Our findings highlight the
potential of multirate formulations for diffusion solvers, paving the way for
real-time high-quality image synthesis without any model retraining. The source
code is available at https://github.com/yhlee-add/THG.
comment: 21 pages, 8 figures. NeurIPS 2025. Project page:
https://yhlee-add.github.io/THG
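The multirate idea can be mimicked in a plain sampling loop: refresh the noise estimate every step (the tortoise) while recomputing the guidance difference only every few steps (the hare) and reusing the cached value in between. Everything below is a toy stand-in, not the authors' THG solver or schedulers.

```python
# Minimal sketch of coarse-grid guidance reuse inside a CFG sampling loop.
import torch

def eps_cond(x, t):      # stand-in for the conditional noise prediction
    return 0.1 * x

def eps_uncond(x, t):    # stand-in for the unconditional noise prediction
    return 0.05 * x

def sample(x, timesteps, guidance_scale=7.5, coarse_every=3):
    cached_guidance = None
    for i, t in enumerate(timesteps):
        e_c = eps_cond(x, t)                              # fine grid: every step
        if cached_guidance is None or i % coarse_every == 0:
            cached_guidance = e_c - eps_uncond(x, t)      # coarse grid: every few steps
        eps = e_c + (guidance_scale - 1.0) * cached_guidance   # CFG combination
        x = x - 0.01 * eps                                # placeholder update rule
    return x

out = sample(torch.randn(1, 4, 32, 32), timesteps=range(50))
```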
☆ SpatialLock: Precise Spatial Control in Text-to-Image Synthesis
Text-to-Image (T2I) synthesis has made significant advancements in recent
years, driving applications such as generating datasets automatically. However,
precise control over object localization in generated images remains a
challenge. Existing methods fail to fully utilize positional information,
leading to an inadequate understanding of object spatial layouts. To address
this issue, we propose SpatialLock, a novel framework that leverages perception
signals and grounding information to jointly control the generation of spatial
locations. SpatialLock incorporates two components: Position-Engaged Injection
(PoI) and Position-Guided Learning (PoG). PoI directly integrates spatial
information through an attention layer, encouraging the model to learn the
grounding information effectively. PoG employs perception-based supervision to
further refine object localization. Together, these components enable the model
to generate objects with precise spatial arrangements and improve the visual
quality of the generated images. Experiments show that SpatialLock sets a new
state-of-the-art for precise object positioning, achieving IoU scores above 0.9
across multiple datasets.
comment: Work in progress
☆ When Swin Transformer Meets KANs: An Improved Transformer Architecture for Medical Image Segmentation
Medical image segmentation is critical for accurate diagnostics and treatment
planning, but remains challenging due to complex anatomical structures and
limited annotated training data. CNN-based segmentation methods excel at local
feature extraction, but struggle with modeling long-range dependencies.
Transformers, on the other hand, capture global context more effectively, but
are inherently data-hungry and computationally expensive. In this work, we
introduce UKAST, a U-Net like architecture that integrates rational-function
based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders. By
leveraging rational base functions and Group Rational KANs (GR-KANs) from the
Kolmogorov-Arnold Transformer (KAT), our architecture addresses the
inefficiencies of vanilla spline-based KANs, yielding a more expressive and
data-efficient framework with reduced FLOPs and only a very small increase in
parameter count compared to SwinUNETR. UKAST achieves state-of-the-art
performance on four diverse 2D and 3D medical image segmentation benchmarks,
consistently surpassing both CNN- and Transformer-based baselines. Notably, it
attains superior accuracy in data-scarce settings, alleviating the data-hungry
limitations of standard Vision Transformers. These results show the potential
of KAN-enhanced Transformers to advance data-efficient medical image
segmentation. Code is available at: https://github.com/nsapkota417/UKAST
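The rational-function building block can be sketched as a "safe" Padé activation, y = P(x) / (1 + |Q(x)|), with coefficients shared within channel groups. Degrees, grouping, and initialization below are illustrative choices, not UKAST's exact GR-KAN layer.

```python
# Minimal sketch of a group-wise rational activation (channels must divide evenly into groups).
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    def __init__(self, groups=4, p_degree=3, q_degree=2):
        super().__init__()
        self.groups = groups
        self.p = nn.Parameter(torch.randn(groups, p_degree + 1) * 0.1)  # numerator coeffs
        self.q = nn.Parameter(torch.randn(groups, q_degree) * 0.1)      # denominator coeffs

    def forward(self, x):                       # x: (batch, channels)
        b, c = x.shape
        xg = x.view(b, self.groups, c // self.groups)
        pw_p = torch.stack([xg ** i for i in range(self.p.shape[1])], dim=-1)
        pw_q = torch.stack([xg ** (i + 1) for i in range(self.q.shape[1])], dim=-1)
        num = (pw_p * self.p[None, :, None, :]).sum(-1)
        den = 1.0 + (pw_q * self.q[None, :, None, :]).sum(-1).abs()
        return (num / den).view(b, c)

y = RationalActivation()(torch.randn(8, 64))
```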
☆ Adversarial and Score-Based CT Denoising: CycleGAN vs Noise2Score
We study CT image denoising in the unpaired and self-supervised regimes by
evaluating two strong, training-data-efficient paradigms: a CycleGAN-based
residual translator and a Noise2Score (N2S) score-matching denoiser. Under a
common evaluation protocol, a configuration sweep identifies a simple standard
U-Net backbone within CycleGAN (lambda_cycle = 30, lambda_iden = 2, ngf = ndf =
64) as the most reliable setting; we then train it to convergence with a longer
schedule. The selected CycleGAN improves the noisy input from 34.66 dB / 0.9234
SSIM to 38.913 dB / 0.971 SSIM and attains an estimated score of 1.9441 and an
unseen-set (Kaggle leaderboard) score of 1.9343. Noise2Score, while slightly
behind in absolute PSNR / SSIM, achieves large gains over very noisy inputs,
highlighting its utility when clean pairs are unavailable. Overall, CycleGAN
offers the strongest final image quality, whereas Noise2Score provides a robust
pair-free alternative with competitive performance. Source code is available at
https://github.com/hanifsyarubany/CT-Scan-Image-Denoising-using-CycleGAN-and-Noise2Score.
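The configuration reported above maps directly onto the standard CycleGAN generator objective: adversarial terms plus cycle-consistency and identity terms weighted by lambda_cycle = 30 and lambda_iden = 2. The generators below are trivial stand-ins; only the loss assembly is shown.

```python
# Minimal sketch of the generator-side loss weighting for an unpaired noisy<->clean translator.
import torch
import torch.nn as nn

G_n2c = nn.Conv2d(1, 1, 3, padding=1)   # stand-in noisy -> clean generator
G_c2n = nn.Conv2d(1, 1, 3, padding=1)   # stand-in clean -> noisy generator
l1 = nn.L1Loss()

def generator_loss(noisy, clean, adv_n2c, adv_c2n, lambda_cycle=30.0, lambda_iden=2.0):
    cycle = l1(G_c2n(G_n2c(noisy)), noisy) + l1(G_n2c(G_c2n(clean)), clean)
    identity = l1(G_n2c(clean), clean) + l1(G_c2n(noisy), noisy)
    return adv_n2c + adv_c2n + lambda_cycle * cycle + lambda_iden * identity

noisy, clean = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
total = generator_loss(noisy, clean, torch.tensor(0.3), torch.tensor(0.4))
```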
☆ Unveiling Deep Semantic Uncertainty Perception for Language-Anchored Multi-modal Vision-Brain Alignment
Unveiling visual semantics from neural signals such as EEG, MEG, and fMRI
remains a fundamental challenge due to subject variability and the entangled
nature of visual features. Existing approaches primarily align neural activity
directly with visual embeddings, but visual-only representations often fail to
capture latent semantic dimensions, limiting interpretability and robustness.
To address these limitations, we propose Bratrix, the first
end-to-end framework to achieve multimodal Language-Anchored Vision-Brain
alignment. Bratrix decouples visual stimuli into hierarchical visual and
linguistic semantic components, and projects both visual and brain
representations into a shared latent space, enabling the formation of aligned
visual-language and brain-language embeddings. To emulate human-like perceptual
reliability and handle noisy neural signals, Bratrix incorporates a novel
uncertainty perception module that applies uncertainty-aware weighting during
alignment. By leveraging learnable language-anchored semantic matrices to
enhance cross-modal correlations and employing a two-stage training strategy of
single-modality pretraining followed by multimodal fine-tuning, Bratrix-M
improves alignment precision. Extensive experiments on EEG, MEG, and fMRI
benchmarks demonstrate that Bratrix improves retrieval, reconstruction, and
captioning performance compared to state-of-the-art methods, notably surpassing
the previous best by 14.3% on the 200-way EEG retrieval task. Code and model
are available.
comment: 30 pages, 16 figures, under review as a conference paper
☆ A Hybrid Deep Learning Model for Robust Biometric Authentication from Low-Frame-Rate PPG Signals IEEE
Photoplethysmography (PPG) signals, which measure changes in blood volume in
the skin using light, have recently gained attention in biometric
authentication because of their non-invasive acquisition, inherent liveness
detection, and suitability for low-cost wearable devices. However, PPG signal
quality is challenged by motion artifacts, illumination changes, and
inter-subject physiological variability, making robust feature extraction and
classification crucial. This study proposes a lightweight and cost-effective
biometric authentication framework based on PPG signals extracted from
low-frame-rate fingertip videos. The CFIHSR dataset, comprising PPG recordings
from 46 subjects at a sampling rate of 14 Hz, is employed for evaluation. The
raw PPG signals undergo a standard preprocessing pipeline involving baseline
drift removal, motion artifact suppression using Principal Component Analysis
(PCA), bandpass filtering, Fourier-based resampling, and amplitude
normalization. To generate robust representations, each one-dimensional PPG
segment is converted into a two-dimensional time-frequency scalogram via the
Continuous Wavelet Transform (CWT), effectively capturing transient
cardiovascular dynamics. We developed a hybrid deep learning model, termed
CVT-ConvMixer-LSTM, by combining spatial features from the Convolutional Vision
Transformer (CVT) and ConvMixer branches with temporal features from a Long
Short-Term Memory network (LSTM). The experimental results on 46 subjects
demonstrate an authentication accuracy of 98%, validating the robustness of the
model to noise and variability between subjects. Due to its efficiency,
scalability, and inherent liveness detection capability, the proposed system is
well-suited for real-world mobile and embedded biometric security applications.
comment: This work has been submitted to IEEE Transactions on Biometrics,
Behavior, and Identity Science (TBIOM) for possible publication
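The scalogram step described above can be reproduced with PyWavelets: apply the Continuous Wavelet Transform to each 1-D PPG segment and take the coefficient magnitudes as a 2-D time-frequency image. The wavelet choice and scale range below are assumptions rather than the paper's exact settings.

```python
# Minimal sketch of converting a PPG segment into a CWT scalogram.
import numpy as np
import pywt

fs = 14.0                                  # sampling rate of the fingertip-video PPG
t = np.arange(0, 10, 1 / fs)
ppg = np.sin(2 * np.pi * 1.2 * t)          # synthetic segment (~72 bpm)

scales = np.arange(1, 64)
coeffs, freqs = pywt.cwt(ppg, scales, "morl", sampling_period=1 / fs)
scalogram = np.abs(coeffs)                 # (len(scales), len(ppg)) image fed to the network
```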
☆ Near-Lossless 3D Voxel Representation Free from Iso-surface
Yihao Luo, Xianglong He, Chuanyu Pan, Yiwen Chen, Jiaqi Wu, Yangguang Li, Wanli Ouyang, Yuanming Hu, Guang Yang, ChoonHwai Yap
Accurate and efficient voxelized representations of 3D meshes are the
foundation of 3D reconstruction and generation. However, existing
representations based on iso-surface heavily rely on water-tightening or
rendering optimization, which inevitably compromise geometric fidelity. We
propose Faithful Contouring, a sparse voxelized representation that supports
2048+ resolutions for arbitrary meshes, requiring neither converting meshes to
field functions nor extracting the isosurface during remeshing. It achieves
near-lossless fidelity by preserving sharpness and internal structures, even
for challenging cases with complex geometry and topology. The proposed method
also shows flexibility for texturing, manipulation, and editing. Beyond
representation, we design a dual-mode autoencoder for Faithful Contouring,
enabling scalable and detail-preserving shape reconstruction. Extensive
experiments show that Faithful Contouring surpasses existing methods in
accuracy and efficiency for both representation and reconstruction. For direct
representation, it achieves distance errors at the $10^{-5}$ level; for mesh
reconstruction, it yields a 93\% reduction in Chamfer Distance and a 35\%
improvement in F-score over strong baselines, confirming superior fidelity as a
representation for 3D learning tasks.
☆ MedDChest: A Content-Aware Multimodal Foundational Vision Model for Thoracic Imaging
The performance of vision models in medical imaging is often hindered by the
prevailing paradigm of fine-tuning backbones pre-trained on out-of-domain
natural images. To address this fundamental domain gap, we propose MedDChest, a
new foundational Vision Transformer (ViT) model optimized specifically for
thoracic imaging. We pre-trained MedDChest from scratch on a massive, curated,
multimodal dataset of over 1.2 million images, encompassing different
modalities including Chest X-ray and Computed Tomography (CT) compiled from 10
public sources. A core technical contribution of our work is Guided Random
Resized Crops, a novel content-aware data augmentation strategy that biases
sampling towards anatomically relevant regions, overcoming the inefficiency of
standard cropping techniques on medical scans. We validate our model's
effectiveness by fine-tuning it on a diverse set of downstream diagnostic
tasks. Comprehensive experiments empirically demonstrate that MedDChest
significantly outperforms strong, publicly available ImageNet-pretrained
models. By establishing the superiority of large-scale, in-domain pre-training
combined with domain-specific data augmentation, MedDChest provides a powerful
and robust feature extractor that serves as a significantly better starting
point for a wide array of thoracic diagnostic tasks. The model weights will be
made publicly available to foster future research and applications.
comment: 10 pages, 2 figures
☆ GNN-MoE: Context-Aware Patch Routing using GNNs for Parameter-Efficient Domain Generalization
Domain generalization (DG) seeks robust Vision Transformer (ViT) performance
on unseen domains. Efficiently adapting pretrained ViTs for DG is challenging;
standard fine-tuning is costly and can impair generalization. We propose
GNN-MoE, enhancing Parameter-Efficient Fine-Tuning (PEFT) for DG with a
Mixture-of-Experts (MoE) framework using efficient Kronecker adapters. Instead
of token-based routing, a novel Graph Neural Network (GNN) router (GCN, GAT,
SAGE) operates on inter-patch graphs to dynamically assign patches to
specialized experts. This context-aware GNN routing leverages inter-patch
relationships for better adaptation to domain shifts. GNN-MoE achieves
state-of-the-art or competitive DG benchmark performance with high parameter
efficiency, highlighting the utility of graph-based contextual routing for
robust, lightweight DG.
comment: 6 pages, 3 figures
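The routing idea can be sketched with plain tensor operations: build a k-NN graph over patch tokens, aggregate neighbor features in one GCN-style step, and softmax the aggregated context over experts. The expert adapters, graph-construction details, and the exact GNN variant used by GNN-MoE are not reproduced here.

```python
# Minimal sketch of context-aware patch-to-expert routing over a k-NN patch graph.
import torch

dim, num_experts, k = 64, 4, 8
router = torch.nn.Linear(dim, num_experts)

def gcn_route(tokens):
    """tokens: (num_patches, dim) -> (num_patches, num_experts) routing weights."""
    sim = tokens @ tokens.T
    knn = sim.topk(k, dim=-1).indices
    adj = torch.zeros_like(sim).scatter_(1, knn, 1.0)
    adj = ((adj + adj.T) > 0).float()                  # symmetrize the k-NN graph
    deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
    context = (adj / deg) @ tokens                     # one mean-aggregation step
    return torch.softmax(router(context), dim=-1)

weights = gcn_route(torch.randn(196, dim))             # per-patch expert assignment
```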
☆ PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection
Recent advances in text-to-video generation have achieved impressive
perceptual quality, yet generated content often violates fundamental principles
of physical plausibility - manifesting as implausible object dynamics,
incoherent interactions, and unrealistic motion patterns. Such failures hinder
the deployment of video generation models in embodied AI, robotics, and
simulation-intensive domains. To bridge this gap, we propose PhysCorr, a
unified framework for modeling, evaluating, and optimizing physical consistency
in video generation. Specifically, we introduce PhysicsRM, the first
dual-dimensional reward model that quantifies both intra-object stability and
inter-object interactions. On this foundation, we develop PhyDPO, a novel
direct preference optimization pipeline that leverages contrastive feedback and
physics-aware reweighting to guide generation toward physically coherent
outputs. Our approach is model-agnostic and scalable, enabling seamless
integration into a wide range of video diffusion and transformer-based
backbones. Extensive experiments across multiple benchmarks demonstrate that
PhysCorr achieves significant improvements in physical realism while preserving
visual fidelity and semantic alignment. This work takes a critical step toward
physically grounded and trustworthy video generation.
☆ CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation
Referring 3D Gaussian Splatting Segmentation (R3DGS) aims to interpret
free-form language expressions and localize the corresponding 3D regions in
Gaussian fields. While recent advances have introduced cross-modal alignment
between language and 3D geometry, existing pipelines still struggle with
cross-view consistency due to their reliance on 2D rendered pseudo supervision
and view specific feature learning. In this work, we present Camera Aware
Referring Field (CaRF), a fully differentiable framework that operates directly
in the 3D Gaussian space and achieves multi view consistency. Specifically,
CaRF introduces Gaussian Field Camera Encoding (GFCE), which incorporates
camera geometry into Gaussian text interactions to explicitly model view
dependent variations and enhance geometric reasoning. Building on this, In
Training Paired View Supervision (ITPVS) is proposed to align per Gaussian
logits across calibrated views during training, effectively mitigating single
view overfitting and exposing inter view discrepancies for optimization.
Extensive experiments on three representative benchmarks demonstrate that CaRF
achieves average improvements of 16.8%, 4.3%, and 2.0% in mIoU over state of
the art methods on the Ref LERF, LERF OVS, and 3D OVS datasets, respectively.
Moreover, this work promotes more reliable and view consistent 3D scene
understanding, with potential benefits for embodied AI, AR/VR interaction, and
autonomous perception.
☆ Simple 3D Pose Features Support Human and Machine Social Scene Understanding
Humans can quickly and effortlessly extract a variety of information about
others' social interactions from visual input, ranging from visuospatial cues
like whether two people are facing each other to higher-level information. Yet,
the computations supporting these abilities remain poorly understood, and
social interaction recognition continues to challenge even the most advanced AI
vision systems. Here, we hypothesized that humans rely on 3D visuospatial pose
information to make social interaction judgments, which is absent in most AI
vision models. To test this, we combined state-of-the-art pose and depth
estimation algorithms to extract 3D joint positions of people in short video
clips depicting everyday human actions and compared their ability to predict
human social interaction judgments with current AI vision models. Strikingly,
3D joint positions outperformed most current AI vision models, revealing that
key social information is available in explicit body position but not in the
learned features of most vision models, including even the layer-wise
embeddings of the pose models used to extract joint positions. To uncover the
critical pose features humans use to make social judgments, we derived a
compact set of 3D social pose features describing only the 3D position and
direction of faces in the videos. We found that these minimal descriptors
matched the predictive strength of the full set of 3D joints and significantly
improved the performance of off-the-shelf AI vision models when combined with
their embeddings. Moreover, the degree to which 3D social pose features were
represented in each off-the-shelf AI vision model predicted the model's ability
to match human social judgments. Together, our findings provide strong evidence
that human social scene understanding relies on explicit representations of 3D
pose and can be supported by simple, structured visuospatial primitives.
comment: 28 pages, 6 figures
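To make the "position and direction of faces" descriptor concrete, the sketch below derives a facing vector from ear keypoints and combines it with head positions into interpersonal quantities such as distance and mutual facing. The joint choice, coordinate convention, and feature set are assumptions, not the authors' exact descriptors.

```python
# Minimal sketch of compact 3D social pose features for a pair of people.
import numpy as np

def facing_direction(l_ear, r_ear):
    """Forward vector as the horizontal normal to the inter-ear axis (assumes z-up)."""
    across = r_ear - l_ear
    forward = np.cross(across, np.array([0.0, 0.0, 1.0]))
    return forward / (np.linalg.norm(forward) + 1e-8)

def social_pose_features(head_a, ears_a, head_b, ears_b):
    f_a, f_b = facing_direction(*ears_a), facing_direction(*ears_b)
    to_b = (head_b - head_a) / (np.linalg.norm(head_b - head_a) + 1e-8)
    return {"distance": float(np.linalg.norm(head_b - head_a)),
            "a_faces_b": float(np.dot(f_a, to_b)),    # > 0 when A is oriented toward B
            "b_faces_a": float(np.dot(f_b, -to_b))}

feats = social_pose_features(
    np.array([0.0, 0.0, 1.7]), (np.array([-0.1, 0.0, 1.7]), np.array([0.1, 0.0, 1.7])),
    np.array([1.0, 1.0, 1.7]), (np.array([0.9, 1.0, 1.7]), np.array([1.1, 1.0, 1.7])))
```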
☆ Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images
Modern scene reconstruction methods are able to accurately recover 3D
surfaces that are visible in one or more images. However, this leads to
incomplete reconstructions, missing all occluded surfaces. While much progress
has been made on reconstructing entire objects given partial observations using
generative models, the structural elements of a scene, like the walls, floors
and ceilings, have received less attention. We argue that these scene elements
should be relatively easy to predict, since they are typically planar,
repetitive and simple, and so less costly approaches may be suitable. In this
work, we present a synthetic dataset -- Room Envelopes -- that facilitates
progress on this task by providing a set of RGB images and two associated
pointmaps for each image: one capturing the visible surface and one capturing
the first surface once fittings and fixtures are removed, that is, the
structural layout. As we show, this enables direct supervision for feed-forward
monocular geometry estimators that predict both the first visible surface and
the first layout surface. This confers an understanding of the scene's extent,
as well as the shape and location of its objects.
☆ A Linear Fractional Transformation Model and Calibration Method for Light Field Camera
Accurate calibration of internal parameters is a crucial yet challenging
prerequisite for 3D reconstruction using light field cameras. In this paper, we
propose a linear fractional transformation (LFT) parameter $\alpha$ to decouple
the main lens and the micro-lens array (MLA). The proposed method includes an
analytical solution based on least squares, followed by nonlinear refinement.
The method for detecting features from the raw images is also introduced.
Experimental results on both physical and simulated data have verified the
performance of the proposed method. Based on the proposed model, the simulation of raw
light field images becomes faster, which is crucial for data-driven deep
learning methods. The corresponding code can be obtained from the author's
website.
☆ Improving Multi-View Reconstruction via Texture-Guided Gaussian-Mesh Joint Optimization
Reconstructing real-world objects from multi-view images is essential for
applications in 3D editing, AR/VR, and digital content creation. Existing
methods typically prioritize either geometric accuracy (Multi-View Stereo) or
photorealistic rendering (Novel View Synthesis), often decoupling geometry and
appearance optimization, which hinders downstream editing tasks. This paper
advocates a unified treatment of geometry and appearance optimization for
seamless Gaussian-mesh joint optimization. More specifically, we propose a
novel framework that simultaneously optimizes mesh geometry (vertex positions
and faces) and vertex colors via Gaussian-guided mesh differentiable rendering,
leveraging photometric consistency from input images and geometric
regularization from normal and depth maps. The obtained high-quality 3D
reconstruction can be further exploited in downstream editing tasks, such as
relighting and shape deformation. The code will be publicly available upon
acceptance.
comment: 10 pages
☆ Adaptive Temporal Refinement: Continuous Depth Allocation and Distance Regression for Efficient Action Localization
Temporal action localization requires precise boundary detection; however,
current methods apply uniform computation despite significant variations in
difficulty across boundaries. We present two complementary contributions.
First, Boundary Distance Regression (BDR) provides information-theoretically
optimal localization through signed-distance regression rather than
classification, achieving 43\% sharper boundary peaks. BDR retrofits to
existing methods with approximately 50 lines of code, yielding consistent 1.8
to 3.1\% mAP@0.7 improvements across diverse architectures. Second, Adaptive
Temporal Refinement (ATR) allocates computation via continuous depth selection
$\tau \in [0,1]$, enabling end-to-end differentiable optimization without
reinforcement learning. On THUMOS14, ATR achieves 56.5\% mAP@0.7 at 162G FLOPs,
compared to 53.6\% at 198G for uniform processing, providing a 2.9\%
improvement with 18\% less compute. Gains scale with boundary heterogeneity,
showing 4.2\% improvement on short actions. Training cost is mitigated via
knowledge distillation, with lightweight students retaining 99\% performance at
baseline cost. Results are validated across four benchmarks with rigorous
statistical testing.
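The BDR target construction can be illustrated simply: for each temporal position, regress the (clipped, normalized) signed offset to the nearest action boundary instead of a per-frame boundary class. The clipping radius and normalization below are illustrative choices, not the paper's exact recipe.

```python
# Minimal sketch of signed-distance regression targets for action boundaries.
import torch

def signed_distance_targets(num_frames, boundaries, clip=16.0):
    idx = torch.arange(num_frames, dtype=torch.float32)[:, None]   # (T, 1)
    b = torch.tensor(boundaries, dtype=torch.float32)[None, :]     # (1, K)
    diff = b - idx                                                  # signed offsets to boundaries
    nearest = diff.gather(1, diff.abs().argmin(dim=1, keepdim=True)).squeeze(1)
    return nearest.clamp(-clip, clip) / clip                        # in [-1, 1]

targets = signed_distance_targets(100, boundaries=[12, 47, 83])
pred = torch.zeros(100, requires_grad=True)
loss = torch.nn.functional.smooth_l1_loss(pred, targets)
```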
☆ NVIDIA Nemotron Nano V2 VL
NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Guo Chen, Karan Sapra, Zhiding Yu, Adi Renduchintala, Charles Wang, Peter Jin, Arushi Goel, Mike Ranzinger, Lukas Voegtle, Philipp Fischer, Timo Roman, Wei Ping, Boxin Wang, Zhuolin Yang, Nayeon Lee, Shaokun Zhang, Fuxiao Liu, Zhiqi Li, Di Zhang, Greg Heinrich, Hongxu, Yin, Song Han, Pavlo Molchanov, Parth Mannan, Yao Xu, Jane Polak Scowcroft, Tom Balough, Subhashree Radhakrishnan, Paris Zhang, Sean Cha, Ratnesh Kumar, Zaid Pervaiz Bhat, Jian Zhang, Darragh Hanley, Pritam Biswas, Jesse Oliver, Kevin Vasques, Roger Waleffe, Duncan Riach, Oluwatobi Olabiyi, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Pritam Gundecha, Khanh Nguyen, Alexandre Milesi, Eugene Khvedchenia, Ran Zilberstein, Ofri Masad, Natan Bagrov, Nave Assaf, Tomer Asida, Daniel Afrimi, Amit Zuker, Netanel Haber, Zhiyu Cheng, Jingyu, Xin, Di, Wu, Nik Spirin, Maryam Moosaei, Roman Ageev, Vanshil Atul Shah, Yuting Wu, Daniel Korzekwa, Unnikrishnan Kizhakkemadam Sreekumar, Wanli Jiang, Padmavathy Subramanian, Alejandra Rico, Sandip Bhaskar, Saeid Motiian, Kedi Wu, Annie Surla, Chia-Chih Chen, Hayden Wolff, Matthew Feinberg, Melissa Corpuz, Marek Wawrzos, Eileen Long, Aastha Jhunjhunwala, Paul Hendricks, Farzan Memarian, Benika Hall, Xin-Yu Wang, David Mosallanezhad, Soumye Singhal, Luis Vega, Katherine Cheung, Krzysztof Pawelec, Michael Evans, Katherine Luna, Jie Lou, Erick Galinkin, Akshay Hazare, Kaustubh Purandare, Ann Guan, Anna Warno, Chen Cui, Yoshi Suhara, Shibani Likhite, Seph Mard, Meredith Price, Laya Sleiman, Saori Kaji, Udi Karpas, Kari Briski, Joey Conway, Michael Lightstone, Jan Kautz, Mohammad Shoeybi, Mostofa Patwary, Jonathen Cohen, Oleksii Kuchaiev, Andrew Tao, Bryan Catanzaro
We introduce Nemotron Nano V2 VL, the latest model of the Nemotron
vision-language series designed for strong real-world document understanding,
long video comprehension, and reasoning tasks. Nemotron Nano V2 VL delivers
significant improvements over our previous model,
Llama-3.1-Nemotron-Nano-VL-8B, across all vision and text domains through major
enhancements in model architecture, datasets, and training recipes. Nemotron
Nano V2 VL builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, and
innovative token reduction techniques to achieve higher inference throughput in
long document and video scenarios. We are releasing model checkpoints in BF16,
FP8, and FP4 formats and sharing large parts of our datasets, recipes and
training code.
♻ ☆ TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models
Image-text models excel at image-level tasks but struggle with detailed
visual understanding. While these models provide strong visual-language
alignment, segmentation models like SAM2 offer precise spatial boundaries for
objects. To this end, we propose TextRegion, a simple, effective, and
training-free framework that combines the strengths of image-text models and
SAM2 to generate powerful text-aligned region tokens. These tokens enable
detailed visual understanding while preserving open-vocabulary capabilities.
They can be directly applied to various downstream tasks, including open-world
semantic segmentation, referring expression comprehension, and grounding. We
conduct extensive evaluations and consistently achieve superior or competitive
performance compared to state-of-the-art training-free methods. Additionally,
our framework is compatible with many image-text models, making it highly
practical and easily extensible as stronger models emerge. Code is available
at: https://github.com/avaxiao/TextRegion.
comment: Published in TMLR, with a J2C Certification
♻ ☆ Residual Kolmogorov-Arnold Network for Enhanced Deep Learning
Despite their immense success, deep convolutional neural networks (CNNs) can
be difficult to optimize and costly to train due to hundreds of layers within
the network depth. Conventional convolutional operations are fundamentally
limited by their linear nature along with fixed activations, where many layers
are needed to learn meaningful patterns in data. Because of the sheer size of
these networks, this approach is simply computationally inefficient, and poses
overfitting or gradient explosion risks, especially in small datasets. As a
result, we introduce a "plug-in" module, called Residual Kolmogorov-Arnold
Network (RKAN). Our module is highly compact, so it can be easily added into
any stage (level) of traditional deep networks, where it learns to integrate
supportive polynomial feature transformations to existing convolutional
frameworks. RKAN offers consistent improvements over baseline models in
different vision tasks and widely tested benchmarks, accomplishing cutting-edge
performance on them.
comment: Code is available at https://github.com/withray/residualKAN.git
♻ ☆ Particle-Grid Neural Dynamics for Learning Deformable Object Models from RGB-D Videos
Modeling the dynamics of deformable objects is challenging due to their
diverse physical properties and the difficulty of estimating states from
limited visual information. We address these challenges with a neural dynamics
framework that combines object particles and spatial grids in a hybrid
representation. Our particle-grid model captures global shape and motion
information while predicting dense particle movements, enabling the modeling of
objects with varied shapes and materials. Particles represent object shapes,
while the spatial grid discretizes the 3D space to ensure spatial continuity
and enhance learning efficiency. Coupled with Gaussian Splatting for visual
rendering, our framework achieves a fully learning-based digital twin of
deformable objects and generates 3D action-conditioned videos. Through
experiments, we demonstrate that our model learns the dynamics of diverse
objects -- such as ropes, cloths, stuffed animals, and paper bags -- from
sparse-view RGB-D recordings of robot-object interactions, while also
generalizing at the category level to unseen instances. Our approach
outperforms state-of-the-art learning-based and physics-based simulators,
particularly in scenarios with limited camera views. Furthermore, we showcase
the utility of our learned models in model-based planning, enabling
goal-conditioned object manipulation across a range of tasks. The project page
is available at https://kywind.github.io/pgnd .
comment: Project page: https://kywind.github.io/pgnd
♻ ☆ CREA: A Collaborative Multi-Agent Framework for Creative Image Editing and Generation NeurIPS'25
Creativity in AI imagery remains a fundamental challenge, requiring not only
the generation of visually compelling content but also the capacity to add
novel, expressive, and artistically rich transformations to images. Unlike
conventional editing tasks that rely on direct prompt-based modifications,
creative image editing requires an autonomous, iterative approach that balances
originality, coherence, and artistic intent. To address this, we introduce
CREA, a novel multi-agent collaborative framework that mimics the human
creative process. Our framework leverages a team of specialized AI agents who
dynamically collaborate to conceptualize, generate, critique, and enhance
images. Through extensive qualitative and quantitative evaluations, we
demonstrate that CREA significantly outperforms state-of-the-art methods in
diversity, semantic alignment, and creative transformation. To the best of our
knowledge, this is the first work to introduce the task of creative editing.
comment: Published at NeurIPS'25 Main Conference
♻ ☆ Information-driven design of imaging systems
Imaging systems have traditionally been designed to mimic the human eye and
produce visually interpretable measurements. Modern imaging systems, however,
process raw measurements computationally before or instead of human viewing. As
a result, the information content of raw measurements matters more than their
visual interpretability. Despite the importance of measurement information
content, current approaches for evaluating imaging system performance do not
quantify it: they instead either use alternative metrics that assess specific
aspects of measurement quality or assess measurements indirectly with
performance on secondary tasks.
We developed the theoretical foundations and a practical method to directly
quantify mutual information between noisy measurements and unknown objects. By
fitting probabilistic models to measurements and their noise characteristics,
our method estimates information by upper bounding its true value. By applying
gradient-based optimization to these estimates, we also developed a technique
for designing imaging systems called Information-Driven Encoder Analysis
Learning (IDEAL). Our information estimates accurately captured system
performance differences across four imaging domains (color photography, radio
astronomy, lensless imaging, and microscopy). Systems designed with IDEAL
matched the performance of those designed with end-to-end optimization, the
prevailing approach that jointly optimizes hardware and image processing
algorithms. These results establish mutual information as a universal
performance metric for imaging systems that enables both computationally
efficient design optimization and evaluation in real-world conditions.
A video summarizing this work can be found at:
https://waller-lab.github.io/EncodingInformationWebsite/
♻ ☆ SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque
Video Question Answering (VideoQA) in the surgical domain aims to enhance
intraoperative understanding by enabling AI models to reason over temporally
coherent events rather than isolated frames. Current approaches are limited to
static image features, and available datasets often lack temporal annotations,
ignoring the dynamics critical for accurate procedural interpretation. We
propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from
static images to dynamic surgical scenes. It uses a Masked Video--Text Encoder
to fuse video and question features, capturing temporal cues such as motion and
tool--tissue interactions, which a fine-tuned large language model (LLM) then
decodes into coherent answers. To evaluate its performance, we curated
REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related
questions and diagnostic attributes, as well as out-of-template questions with
rephrased or semantically altered formulations to assess model robustness.
Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset
shows that SurgViVQA outperforms existing image-based VQA benchmark models,
particularly in keyword accuracy, improving over PitVQA by +11\% on
REAL-Colon-VQA and +9\% on EndoVis18-VQA. A perturbation study on the questions
further confirms improved generalizability and robustness to variations in
question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework
for temporally-aware understanding in surgical VideoQA, enabling AI models to
interpret dynamic procedural contexts more effectively. Code and dataset
available at https://github.com/madratak/SurgViVQA.
♻ ☆ Are Minimal Radial Distortion Solvers Necessary for Relative Pose Estimation?
Estimating the relative pose between two cameras is a fundamental step in
many applications such as Structure-from-Motion. The common approach to
relative pose estimation is to apply a minimal solver inside a RANSAC loop.
Highly efficient solvers exist for pinhole cameras. Yet, (nearly) all cameras
exhibit radial distortion. Not modeling radial distortion leads to
(significantly) worse results. However, minimal radial distortion solvers are
significantly more complex than pinhole solvers, both in terms of run-time and
implementation efforts. This paper compares radial distortion solvers with a
simple-to-implement approach that combines an efficient pinhole solver with
sampled radial distortion parameters. Extensive experiments on multiple
datasets and RANSAC variants show that this simple approach performs similarly
or better than the most accurate minimal distortion solvers at faster run-times
while being significantly more accurate than faster non-minimal solvers. We
clearly show that complex radial distortion solvers are not necessary in
practice. Code and benchmark are available at https://github.com/kocurvik/rd.
comment: Code available at: https://github.com/kocurvik/rd or
https://doi.org/10.5281/zenodo.14672694
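The sampled-distortion strategy compared above can be sketched as follows: undistort the correspondences with a few candidate one-parameter division-model coefficients, run a standard pinhole solver for each candidate, and keep the hypothesis with the most inliers. The candidate list, intrinsics, and correspondences below are placeholders, and OpenCV's essential-matrix RANSAC stands in for the minimal pinhole solver.

```python
# Minimal sketch of pairing sampled division-model parameters with a pinhole solver.
import cv2
import numpy as np

def normalize(pts, K):
    return (pts - K[:2, 2]) / np.array([K[0, 0], K[1, 1]])

def undistort_division(pts, lam):
    """One-parameter division model: x_u = x_d / (1 + lambda * ||x_d||^2)."""
    r2 = np.sum(pts ** 2, axis=1, keepdims=True)
    return pts / (1.0 + lam * r2)

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
pts1 = np.random.rand(200, 2) * 640
pts2 = pts1 + np.float64([5.0, 2.0]) + 0.5 * np.random.randn(200, 2)

best = None
for lam in [-0.3, -0.15, 0.0]:                    # sampled distortion parameters
    n1 = undistort_division(normalize(pts1, K), lam)
    n2 = undistort_division(normalize(pts2, K), lam)
    E, mask = cv2.findEssentialMat(n1, n2, np.eye(3), method=cv2.RANSAC, threshold=1e-3)
    score = int(mask.sum()) if mask is not None else 0
    if best is None or score > best[0]:
        best = (score, lam, E)                    # keep the best-scoring hypothesis
```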
♻ ☆ Optimized Minimal 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for
real-time, high-performance rendering, enabling a wide range of applications.
However, representing 3D scenes with numerous explicit Gaussian primitives
imposes significant storage and memory overhead. Recent studies have shown that
high-quality rendering can be achieved with a substantially reduced number of
Gaussians when represented with high-precision attributes. Nevertheless,
existing 3DGS compression methods still rely on a relatively large number of
Gaussians, focusing primarily on attribute compression. This is because a
smaller set of Gaussians becomes increasingly sensitive to lossy attribute
compression, leading to severe quality degradation. Since the number of
Gaussians is directly tied to computational costs, it is essential to reduce
the number of Gaussians effectively rather than only optimizing storage. In
this paper, we propose Optimized Minimal Gaussians representation (OMG), which
significantly reduces storage while using a minimal number of primitives.
First, we distinguish distinct Gaussians from nearby ones, minimizing
redundancy without sacrificing quality. Second, we propose a compact and
precise attribute representation that efficiently captures both continuity and
irregularity among primitives. Additionally, we propose a sub-vector
quantization technique for improved irregularity representation, maintaining
fast training with a negligible codebook size. Extensive experiments
demonstrate that OMG reduces storage requirements by nearly 50% compared to the
previous state-of-the-art and enables 600+ FPS rendering while maintaining high
rendering quality. Our source code is available at
https://maincold2.github.io/omg/.
comment: Project page: https://maincold2.github.io/omg/
♻ ☆ JaneEye: A 12-nm 2K-FPS 18.9-$μ$J/Frame Event-based Eye Tracking Accelerator IEEE 31
Eye tracking has become a key technology for gaze-based interactions in
Extended Reality (XR). However, conventional frame-based eye-tracking systems
often fall short of XR's stringent requirements for high accuracy, low latency,
and energy efficiency. Event cameras present a compelling alternative, offering
ultra-high temporal resolution and low power consumption. In this paper, we
present JaneEye, an energy-efficient event-based eye-tracking hardware
accelerator designed specifically for wearable devices, leveraging sparse,
high-temporal-resolution event data. We introduce an ultra-lightweight neural
network architecture featuring a novel ConvJANET layer, which simplifies the
traditional ConvLSTM by retaining only the forget gate, thereby halving
computational complexity without sacrificing temporal modeling capability. Our
proposed model achieves high accuracy with a pixel error of 2.45 on the 3ET+
dataset, using only 17.6K parameters, with up to 1250 Hz event frame rate. To
further enhance hardware efficiency, we employ custom linear approximations of
activation functions (hardsigmoid and hardtanh) and fixed-point quantization.
Through software-hardware co-design, our 12-nm ASIC implementation operates at
400 MHz, delivering an end-to-end latency of 0.5 ms (equivalent to 2000 Frames
Per Second (FPS)) at an energy efficiency of 18.9 $\mu$J/frame. JaneEye sets a
new benchmark in low-power, high-performance eye-tracking solutions suitable
for integration into next-generation XR wearables.
comment: Accepted to 2026 IEEE 31st Asia and South Pacific Design Automation
Conference (ASP-DAC)
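A hedged PyTorch sketch of a ConvJANET-style cell as described in the abstract: a ConvLSTM variant that keeps only the forget gate. Treating the hidden state as equal to the cell state and using hard-sigmoid/hard-tanh activations are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvJANETCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces both the forget gate and the candidate update.
        self.conv = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.hid_ch = hid_ch

    def forward(self, x, c_prev):
        z = self.conv(torch.cat([x, c_prev], dim=1))
        f, g = torch.split(z, self.hid_ch, dim=1)
        f = F.hardsigmoid(f)            # hardware-friendly gate activation
        g = F.hardtanh(g)               # hardware-friendly candidate activation
        c = f * c_prev + (1.0 - f) * g  # forget gate only: no input/output gates
        return c                        # hidden state taken equal to the cell state

cell = ConvJANETCell(2, 16)
c = cell(torch.randn(1, 2, 64, 64), torch.zeros(1, 16, 64, 64))
```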
♻ ☆ UMA: Ultra-detailed Human Avatars via Multi-level Surface Alignment
Learning an animatable and clothed human avatar model with vivid dynamics and
photorealistic appearance from multi-view videos is an important foundational
research problem in computer graphics and vision. Fueled by recent advances in
implicit representations, the quality of the animatable avatars has achieved an
unprecedented level by attaching the implicit representation to drivable human
template meshes. However, they usually fail to preserve the highest level of
detail, particularly apparent when the virtual camera is zoomed in and when
rendering at 4K resolution and higher. We argue that this limitation stems from
inaccurate surface tracking, specifically, depth misalignment and surface drift
between character geometry and the ground truth surface, which forces the
detailed appearance model to compensate for geometric errors. To address this,
we propose a latent deformation model and supervise the 3D deformation of the
animatable character using guidance from foundational 2D video point trackers,
which offer improved robustness to shading and surface variations, and are less
prone to local minima than differentiable rendering. To mitigate the drift over
time and lack of 3D awareness of 2D point trackers, we introduce a cascaded
training strategy that generates consistent 3D point tracks by anchoring point
tracks to the rendered avatar, which ultimately supervises our avatar at the
vertex and texel level. To validate the effectiveness of our approach, we
introduce a novel dataset comprising five multi-view video sequences, each over
10 minutes in duration, captured using 40 calibrated 6K-resolution cameras,
featuring subjects dressed in clothing with challenging texture patterns and
wrinkle deformations. Our approach demonstrates significantly improved
performance in rendering quality and geometric accuracy over the prior state of
the art.
comment: Project page: https://vcai.mpi-inf.mpg.de/projects/UMA/
♻ ☆ HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
Despite emerging efforts to enhance the safety of Vision-Language Models
(VLMs), current approaches face two main shortcomings. 1) Existing
safety-tuning datasets and benchmarks only partially consider how image-text
interactions can yield harmful content, often overlooking contextually unsafe
outcomes from seemingly benign pairs. This narrow coverage leaves VLMs
vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely
primarily on data-centric tuning, with limited architectural innovations to
intrinsically strengthen safety. We address these gaps by introducing a
holistic safety dataset and benchmark, \textbf{HoliSafe}, that spans all five
safe/unsafe image-text combinations, providing a more robust basis for both
training and evaluation (HoliSafe-Bench). We further propose a novel modular
framework for enhancing VLM safety with a visual guard module (VGM) designed to
assess the harmfulness of input images for VLMs. This module endows VLMs with a
dual functionality: they not only learn to generate safer responses but can
also provide an interpretable harmfulness classification to justify their
refusal decisions. A significant advantage of this approach is its modularity;
the VGM is designed as a plug-in component, allowing for seamless integration
with diverse pre-trained VLMs across various scales. Experiments show that
Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety
performance across multiple VLM benchmarks. Additionally, the HoliSafe-Bench
itself reveals critical vulnerabilities in existing VLMs. We hope that
HoliSafe and VGM will spur further research into robust and interpretable VLM
safety, expanding future avenues for multimodal alignment.
comment: Project page: https://youngwanlee.github.io/holisafe
♻ ☆ SLAM&Render: A Benchmark for the Intersection Between Neural Rendering, Gaussian Splatting and SLAM
Models and methods originally developed for Novel View Synthesis and Scene
Rendering, such as Neural Radiance Fields (NeRF) and Gaussian Splatting, are
increasingly being adopted as representations in Simultaneous Localization and
Mapping (SLAM). However, existing datasets fail to include the specific
challenges of both fields, such as sequential operations and, in many settings,
multi-modality in SLAM or generalization across viewpoints and illumination
conditions in neural rendering. Additionally, the data are often collected
using sensors which are handheld or mounted on drones or mobile robots, which
complicates the accurate reproduction of sensor motions. To bridge these gaps,
we introduce SLAM&Render, a novel dataset designed to benchmark methods in the
intersection between SLAM, Novel View Rendering and Gaussian Splatting.
Recorded with a robot manipulator, it uniquely includes 40 sequences with
time-synchronized RGB-D images, IMU readings, robot kinematic data, and
ground-truth pose streams. By releasing robot kinematic data, the dataset also
enables the assessment of recent integrations of SLAM paradigms within robotic
applications. The dataset features five setups with consumer and industrial
objects under four controlled lighting conditions, each with separate training
and test trajectories. All sequences are static with different levels of object
rearrangements and occlusions. Our experimental results, obtained with several
baselines from the literature, validate SLAM&Render as a relevant benchmark for
this emerging research area.
comment: 9 pages, 8 figures, submitted to The International Journal of
Robotics Research (IJRR)
♻ ☆ Hemorica: A Comprehensive CT Scan Dataset for Automated Brain Hemorrhage Classification, Segmentation, and Detection
Kasra Davoodi, Mohammad Hoseyni, Javad Khoramdel, Reza Barati, Reihaneh Mortazavi, Amirhossein Nikoofard, Mahdi Aliyari-Shoorehdeli, Jaber Hatam Parikhan
Timely diagnosis of Intracranial hemorrhage (ICH) on Computed Tomography (CT)
scans remains a clinical priority, yet the development of robust Artificial
Intelligence (AI) solutions is still hindered by fragmented public data. To
close this gap, we introduce Hemorica, a publicly available collection of 372
head CT examinations acquired between 2012 and 2024. Each scan has been
exhaustively annotated for five ICH subtypes-epidural (EPH), subdural (SDH),
subarachnoid (SAH), intraparenchymal (IPH), and intraventricular (IVH)-yielding
patient-wise and slice-wise classification labels, subtype-specific bounding
boxes, two-dimensional pixel masks and three-dimensional voxel masks. A
double-reading workflow, preceded by a pilot consensus phase and supported by
neurosurgeon adjudication, maintained low inter-rater variability.
Comprehensive statistical analysis confirms the clinical realism of the
dataset. To establish reference baselines, standard convolutional and
transformer architectures were fine-tuned for binary slice classification and
hemorrhage segmentation. With only minimal fine-tuning, lightweight models such
as MobileViT-XS achieved an F1 score of 87.8% in binary classification, whereas
a U-Net with a DenseNet161 encoder reached a Dice score of 85.5% for binary
lesion segmentation, results that validate both the quality of the annotations
and the sufficiency of the sample size. Hemorica therefore offers a unified,
fine-grained benchmark that supports multi-task and curriculum learning,
facilitates transfer to larger but weakly labelled cohorts, and supports the
design of AI-based assistants for ICH detection and quantification.
comment: We need to double check the data and statistics. We will publish the
complete version in coming months
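A minimal sketch of baselines of the kind reported above (MobileViT-XS for slice-level binary classification and a U-Net with a DenseNet161 encoder for lesion segmentation). The library choices (timm, segmentation_models_pytorch) and settings are assumptions, not the authors' exact setup; pretrained weights would normally be loaded instead of the offline defaults used here.

```python
import timm
import segmentation_models_pytorch as smp
import torch

# Binary slice classification: hemorrhage present vs. absent.
clf = timm.create_model("mobilevit_xs", pretrained=False, in_chans=1, num_classes=1)

# Binary lesion segmentation with a DenseNet161 encoder.
seg = smp.Unet(encoder_name="densenet161", encoder_weights=None,
               in_channels=1, classes=1)

ct_slices = torch.randn(2, 1, 512, 512)   # toy batch of windowed CT slices
cls_logits = clf(ct_slices)               # (2, 1) slice-level logits
seg_logits = seg(ct_slices)               # (2, 1, 512, 512) pixel-level logits
```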
♻ ☆ X-Diffusion: Generating Detailed 3D MRI Volumes From a Single Image Using Cross-Sectional Diffusion Models ICCV 2025
Magnetic Resonance Imaging (MRI) is a crucial diagnostic tool, but
high-resolution scans are often slow and expensive due to extensive data
acquisition requirements. Traditional MRI reconstruction methods aim to
expedite this process by filling in missing frequency components in the
K-space, performing 3D-to-3D reconstructions that demand full 3D scans. In
contrast, we introduce X-Diffusion, a novel cross-sectional diffusion model
that reconstructs detailed 3D MRI volumes from extremely sparse spatial-domain
inputs, achieving 2D-to-3D reconstruction from as little as a single 2D MRI
slice or few slices. A key aspect of X-Diffusion is that it models MRI data as
holistic 3D volumes during the cross-sectional training and inference, unlike
previous learning approaches that treat MRI scans as collections of 2D slices
in standard planes (coronal, axial, sagittal). We evaluated X-Diffusion on
brain tumor MRIs from the BRATS dataset and full-body MRIs from the UK Biobank
dataset. Our results demonstrate that X-Diffusion not only surpasses
state-of-the-art methods in quantitative accuracy (PSNR) on unseen data but
also preserves critical anatomical features such as tumor profiles, spine
curvature, and brain volume. Remarkably, the model generalizes beyond the
training domain, successfully reconstructing knee MRIs despite being trained
exclusively on brain data. Medical expert evaluations further confirm the
clinical relevance and fidelity of the generated images. To our knowledge,
X-Diffusion is the first method capable of producing detailed 3D MRIs from
highly limited 2D input data, potentially accelerating MRI acquisition and
reducing associated costs. The code is available on the project website
https://emmanuelleb985.github.io/XDiffusion/ .
comment: accepted at ICCV 2025 GAIA workshop
https://era-ai-biomed.github.io/GAIA/ , project website:
https://emmanuelleb985.github.io/XDiffusion/
♻ ☆ Robust Self-calibration of Focal Lengths from the Fundamental Matrix CVPR 2024
The problem of self-calibration of two cameras from a given fundamental
matrix is one of the basic problems in geometric computer vision. Under the
assumption of known principal points and square pixels, the well-known Bougnoux
formula offers a means to compute the two unknown focal lengths. However, in
many practical situations, the formula yields inaccurate results due to
commonly occurring singularities. Moreover, the estimates are sensitive to
noise in the computed fundamental matrix and to the assumed positions of the
principal points. In this paper, we therefore propose an efficient and robust
iterative method to estimate the focal lengths along with the principal points
of the cameras given a fundamental matrix and priors for the estimated camera
parameters. In addition, we study a computationally efficient check of models
generated within RANSAC that improves the accuracy of the estimated models
while reducing the total computational time. Extensive experiments on real and
synthetic data show that our iterative method brings significant improvements
in terms of the accuracy of the estimated focal lengths over the Bougnoux
formula and other state-of-the-art methods, even when relying on inaccurate
priors.
comment: Published in CVPR 2024. Accepted: 26.2.2024. Published: 16.6.2024.
This work was funded by the Horizon-Widera-2021 European Twinning project
TERAIS G.A. n. 101079338. Code available:
https://github.com/kocurvik/robust_self_calibration and
https://doi.org/10.5281/zenodo.14584742
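A hedged sketch, not the paper's algorithm, of prior-regularized focal-length estimation from a fundamental matrix F: it exploits the fact that a valid essential matrix E = K2^T F K1 has two equal non-zero singular values and softly penalizes deviation from prior focal lengths. The parametrization and weights are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(params, F, prior_f1, prior_f2, pp1, pp2, w_prior=0.05):
    f1, f2 = params
    K1 = np.array([[f1, 0, pp1[0]], [0, f1, pp1[1]], [0, 0, 1.0]])
    K2 = np.array([[f2, 0, pp2[0]], [0, f2, pp2[1]], [0, 0, 1.0]])
    E = K2.T @ F @ K1
    s = np.linalg.svd(E, compute_uv=False)
    # Essential-matrix constraint: the two largest singular values are equal.
    r_sv = (s[0] - s[1]) / (s[0] + s[1] + 1e-12)
    return [r_sv,
            w_prior * (f1 - prior_f1) / prior_f1,
            w_prior * (f2 - prior_f2) / prior_f2]

def estimate_focals(F, prior_f1, prior_f2, pp1, pp2):
    sol = least_squares(residuals, x0=[prior_f1, prior_f2],
                        args=(F, prior_f1, prior_f2, pp1, pp2))
    return sol.x  # refined (f1, f2)
```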
♻ ☆ Cross-modal Causal Intervention for Alzheimer's Disease Prediction
Mild Cognitive Impairment (MCI) serves as a prodromal stage of Alzheimer's
Disease (AD), where early identification and intervention can effectively slow
the progression to dementia. However, diagnosing AD remains a significant
challenge in neurology due to the confounders caused mainly by the selection
bias of multi-modal data and the complex relationships between variables. To
address these issues, we propose a novel visual-language causality-inspired
framework named Cross-modal Causal Intervention with Mediator for Alzheimer's
Disease Diagnosis (MediAD) for diagnostic assistance. Our MediAD employs Large
Language Models (LLMs) to summarize clinical data under strict templates,
therefore enriching textual inputs. The MediAD model utilizes Magnetic
Resonance Imaging (MRI), clinical data, and textual data enriched by LLMs to
classify participants into Cognitively Normal (CN), MCI, and AD categories.
Because of the presence of confounders, such as cerebral vascular lesions and
age-related biomarkers, non-causal models are likely to capture spurious
input-output correlations, generating less reliable results. Our framework
implicitly mitigates the effect of both observable and unobservable confounders
through a unified causal intervention method. Experimental results demonstrate
the outstanding performance of our method in distinguishing CN/MCI/AD cases,
outperforming other methods in most evaluation metrics. The study showcases the
potential of integrating causal reasoning with multi-modal learning for
neurological disease diagnosis.
♻ ☆ Three-view Focal Length Recovery From Homographies
Yaqing Ding, Viktor Kocur, Zuzana Berger Haladová, Qianliang Wu, Shen Cai, Jian Yang, Zuzana Kukelova
In this paper, we propose a novel approach for recovering focal lengths from
three-view homographies. By examining the consistency of normal vectors between
two homographies, we derive new explicit constraints between the focal lengths
and homographies using an elimination technique. We demonstrate that three-view
homographies provide two additional constraints, enabling the recovery of one
or two focal lengths. We discuss four possible cases: all three cameras share
an unknown focal length; the three cameras have two different unknown focal
lengths; one focal length is known and the other two cameras share an unknown
focal length; and one focal length is known while the other two cameras have
different unknown focal lengths. All the
problems can be converted into solving polynomials in one or two unknowns,
which can be efficiently solved using Sturm sequence or hidden variable
technique. Evaluation using both synthetic and real data shows that the
proposed solvers are both faster and more accurate than methods relying on
existing two-view solvers. The code and data are available on
https://github.com/kocurvik/hf
comment: Code available at https://github.com/kocurvik/hf or
https://doi.org/10.5281/zenodo.14672713 Data available at:
https://doi.org/10.5281/zenodo.14638904
♻ ☆ Source-Only Cross-Weather LiDAR via Geometry-Aware Point Drop
LiDAR semantic segmentation degrades in adverse weather because refraction,
scattering, and point dropouts corrupt geometry. Prior work in weather
simulation, mixing-based augmentation, domain randomization, and uncertainty or
boundary regularization improves robustness but still overlooks structural
vulnerabilities near boundaries, corners, and sparse regions. We present a
Light Geometry-aware adapter. The module aligns the azimuth and applies
horizontal circular padding to preserve neighbor continuity across the 0-360
degree wrap-around boundary. A local-window K-Nearest Neighbors search gathers
nearby points and computes simple local statistics, which are compressed into compact
geometry-aware cues. During training, these cues drive region-aware
regularization that stabilizes predictions in structurally fragile areas. The
adapter is plug and play, complements augmentation, and can be enabled only
during training with negligible inference cost. We adopt a source-only
cross-weather setup where models train on SemanticKITTI and are evaluated on
SemanticSTF without target labels or fine-tuning. The adapter improves mIoU by
7.9 percentage points over the data-centric augmentation baseline and by 0.6
points over the class-centric regularization baseline. These results indicate
that geometry-driven regularization is a key direction for all-weather LiDAR
segmentation.
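A minimal sketch of the horizontal circular padding idea: on a range-image representation with azimuth along the width axis, pad the width circularly so convolutions see continuous neighbors across the 0-360 degree wrap-around. The layer sizes are illustrative assumptions, not the paper's adapter.

```python
import torch
import torch.nn.functional as F

def azimuth_circular_conv(x, conv):
    """x: (B, C, H, W) range image; conv: nn.Conv2d constructed with padding=0."""
    k_w = conv.kernel_size[1] // 2
    # Circular padding only along the azimuth (width) axis.
    x = F.pad(x, (k_w, k_w, 0, 0), mode="circular")
    k_h = conv.kernel_size[0] // 2
    x = F.pad(x, (0, 0, k_h, k_h))            # ordinary zero padding in height
    return conv(x)

conv = torch.nn.Conv2d(5, 16, kernel_size=3, padding=0)
feat = azimuth_circular_conv(torch.randn(1, 5, 64, 2048), conv)  # (1, 16, 64, 2048)
```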
♻ ☆ GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs NeurIPS 2025
LLMs have shown impressive capabilities across various natural language
processing tasks, yet remain vulnerable to carefully designed input prompts,
known as jailbreak attacks, that bypass safety guardrails and elicit harmful
responses. Traditional methods rely on manual heuristics but suffer from
limited generalizability. Despite being automatic, optimization-based attacks
often produce unnatural prompts that can be easily detected by safety filters
or require high computational costs due to discrete token optimization. In this
paper, we introduce Generative Adversarial Suffix Prompter (GASP), a novel
automated framework that can efficiently generate human-readable jailbreak
prompts in a fully black-box setting. In particular, GASP leverages latent
Bayesian optimization to craft adversarial suffixes by efficiently exploring
continuous latent embedding spaces, gradually optimizing the suffix prompter to
improve attack efficacy while balancing prompt coherence via a targeted
iterative refinement procedure. Through comprehensive experiments, we show that
GASP can produce natural adversarial prompts, significantly improving jailbreak
success over baselines, reducing training times, and accelerating inference
speed, thus making it an efficient and scalable solution for red-teaming LLMs.
comment: Accepted to NeurIPS 2025. Project page and demos:
https://air-ml.org/project/gasp/
♻ ☆ MIND: Material Interface Generation from UDFs for Non-Manifold Surface Reconstruction NIPS 2025
Unsigned distance fields (UDFs) are widely used in 3D deep learning due to
their ability to represent shapes with arbitrary topology. While prior work has
largely focused on learning UDFs from point clouds or multi-view images,
extracting meshes from UDFs remains challenging, as the learned fields rarely
attain exact zero distances. A common workaround is to reconstruct signed
distance fields (SDFs) locally from UDFs to enable surface extraction via
Marching Cubes. However, this often introduces topological artifacts such as
holes or spurious components. Moreover, local SDFs are inherently incapable of
representing non-manifold geometry, leading to complete failure in such cases.
To address this gap, we propose MIND (Material Interface from Non-manifold
Distance fields), a novel algorithm for generating material interfaces directly
from UDFs, enabling non-manifold mesh extraction from a global perspective. The
core of our method lies in deriving a meaningful spatial partitioning from the
UDF, where the target surface emerges as the interface between distinct
regions. We begin by computing a two-signed local field to distinguish the two
sides of manifold patches, and then extend this to a multi-labeled global field
capable of separating all sides of a non-manifold structure. By combining this
multi-labeled field with the input UDF, we construct material interfaces that
support non-manifold mesh extraction via a multi-labeled Marching Cubes
algorithm. Extensive experiments on UDFs generated from diverse data sources,
including point cloud reconstruction, multi-view reconstruction, and medial
axis transforms, demonstrate that our approach robustly handles complex
non-manifold surfaces and significantly outperforms existing methods. The
source code is available at https://github.com/jjjkkyz/MIND.
comment: NIPS 2025
♻ ☆ WaveGuard: Robust Deepfake Detection and Source Tracing via Dual-Tree Complex Wavelet and Graph Neural Networks
Deepfake technology poses increasing risks such as privacy invasion and
identity theft. To address these threats, we propose WaveGuard, a proactive
watermarking framework that enhances robustness and imperceptibility via
frequency-domain embedding and graph-based structural consistency.
Specifically, we embed watermarks into high-frequency sub-bands using Dual-Tree
Complex Wavelet Transform (DT-CWT) and employ a Structural Consistency Graph
Neural Network (SC-GNN) to preserve visual quality. We also design an attention
module to refine embedding precision. Experimental results on face swap and
reenactment tasks demonstrate that WaveGuard outperforms state-of-the-art
methods in both robustness and visual quality. Code is available at
https://github.com/vpsg-research/WaveGuard.
comment: 14 pages, 6 figures, 7 tables
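A hedged sketch of frequency-domain watermark embedding in DT-CWT high-frequency sub-bands, roughly in the spirit of the abstract; it omits the SC-GNN and attention modules, and the embedding strength and band choice are assumptions. It uses the `dtcwt` Python package.

```python
import numpy as np
import dtcwt

def embed_watermark(image_gray, bits, alpha=0.5, seed=0):
    """image_gray: 2D float array; bits: 1D array of {0, 1} watermark bits."""
    t = dtcwt.Transform2d()
    pyr = t.forward(image_gray, nlevels=3)
    band = pyr.highpasses[0]                      # finest high-frequency sub-band
    rng = np.random.default_rng(seed)
    idx = rng.choice(band.size, size=len(bits), replace=False)
    # Add +/- alpha to randomly chosen coefficients, one per watermark bit.
    band.flat[idx] = band.flat[idx] + alpha * (2.0 * np.asarray(bits) - 1.0)
    return t.inverse(pyr)                         # watermarked image

marked = embed_watermark(np.random.rand(256, 256),
                         bits=np.random.randint(0, 2, 64))
```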
♻ ☆ Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization
Cong Wang, Zexuan Deng, Zhiwei Jiang, Yafeng Yin, Fei Shen, Zifeng Cheng, Shiping Ge, Shiwei Gan, Qing Gu
Sign Language Video Generation (SLVG) seeks to generate identity-preserving
sign language videos from spoken language texts. Existing methods primarily
rely on a single coarse condition (e.g., skeleton sequences) as the
intermediary to bridge the translation model and the video generation model,
which limits both the naturalness and expressiveness of the generated videos.
To overcome these limitations, we propose SignViP, a novel SLVG framework that
incorporates multiple fine-grained conditions for improved generation fidelity.
Rather than directly translating error-prone high-dimensional conditions,
SignViP adopts a discrete tokenization paradigm to integrate and represent
fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP
contains three core components. (1) Sign Video Diffusion Model is jointly
trained with a multi-condition encoder to learn continuous embeddings that
encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization
(FSQ) Autoencoder is further trained to compress and quantize these embeddings
into discrete tokens for compact representation of the conditions. (3)
Multi-Condition Token Translator is trained to translate spoken language text
to discrete multi-condition tokens. During inference, Multi-Condition Token
Translator first translates the spoken language text into discrete
multi-condition tokens. These tokens are then decoded to continuous embeddings
by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion
Model to guide video generation. Experimental results show that SignViP
achieves state-of-the-art performance across metrics, including video quality,
temporal coherence, and semantic fidelity. The code is available at
https://github.com/umnooob/signvip/.
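A minimal sketch of Finite Scalar Quantization (FSQ), the bottleneck the abstract uses to turn continuous condition embeddings into discrete tokens. The level configuration and the straight-through rounding follow the generic FSQ recipe and are not claimed to match the paper's exact settings.

```python
import torch

def fsq(z, levels=(8, 8, 8, 5, 5, 5)):
    """z: (..., len(levels)) continuous embedding -> quantized values + token ids."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    # Bound each dimension to [-(L-1)/2, (L-1)/2] and round to the nearest level.
    z_bounded = (L - 1) / 2 * torch.tanh(z)
    z_q = torch.round(z_bounded)
    z_q = z_bounded + (z_q - z_bounded).detach()      # straight-through estimator
    # Map per-dimension level indices to a single integer token id.
    digits = (z_q + (L - 1) / 2).long()
    bases = torch.cumprod(torch.cat([torch.ones(1, dtype=torch.long, device=z.device),
                                     torch.tensor(levels[:-1], device=z.device)]), dim=0)
    token_id = (digits * bases).sum(dim=-1)
    return z_q, token_id

z_q, ids = fsq(torch.randn(2, 16, 6))   # 16 tokens per sample, codebook size 8*8*8*5*5*5
```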
♻ ☆ Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks
Residual connections are pivotal for deep neural networks, enabling greater
depth by mitigating vanishing gradients. However, in standard residual updates,
the module's output is directly added to the input stream. This can lead to
updates that predominantly reinforce or modulate the existing stream direction,
potentially underutilizing the module's capacity for learning entirely novel
features. In this work, we introduce Orthogonal Residual Update: we decompose
the module's output relative to the input stream and add only the component
orthogonal to this stream. This design aims to guide modules to contribute
primarily new representational directions, fostering richer feature learning
while promoting more efficient training. We demonstrate that our orthogonal
update strategy improves generalization accuracy and training stability across
diverse architectures (ResNetV2, Vision Transformers) and datasets (CIFARs,
TinyImageNet, ImageNet-1k), achieving, for instance, a +3.78 pp top-1 accuracy
gain for ViT-B on ImageNet-1k.
comment: 27 pages, maybe final version
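A hedged sketch of the orthogonal residual update described above: the module output is decomposed relative to the incoming stream and only the component orthogonal to the stream is added. Flattening over non-batch dimensions before projecting is an assumption made for illustration.

```python
import torch

def orthogonal_residual_update(x, module_out, eps=1e-6):
    """x, module_out: (B, ...) tensors of the same shape."""
    v = module_out.flatten(1)
    u = x.flatten(1)
    # Remove the component of v parallel to the stream direction u.
    coef = (v * u).sum(dim=1, keepdim=True) / (u.pow(2).sum(dim=1, keepdim=True) + eps)
    v_orth = v - coef * u
    return x + v_orth.view_as(x)

y = orthogonal_residual_update(torch.randn(4, 128), torch.randn(4, 128))
```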
♻ ☆ Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream ICML25
When trained on large-scale object classification datasets, certain
artificial neural network models begin to approximate core object recognition
behaviors and neural response patterns in the primate brain. While recent
machine learning advances suggest that scaling compute, model size, and dataset
size improves task performance, the impact of scaling on brain alignment
remains unclear. In this study, we explore scaling laws for modeling the
primate visual ventral stream by systematically evaluating over 600 models
trained under controlled conditions on benchmarks spanning V1, V2, V4, IT and
behavior. We find that while behavioral alignment continues to scale with
larger models, neural alignment saturates. This observation remains true across
model architectures and training datasets, even though models with stronger
inductive biases and datasets with higher-quality images are more
compute-efficient. Increased scaling is especially beneficial for higher-level
visual areas, where small models trained on only a few samples exhibit poor
alignment. Our results suggest that while scaling current architectures and
datasets might suffice for alignment with human core object recognition
behavior, it will not yield improved models of the brain's visual ventral
stream, highlighting the need for novel strategies in building brain models.
comment: Published at ICML25 as a spotlight paper - 9 pages for the main
paper, 22 pages in total. 7 main figures and 7 supplementary figures. Code,
model weights, and benchmark results can be accessed at
https://github.com/epflneuroailab/scaling-primate-vvs
♻ ☆ Toward Clinically Grounded Foundation Models in Pathology
In non-medical domains, foundation models (FMs) have revolutionized computer
vision and language processing through large-scale self-supervised and
multimodal learning. Consequently, their rapid adoption in computational
pathology was expected to deliver comparable breakthroughs in cancer diagnosis,
prognostication, and multimodal retrieval. However, recent systematic
evaluations reveal fundamental weaknesses: low diagnostic accuracy, poor
robustness, geometric instability, heavy computational demands, and concerning
safety vulnerabilities. This short paper examines these shortcomings and argues
that they stem from deeper conceptual mismatches between the assumptions
underlying generic foundation modeling in mainstream AI and the intrinsic
complexity of human tissue. Seven interrelated causes are identified:
biological complexity, ineffective self-supervision, overgeneralization,
excessive architectural complexity, lack of domain-specific innovation,
insufficient data, and a fundamental design flaw related to tissue patch size.
These findings suggest that current pathology foundation models remain
conceptually misaligned with the nature of tissue morphology and call for a
fundamental rethinking of the paradigm itself.
♻ ☆ Towards Efficient and Accurate Spiking Neural Networks via Adaptive Bit Allocation
Multi-bit spiking neural networks (SNNs) have recently become a heated research
topic in the pursuit of energy-efficient and highly accurate AI. However, with
more bits involved, the associated memory and computation demands escalate to
the point where the performance improvements become disproportionate. Based on
the insight that different layers differ in importance and that extra bits can
be wasted or even interfering, this paper presents an adaptive bit
allocation strategy for direct-trained SNNs, achieving fine-grained layer-wise
allocation of memory and computation resources. Thus, SNN's efficiency and
accuracy can be improved. Specifically, we parametrize the temporal lengths and
the bit widths of weights and spikes, and make them learnable and controllable
through gradients. To address the challenges caused by changeable bit widths
and temporal lengths, we propose the refined spiking neuron, which can handle
different temporal lengths, enable the derivation of gradients for temporal
lengths, and suit spike quantization better. In addition, we theoretically
formulate the step-size mismatch problem of learnable bit widths, which may
incur severe quantization errors to SNN, and accordingly propose the step-size
renewal mechanism to alleviate this issue. Experiments on various datasets,
including the static CIFAR and ImageNet datasets and the dynamic CIFAR-DVS,
DVS-GESTURE, and SHD datasets, demonstrate that our methods can reduce the
overall memory and computation cost while achieving higher accuracy.
Particularly, our SEWResNet-34 can achieve a 2.69% accuracy gain and 4.16x
lower bit budgets over the advanced baseline work on ImageNet. This work will
be open-sourced.
♻ ☆ Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation
Synthetic video generation is progressing very rapidly. The latest models can
produce very realistic high-resolution videos that are virtually
indistinguishable from real ones. Although several video forensic detectors
have been recently proposed, they often exhibit poor generalization, which
limits their applicability in a real-world scenario. Our key insight to
overcome this issue is to guide the detector towards *seeing what really
matters*. In fact, a well-designed forensic classifier should focus on
identifying intrinsic low-level artifacts introduced by a generative
architecture rather than relying on high-level semantic flaws that characterize
a specific model. In this work, first, we study different generative
architectures, searching and identifying discriminative features that are
unbiased, robust to impairments, and shared across models. Then, we introduce a
novel forensic-oriented data augmentation strategy based on the wavelet
decomposition and replace specific frequency-related bands to drive the model
to exploit more relevant forensic cues. Our novel training paradigm improves
the generalizability of AI-generated video detectors, without the need for
complex algorithms and large datasets that include multiple synthetic
generators. To evaluate our approach, we train the detector using data from a
single generative model and test it against videos produced by a wide range of
other models. Despite its simplicity, our method achieves a significant
accuracy improvement over state-of-the-art detectors and obtains excellent
results even on very recent generative models, such as NOVA and FLUX.
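A hedged sketch of wavelet-based forensic augmentation in the spirit of the abstract: decompose two images, swap a chosen frequency band, and reconstruct. Which band is replaced (here the low-pass approximation) and the wavelet family are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np
import pywt

def swap_wavelet_band(img_a, img_b, wavelet="haar", level=1):
    """img_a, img_b: 2D float arrays of equal shape."""
    coeffs_a = pywt.wavedec2(img_a, wavelet, level=level)
    coeffs_b = pywt.wavedec2(img_b, wavelet, level=level)
    # Keep img_a's high-frequency details (where generator artifacts tend to live)
    # but take the low-frequency content from img_b.
    coeffs_a[0] = coeffs_b[0]
    return pywt.waverec2(coeffs_a, wavelet)

aug = swap_wavelet_band(np.random.rand(256, 256), np.random.rand(256, 256))
```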
♻ ☆ RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability NeurIPS 2025
Recent advancements in multimodal models have significantly improved
vision-language (VL) alignment in radiology. However, existing approaches
struggle to effectively utilize complex radiology reports for learning and
offer limited interpretability through attention probability visualizations. To
address these challenges, we introduce RadZero, a novel framework for VL
alignment in chest X-ray with zero-shot multi-task capability. A key component
of our approach is VL-CABS (Vision-Language Cross-Attention Based on
Similarity), which aligns text embeddings with
local image features for interpretable, fine-grained VL reasoning. RadZero
leverages large language models to extract concise semantic sentences from
radiology reports and employs multi-positive contrastive training to
effectively capture relationships between images and multiple relevant textual
descriptions. It uses a pre-trained vision encoder with additional trainable
Transformer layers, allowing efficient high-resolution image processing. By
computing similarity between text embeddings and local image patch features,
VL-CABS enables zero-shot inference with similarity probability for
classification, and pixel-level VL similarity maps for grounding and
segmentation. Experimental results on public chest radiograph benchmarks show
that RadZero outperforms state-of-the-art methods in zero-shot classification,
grounding, and segmentation. Furthermore, VL similarity map analysis highlights
the potential of VL-CABS for improving explainability in VL alignment.
Additionally, qualitative evaluation demonstrates RadZero's capability for
open-vocabulary semantic segmentation, further validating its effectiveness in
medical imaging. Code is available at
https://github.com/deepnoid-ai/RadZero.
comment: NeurIPS 2025
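A minimal sketch of similarity-based cross-attention as described for VL-CABS: cosine similarity between a text embedding and local image patch embeddings yields a pixel-level similarity map and a zero-shot score. Shapes, the temperature, and the pooling into a score are assumptions.

```python
import torch
import torch.nn.functional as F

def vl_similarity(patch_feats, text_feat, grid_hw, temperature=0.07):
    """patch_feats: (B, N, D); text_feat: (D,); grid_hw: (H, W) with H * W == N."""
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    sim = (p @ t) / temperature                 # (B, N) patch-text similarities
    sim_map = sim.view(-1, 1, *grid_hw)         # (B, 1, H, W) map for grounding
    score = torch.sigmoid(sim.mean(dim=1))      # illustrative zero-shot score
    return sim_map, score

sim_map, score = vl_similarity(torch.randn(2, 196, 512), torch.randn(512), (14, 14))
```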
♻ ☆ Residual Diffusion Bridge Model for Image Restoration
Diffusion bridge models establish probabilistic paths between arbitrary
paired distributions and exhibit great potential for universal image
restoration. Most existing methods merely treat them as simple variants of
stochastic interpolants, lacking a unified analytical perspective. Besides,
they indiscriminately reconstruct images through global noise injection and
removal, inevitably distorting undegraded regions due to imperfect
reconstruction. To address these challenges, we propose the Residual Diffusion
Bridge Model (RDBM). Specifically, we theoretically reformulate the stochastic
differential equations of generalized diffusion bridge and derive the
analytical formulas of its forward and reverse processes. Crucially, we
leverage the residuals from given distributions to modulate the noise injection
and removal, enabling adaptive restoration of degraded regions while leaving
undegraded regions intact. Moreover, we unravel the fundamental mathematical
essence of existing bridge models, all of which are special cases of RDBM, and
empirically demonstrate the optimality of our proposed models. Extensive experiments are
conducted to demonstrate the state-of-the-art performance of our method both
qualitatively and quantitatively across diverse image restoration tasks. Code
is publicly available at https://github.com/MiliLab/RDBM.
♻ ☆ Pseudo-Stereo Inputs: A Solution to the Occlusion Challenge in Self-Supervised Stereo Matching
Self-supervised stereo matching holds great promise by eliminating the
reliance on expensive ground-truth data. Its dominant paradigm, based on
photometric consistency, is however fundamentally hindered by the occlusion
challenge -- an issue that persists regardless of network architecture. The
essential insight is that for any occluder, valid feedback signals can only be
derived from the unoccluded areas on one side of the occluder. Existing methods
attempt to address this by focusing on the erroneous feedback from the other
side, either by identifying and removing it, or by introducing additional
regularities for correction on that basis. Nevertheless, these approaches have
failed to provide a complete solution. This work proposes a more fundamental
solution. The core idea is to transform the fixed state of one-sided valid and
one-sided erroneous signals into a probabilistic acquisition of valid feedback
from both sides of an occluder. This is achieved through a complete framework,
centered on a pseudo-stereo inputs strategy that decouples the input and
feedback, without introducing any additional constraints. Qualitative results
visually demonstrate that the occlusion problem is resolved, manifested by
fully symmetrical and identical performance on both flanks of occluding
objects. Quantitative experiments thoroughly validate the significant
performance improvements resulting from solving the occlusion challenge.
♻ ☆ RealDPO: Real or Not Real, that is the Preference
Video generative models have recently achieved notable advancements in
synthesis quality. However, generating complex motions remains a critical
challenge, as existing models often struggle to produce natural, smooth, and
contextually consistent movements. This gap between generated and real-world
motions limits their practical applicability. To address this issue, we
introduce RealDPO, a novel alignment paradigm that leverages real-world data as
positive samples for preference learning, enabling more accurate motion
synthesis. Unlike traditional supervised fine-tuning (SFT), which offers
limited corrective feedback, RealDPO employs Direct Preference Optimization
(DPO) with a tailored loss function to enhance motion realism. By contrasting
real-world videos with erroneous model outputs, RealDPO enables iterative
self-correction, progressively refining motion quality. To support
post-training in complex motion synthesis, we propose RealAction-5K, a curated
dataset of high-quality videos capturing human daily activities with rich and
precise motion details. Extensive experiments demonstrate that RealDPO
significantly improves video quality, text alignment, and motion realism
compared to state-of-the-art models and existing preference optimization
techniques.
comment: Code: https://github.com/Vchitect/RealDPO Project Page:
https://vchitect.github.io/RealDPO-Project/
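A hedged sketch of the preference-learning signal behind this kind of alignment: the standard DPO objective with a real-world video treated as the preferred sample and a model rollout as the dispreferred one. The paper's tailored loss for video diffusion models is not reproduced; the log-likelihood inputs are placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_real, logp_gen, ref_logp_real, ref_logp_gen, beta=0.1):
    """All inputs: (B,) log-likelihoods under the policy / frozen reference model."""
    margin = (logp_real - ref_logp_real) - (logp_gen - ref_logp_gen)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```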
♻ ☆ CFReID: Continual Few-shot Person Re-Identification
Real-world surveillance systems are dynamically evolving, requiring a person
Re-identification model to continuously handle newly incoming data from various
domains. To cope with these dynamics, Lifelong ReID (LReID) has been proposed
to learn and accumulate knowledge across multiple domains incrementally.
However, LReID models need to be trained on large-scale labeled data for each
unseen domain, which are typically inaccessible due to privacy and cost
concerns. In this paper, we propose a new paradigm called Continual Few-shot
ReID (CFReID), which requires models to be incrementally trained using few-shot
data and tested on all seen domains. Under few-shot conditions, CFReID faces
two core challenges: 1) learning knowledge from few-shot data of unseen domain,
and 2) avoiding catastrophic forgetting of seen domains. To tackle these two
challenges, we propose a Stable Distribution Alignment (SDA) framework from
feature distribution perspective. Specifically, our SDA is composed of two
modules, i.e., Meta Distribution Alignment (MDA) and Prototype-based Few-shot
Adaptation (PFA). To support the study of CFReID, we establish an evaluation
benchmark for CFReID on five publicly available ReID datasets. Extensive
experiments demonstrate that our SDA can enhance the few-shot learning and
anti-forgetting capabilities under few-shot conditions. Notably, our approach,
using only 5% of the data, i.e., 32 IDs, significantly outperforms LReID's
state-of-the-art performance, which requires 700 to 1,000 IDs.
comment: This manuscript has been withdrawn due to significant restructuring
of its contents. The extended sections are being developed into a standalone
paper
♻ ☆ MCTED: A Machine-Learning-Ready Dataset for Digital Elevation Model Generation From Mars Imagery
This work presents MCTED, a new machine-learning-ready dataset for the Martian
digital elevation model prediction task. The
dataset has been generated using a comprehensive pipeline designed to process
high-resolution Mars orthoimage and DEM pairs from Day et al., yielding a
dataset consisting of 80,898 data samples. The source images are data gathered
by the Mars Reconnaissance Orbiter using the CTX instrument, providing a very
diverse and comprehensive coverage of the Martian surface. Given the complexity
of the processing pipelines used in large-scale DEMs, there are often artefacts
and missing data points in the original data, for which we developed tools to
solve or mitigate their impact. We divide the processed samples into training
and validation splits, ensuring samples in both splits cover no mutual areas to
avoid data leakage. Every sample in the dataset is represented by the optical
image patch, DEM patch, and two mask patches, indicating values that were
originally missing or were altered by us. This allows future users of the
dataset to handle altered elevation regions as they please. We provide
statistical insights of the generated dataset, including the spatial
distribution of samples, the distributions of elevation values, slopes and
more. Finally, we train a small U-Net architecture on the MCTED dataset and
compare its performance to a monocular depth estimation foundation model,
DepthAnythingV2, on the task of elevation prediction. We find that even a very
small architecture trained specifically on this dataset beats the zero-shot
performance of a depth estimation foundation model like DepthAnythingV2. We
make the dataset and code used for its generation completely open source in
public repositories.
comment: 22 pages, 21 figures
♻ ☆ BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation
Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution
of video frames, potentially at various scaling factors, which presents several
challenges regarding spatial detail reproduction, temporal consistency, and
computational complexity. In this paper, we propose a strong baseline BasicAVSR
for AVSR by integrating four key components: 1) adaptive multi-scale frequency
priors generated from image Laplacian pyramids, 2) a flow-guided propagation
unit to aggregate spatiotemporal information from adjacent frames, 3) a
second-order motion compensation unit for more accurate spatial alignment of
adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and
content-independent upsampling kernels. To meet diverse application demands, we
instantiate three propagation variants: (i) a unidirectional RNN unit for
strictly online inference, (ii) a unidirectional RNN unit empowered with a
limited lookahead that tolerates a small output delay, and (iii) a
bidirectional RNN unit designed for offline tasks where computational resources
are less constrained. Experimental results demonstrate the effectiveness and
adaptability of our model across these different scenarios. Through extensive
experiments, we show that BasicAVSR significantly outperforms existing methods
in terms of super-resolution quality, generalization ability, and inference
speed. Our work not only advances the state-of-the-art in AVSR but also extends
its core components to multiple frameworks for diverse scenarios. The code is
available at https://github.com/shangwei5/BasicAVSR.
comment: 13 pages, 10 figures, 5 tables
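A minimal sketch of the image Laplacian pyramid used as multi-scale frequency priors in the abstract. The number of levels and the downsampling operator are assumptions; the paper's adaptive prior generation is not reproduced.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(x, levels=3):
    """x: (B, C, H, W). Returns band-pass levels plus the final low-pass residual."""
    pyramid = []
    cur = x
    for _ in range(levels):
        down = F.avg_pool2d(cur, kernel_size=2)
        up = F.interpolate(down, size=cur.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyramid.append(cur - up)       # band-pass detail at this scale
        cur = down
    pyramid.append(cur)                # coarsest low-frequency residual
    return pyramid

bands = laplacian_pyramid(torch.randn(1, 3, 128, 128))   # 3 detail levels + low-pass
```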
♻ ☆ DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution NeurIPS 2025
Diffusion models have demonstrated promising performance in real-world video
super-resolution (VSR). However, the dozens of sampling steps they require make
inference extremely slow. Sampling acceleration techniques, particularly
single-step sampling, provide a potential solution. Nonetheless, achieving one
step in VSR remains challenging due to the high training overhead on video data and
stringent fidelity demands. To tackle the above issues, we propose DOVE, an
efficient one-step diffusion model for real-world VSR. DOVE is obtained by
fine-tuning a pretrained video diffusion model (i.e., CogVideoX). To
effectively train DOVE, we introduce the latent-pixel training strategy. The
strategy employs a two-stage scheme to gradually adapt the model to the video
super-resolution task. Meanwhile, we design a video processing pipeline to
construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning
on this dataset further enhances the restoration capability of DOVE. Extensive
experiments show that DOVE exhibits comparable or superior performance to
multi-step diffusion-based VSR methods. It also offers outstanding inference
efficiency, achieving up to a 28$\times$ speed-up over existing methods such as
MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE.
comment: Accepted to NeurIPS 2025. Code is available at:
https://github.com/zhengchen1999/DOVE
♻ ☆ Learning to Navigate Socially Through Proactive Risk Perception
Erjia Xiao, Lingfeng Zhang, Yingbo Tang, Hao Cheng, Renjing Xu, Wenbo Ding, Lei Zhou, Long Chen, Hangjun Ye, Xiaoshuai Hao
In this report, we describe the technical details of our submission to the
IROS 2025 RoboSense Challenge Social Navigation Track. This track focuses on
developing RGBD-based perception and navigation systems that enable autonomous
agents to navigate safely, efficiently, and socially compliantly in dynamic
human-populated indoor environments. The challenge requires agents to operate
from an egocentric perspective using only onboard sensors including RGB-D
observations and odometry, without access to global maps or privileged
information, while maintaining social norm compliance such as safe distances
and collision avoidance. Building upon the Falcon model, we introduce a
Proactive Risk Perception Module to enhance social navigation performance. Our
approach augments Falcon with a collision-risk understanding component that
learns to predict distance-based collision risk scores for surrounding humans, which
enables the agent to develop more robust spatial awareness and proactive
collision avoidance behaviors. The evaluation on the Social-HM3D benchmark
demonstrates that our method improves the agent's ability to maintain personal
space compliance while navigating toward goals in crowded indoor scenes with
dynamic human agents, achieving 2nd place among 16 participating teams in the
challenge.
♻ ☆ MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness NeurIPS 2025
Yolo Yunlong Tang, Pinxin Liu, Zhangyun Tan, Mingqian Feng, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu
Understanding perspective is fundamental to human visual perception, yet the
extent to which multimodal large language models (MLLMs) internalize
perspective geometry remains unclear. We introduce MMPerspective, the first
benchmark specifically designed to systematically evaluate MLLMs' understanding
of perspective through 10 carefully crafted tasks across three complementary
dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark
comprises 2,711 real-world and synthetic image instances with 5,083
question-answer pairs that probe key capabilities, such as vanishing point
perception and counting, perspective type reasoning, line relationship
understanding in 3D space, invariance to perspective-preserving
transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art
MLLMs, we uncover significant limitations: while models demonstrate competence
on surface-level perceptual tasks, they struggle with compositional reasoning
and maintaining spatial consistency under perturbations. Our analysis further
reveals intriguing patterns between model architecture, scale, and perspective
capabilities, highlighting both robustness bottlenecks and the benefits of
chain-of-thought prompting. MMPerspective establishes a valuable testbed for
diagnosing and advancing spatial understanding in vision-language systems.
Resources available at: https://yunlong10.github.io/MMPerspective/
comment: Accepted to NeurIPS 2025 DB Track. Rating: 5,5,5,5
♻ ☆ FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models NeurIPS 2025
Multimodal large language models (MLLMs) face an inherent trade-off between
faithfulness and creativity, as different tasks require varying degrees of
associative reasoning. However, existing methods lack the flexibility to
modulate this reasoning strength, limiting MLLMs' adaptability across factual
and creative scenarios. To bridge this gap, we propose equipping MLLMs with
mechanisms that enable flexible control over associative reasoning. We begin by
investigating the internal mechanisms underlying associative behavior in MLLMs
and find that: (1) middle layers play a pivotal role in shaping model's
associative tendencies, (2) modifying representations in these layers
effectively regulates associative reasoning strength, and (3) hallucinations
can be exploited to derive steering vectors that guide this modulation.
Building on these findings, we introduce Flexible Association Control (FlexAC),
a lightweight and training-free framework for modulating associative behavior
in MLLMs. FlexAC first induces hallucination-guided intermediate
representations to encode associative directions. Then, it selects
high-association instances to construct effective associative steering vectors,
whose strengths are adaptively calibrated to balance creative guidance with
output stability. Finally, recognizing the multi-dimensional nature of
associative reasoning, FlexAC incorporates task-specific associative vectors
derived from a forward pass on a few target-domain samples, enabling models to
follow diverse associative directions and better adapt to creative tasks.
Notably, our method achieves up to a 5.8x improvement in creativity on
Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing
existing baselines and demonstrating its effectiveness in enabling flexible
control over associative reasoning in MLLMs. Our code is available at
https://github.com/ylhz/FlexAC.
comment: 19 pages, 11 figures. Accepted by the 39th Conference on Neural
Information Processing Systems (NeurIPS 2025)
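A hedged sketch of activation steering in the spirit of FlexAC: derive a direction from hallucination-prone versus ordinary hidden states at a middle layer and add it, scaled, to new hidden states to modulate associative behavior. The layer choice, normalization, and scaling factor are assumptions.

```python
import torch

def build_steering_vector(h_halluc, h_normal):
    """h_*: (N, D) hidden states collected at the same middle layer."""
    v = h_halluc.mean(dim=0) - h_normal.mean(dim=0)
    return v / v.norm()

def apply_steering(hidden, v, strength=1.0):
    """hidden: (B, T, D) layer activations; larger strength -> more associative."""
    return hidden + strength * v

v = build_steering_vector(torch.randn(32, 768), torch.randn(32, 768))
steered = apply_steering(torch.randn(2, 16, 768), v, strength=0.5)
```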
♻ ☆ Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving
Maintaining good driving behavior in out-of-distribution scenarios remains a
critical challenge in autonomous driving. A promising direction is to leverage
the generalist knowledge and reasoning capabilities of large-language models by
treating unusual driving scenarios as a logical reasoning task. In this work,
we present Poutine, a method that uses an off-the-shelf 3B-parameter
vision-language model (VLM) - without any additional components - to achieve
robust end-to-end autonomous driving via a simple and scalable training recipe.
To learn strong base driving capabilities, we first train Poutine-Base using
self-supervised next-token prediction over vision, language, and trajectory
(VLT) tokens, leveraging both nominal and long-tail driving data. In the second
stage, we fine-tune Poutine-Base using Group Relative Policy Optimization
(GRPO) with a small set of human preference-labeled examples. We evaluate our
approach on the Waymo end-to-end driving benchmark curated for long-tail
scenarios. The final Poutine model achieves an RFS of 7.99 on the test set,
placing 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a
significant margin. Our results suggest that handcrafted tokenizers or custom
architectural components added to base VLMs in prior work are not necessary to
achieve strong driving performance. Instead, this work highlights the potential
of scalable VLT pretraining combined with lightweight RL fine-tuning to enable
robust and generalizable autonomous driving.
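A hedged sketch of the group-relative advantage at the heart of GRPO, the RL post-training step the abstract applies: rewards for a group of rollouts from the same prompt are normalized within the group. The surrounding policy-gradient machinery is omitted.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: (G,) scores for G rollouts sampled for one driving scenario."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

adv = group_relative_advantages(torch.tensor([7.2, 7.9, 6.5, 8.1]))
```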
♻ ☆ Two Causally Related Needles in a Video Haystack NeurIPS 2025
Properly evaluating the ability of Video-Language Models (VLMs) to understand
long videos remains a challenge. We propose a long-context video understanding
benchmark, Causal2Needles, that assesses two crucial abilities insufficiently
addressed by existing benchmarks: (1) extracting information from two separate
locations (two needles) in a long video and understanding them jointly, and (2)
modeling the world in terms of cause and effect in human behaviors.
Causal2Needles evaluates these abilities using noncausal one-needle, causal
one-needle, and causal two-needle questions. The most complex question type,
causal two-needle questions, requires extracting information about both the
cause and effect events from a long video and the associated narration text. To
prevent textual bias, we introduce two complementary question formats: locating
the video clip containing the answer, and verbal description of a visual detail
from that video clip. Our experiments reveal that models excelling on existing
benchmarks struggle with causal two-needle questions, and the model performance
is negatively correlated with the distance between the two needles. These
findings highlight critical limitations in current VLMs. The dataset is
available at: https://huggingface.co/datasets/causal2needles/Causal2Needles
comment: Accepted to NeurIPS 2025 D&B Track
♻ ☆ Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models NeurIPS 2025
The widespread use of AI-generated content from diffusion models has raised
significant concerns regarding misinformation and copyright infringement.
Watermarking is a crucial technique for identifying these AI-generated images
and preventing their misuse. In this paper, we introduce Shallow Diffuse, a new
watermarking technique that embeds robust and invisible watermarks into
diffusion model outputs. Unlike existing approaches that integrate watermarking
throughout the entire diffusion sampling process, Shallow Diffuse decouples
these steps by leveraging the presence of a low-dimensional subspace in the
image generation process. This method ensures that a substantial portion of the
watermark lies in the null space of this subspace, effectively separating it
from the image generation process. Our theoretical and empirical analyses show
that this decoupling strategy greatly enhances the consistency of data
generation and the detectability of the watermark. Extensive experiments
further validate that our Shallow Diffuse outperforms existing watermarking
methods in terms of robustness and consistency. The codes are released at
https://github.com/liwd190019/Shallow-Diffuse.
comment: NeurIPS 2025 Spotlight
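A hedged sketch of the subspace idea behind this watermarking scheme: given an (assumed) orthonormal basis U of the low-dimensional subspace that drives image generation, project the watermark onto its null space so the embedded signal barely perturbs generation. How the subspace is obtained in the paper is not reproduced here.

```python
import torch

def project_to_null_space(watermark, U):
    """watermark: (D,); U: (D, k) orthonormal columns spanning the image subspace."""
    return watermark - U @ (U.T @ watermark)

D, k = 4096, 64
U, _ = torch.linalg.qr(torch.randn(D, k))        # stand-in orthonormal basis
w_null = project_to_null_space(torch.randn(D), U)
# The component of w_null inside the subspace is numerically negligible:
assert (U.T @ w_null).abs().max() < 1e-3
```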
♻ ☆ EMHI: A Multimodal Egocentric Human Motion Dataset with HMD and Body-Worn IMUs
Egocentric human pose estimation (HPE) using wearable sensors is essential
for VR/AR applications. Most methods rely solely on either egocentric-view
images or sparse Inertial Measurement Unit (IMU) signals, leading to
inaccuracies due to self-occlusion in images or the sparseness and drift of
inertial sensors. Most importantly, the lack of real-world datasets containing
both modalities is a major obstacle to progress in this field. To overcome the
barrier, we propose EMHI, a multimodal Egocentric human Motion dataset with
Head-Mounted Display (HMD) and body-worn IMUs, with all data collected using a
real VR product suite.
Specifically, EMHI provides synchronized stereo images from downward-sloping
cameras on the headset and IMU data from body-worn sensors, along with pose
annotations in SMPL format. This dataset consists of 885 sequences captured by
58 subjects performing 39 actions, totaling about 28.5 hours of recording. We
evaluate the annotations by comparing them with optical marker-based SMPL
fitting results. To substantiate the reliability of our dataset, we introduce
MEPoser, a new baseline method for multimodal egocentric HPE, which employs a
multimodal fusion encoder, temporal feature encoder, and MLP-based regression
heads. The experiments on EMHI show that MEPoser outperforms existing
single-modal methods and demonstrates the value of our dataset in solving the
problem of egocentric HPE. We believe the release of EMHI and the method could
advance the research of egocentric HPE and expedite the practical
implementation of this technology in VR/AR products.
♻ ☆ EarthGPT-X: A Spatial MLLM for Multi-level Multi-Source Remote Sensing Imagery Understanding with Visual Prompting
Wei Zhang, Miaoxin Cai, Yaqian Ning, Tong Zhang, Yin Zhuang, Shijian Lu, He Chen, Jun Li, Xuerui Mao
Recent advances in natural-domain multi-modal large language models (MLLMs)
have demonstrated effective spatial reasoning through visual and textual
prompting. However, their direct transfer to remote sensing (RS) is hindered by
heterogeneous sensing physics, diverse modalities, and unique spatial scales.
Existing RS MLLMs are mainly limited to optical imagery and plain language
interaction, preventing flexible and scalable real-world applications. In this
article, EarthGPT-X is proposed, the first flexible spatial MLLM that unifies
multi-source RS imagery comprehension and accomplishes both coarse-grained and
fine-grained visual tasks under diverse visual prompts in a single framework.
Distinct from prior models, EarthGPT-X introduces: 1) a dual-prompt mechanism
combining text instructions with various visual prompts (i.e., point, box, and
free-form) to mimic the versatility of referring in everyday human interaction; 2) a comprehensive multi-source, multi-level prompting dataset that advances the model beyond holistic image understanding to hierarchical spatial reasoning, including scene-level understanding as well as fine-grained object attribute and
relational analysis; 3) a cross-domain one-stage fusion training strategy,
enabling efficient and consistent alignment across modalities and tasks.
Extensive experiments demonstrate that EarthGPT-X substantially outperforms
prior natural-domain and RS MLLMs, establishing the first framework capable of
multi-source, multi-task, and multi-level interpretation using visual prompting
in RS scenarios.
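A rough illustration of how a dual-prompt query (text instruction plus a point, box, or free-form visual prompt) might be assembled; the field names and structure below are purely hypothetical and not EarthGPT-X's actual interface.
```python
# Hypothetical sketch of assembling dual-prompt queries; the schema is an
# assumption for illustration, not EarthGPT-X's API.
def make_query(instruction, visual_prompt):
    assert visual_prompt["type"] in {"point", "box", "free_form"}
    return {"text": instruction, "visual_prompt": visual_prompt}

queries = [
    make_query("Describe the land use of the marked area.",
               {"type": "point", "xy": (412, 233)}),
    make_query("What is the object in the box and its orientation?",
               {"type": "box", "xyxy": (100, 150, 260, 310)}),
    make_query("Summarize the objects inside the sketched region.",
               {"type": "free_form", "polygon": [(10, 10), (200, 40), (180, 220)]}),
]
```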
♻ ☆ Caption-Driven Explainability: Probing CNNs for Bias via CLIP IEEE
Robustness has become one of the most critical problems in machine learning
(ML). The science of interpreting ML models to understand their behavior and
improve their robustness is referred to as explainable artificial intelligence
(XAI). One of the state-of-the-art XAI methods for computer vision problems is
to generate saliency maps. A saliency map highlights the pixel space of an
image that excites the ML model the most. However, this property could be
misleading if spurious and salient features are present in overlapping pixel
spaces. In this paper, we propose a caption-based XAI method, which integrates
a standalone model to be explained into the contrastive language-image
pre-training (CLIP) model using a novel network surgery approach. The resulting
caption-based XAI model identifies the dominant concept that contributes the
most to the model's prediction. This explanation reduces the risk of the standalone model being misled by covariate shift and supports the development of more robust ML models. Our code is available at
https://github.com/patch0816/caption-driven-xai
comment: Accepted and presented at the IEEE ICIP 2025 Satellite Workshop
"Generative AI for World Simulations and Communications & Celebrating 40
Years of Excellence in Education: Honoring Prof. Aggelos Katsaggelos",
Anchorage, USA, Sept 14, 2025. Camera-ready preprint; IEEE Xplore version to
follow. Author variant: Amil Dravid. Code:
https://github.com/patch0816/caption-driven-xai
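As a simplified stand-in for the caption-based probing idea, the sketch below ranks a handful of candidate concepts by CLIP similarity for an image; the network-surgery step that splices the standalone model into CLIP is not reproduced here, and the concept list and image path are hypothetical.
```python
# Simplified concept probing via CLIP similarity (not the paper's network
# surgery; a stand-in to show how a dominant concept can be identified).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

concepts = ["a dog", "a snowy background", "a grassy field", "a cat"]  # hypothetical concepts
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image
text = clip.tokenize(concepts).to(device)

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    scores = (img_f @ txt_f.T).squeeze(0)

dominant = concepts[scores.argmax().item()]   # concept contributing most to the match
print(dominant, scores.tolist())
```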
♻ ☆ Revealing the structure-property relationships of copper alloys with FAGC
Cu-Cr-Zr alloys play a crucial role in electronic devices and the electric
power industry, where their electrical conductivity and hardness are of great
importance. However, due to the scarcity of available samples, there has been a
lack of effective studies exploring the relationship between the
microstructural images of Cu-Cr-Zr alloys and their key properties. In this
paper, the FAGC feature augmentation method is employed to enhance the
microstructural images of Cu-Cr-Zr alloys within a feature space known as the
pre-shape space. Pseudo-labels are then constructed to expand the number of training samples, and the augmented features are fed into various machine learning models to build property prediction models for the alloy. Finally, we
validate the impact of different machine learning methods and the number of
augmented features on prediction accuracy through experiments. Experimental
results demonstrate that our method achieves superior performance in predicting
electrical conductivity (\(R^2=0.978\)) and hardness (\(R^2=0.998\)) when using
the decision tree classifier with 100 augmented samples. Further analysis
reveals that regions with reduced image noise, such as fewer grain or phase
boundaries, exhibit higher contributions to electrical conductivity. These
findings highlight the potential of the FAGC method in overcoming the
challenges of limited image data in materials science, offering a powerful tool
for establishing detailed and quantitative relationships between complex
microstructures and material properties.
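A generic scikit-learn sketch of the evaluation pipeline (augment features, fit a decision tree, report R^2); the FAGC pre-shape-space augmentation itself is not reproduced here, so Gaussian-jittered copies stand in for the augmented features, and all data are synthetic.
```python
# Generic augment-then-fit pipeline; jittered copies are a stand-in for FAGC
# features, and the data below are synthetic placeholders.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 16))                       # hypothetical microstructure features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=40)      # hypothetical property (e.g., conductivity)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Stand-in augmentation: jittered copies of the training features with pseudo-labels.
X_aug = np.vstack([X_tr + rng.normal(scale=0.05, size=X_tr.shape) for _ in range(3)])
y_aug = np.tile(y_tr, 3)

model = DecisionTreeRegressor(random_state=0)
model.fit(np.vstack([X_tr, X_aug]), np.concatenate([y_tr, y_aug]))
print("R^2:", r2_score(y_te, model.predict(X_te)))
```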
♻ ☆ Zero-Shot Referring Expression Comprehension via Vision-Language True/False Verification
Referring Expression Comprehension (REC) is usually addressed with
task-trained grounding models. We show that a zero-shot workflow, without any
REC-specific training, can achieve competitive or superior performance. Our
approach reformulates REC as box-wise visual-language verification: given
proposals from a COCO-clean generic detector (YOLO-World), a general-purpose
VLM independently answers True/False queries for each region. This simple
procedure reduces cross-box interference, supports abstention and multiple
matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our
method not only surpasses a zero-shot GroundingDINO baseline but also exceeds
reported results for GroundingDINO trained on REC and GroundingDINO+CRG.
Controlled studies with identical proposals confirm that verification
significantly outperforms selection-based prompting, and results hold with open
VLMs. Overall, we show that workflow design, rather than task-specific
pretraining, drives strong zero-shot REC performance.
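The box-wise verification workflow can be summarized in a few lines; `detect_boxes` and `vlm_true_false` below are hypothetical placeholders for the generic detector (e.g., YOLO-World proposals) and the general-purpose VLM that answers True/False per region.
```python
# Sketch of box-wise True/False verification; `detect_boxes` and
# `vlm_true_false` are hypothetical placeholders, not real library calls.
from typing import List, Tuple

Box = Tuple[int, int, int, int]

def refer(image, expression: str, detect_boxes, vlm_true_false,
          keep_all_matches: bool = False):
    proposals: List[Box] = detect_boxes(image)           # generic detector proposals
    verdicts = []
    for box in proposals:
        question = (f"Does the region {box} show '{expression}'? "
                    f"Answer strictly True or False.")
        verdicts.append((box, vlm_true_false(image, box, question)))
    matches = [b for b, ok in verdicts if ok]
    if not matches:
        return None                                      # abstention: nothing verified as True
    return matches if keep_all_matches else matches[0]   # supports multiple matches
```
Because each box is verified independently, there is no competition between proposals in a single prompt, which is what the abstract means by reduced cross-box interference.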
♻ ☆ Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1, Sentinel-2, and Planetscope for end-users
Geo-Foundational Models (GFMs) enable fast and reliable extraction of
spatiotemporal information from satellite imagery, improving flood inundation
mapping by leveraging location and time embeddings. Despite their potential, it
remains unclear whether GFMs outperform traditional models like U-Net. A
systematic comparison across sensors and data availability scenarios is still
lacking, which is an essential step to guide end-users in model selection. To
address this, we evaluate three GFMs (Prithvi 2.0, Clay V1.5, and DOFA) plus UViT (a Prithvi variant) against TransNorm, U-Net, and Attention U-Net using
PlanetScope, Sentinel-1, and Sentinel-2. We observe competitive performance
among all GFMs, with only 2-5% variation between the best and worst models
across sensors. Clay outperforms others on PlanetScope (0.79 mIoU) and
Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). In
leave-one-region-out cross-validation across five regions, Clay shows slightly
better performance across all sensors (mIoU: 0.72(0.04), 0.66(0.07),
0.51(0.08)) compared to Prithvi (0.70(0.05), 0.64(0.09), 0.49(0.13)) and DOFA
(0.67(0.07), 0.64(0.04), 0.49(0.09)) for PlanetScope, Sentinel-2, and
Sentinel-1, respectively. Across all 19 sites, leave-one-region-out
cross-validation reveals a 4% improvement by Clay compared to U-Net. Visual
inspection highlights Clay's superior ability to retain fine details. Few-shot
experiments show Clay achieves 0.64 mIoU on PlanetScope with just five training
images, outperforming Prithvi (0.24) and DOFA (0.35). In terms of computational
time, Clay is a better choice due to its smaller model size (26M parameters),
making it ~3x faster than Prithvi (650M) and 2x faster than DOFA (410M).
Contrary to previous findings, our results suggest GFMs offer small to moderate
improvements in flood mapping accuracy at lower computational cost and labeling
effort compared to traditional U-Net.
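For reference, the mIoU figures above correspond to the standard mean intersection-over-union over the water and non-water classes of a binary flood mask, as in the sketch below (a textbook definition, not code from the benchmark).
```python
# Standard binary mIoU over flood (1) and non-flood (0) classes.
import numpy as np

def binary_miou(pred: np.ndarray, gt: np.ndarray) -> float:
    ious = []
    for cls in (0, 1):                                   # non-water, water
        p, g = pred == cls, gt == cls
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                                     # class absent in both masks
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 0]])
print(binary_miou(pred, gt))                             # ~0.71
```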
♻ ☆ Evaluating and Improving the Effectiveness of Synthetic Chest X-Rays for Medical Image Analysis
Eva Prakash, Jeya Maria Jose Valanarasu, Zhihong Chen, Eduardo Pontes Reis, Andrew Johnston, Anuj Pareek, Christian Bluethgen, Sergios Gatidis, Cameron Olsen, Akshay Chaudhari, Andrew Ng, Curtis Langlotz
Purpose: To explore best-practice approaches for generating synthetic chest
X-ray images and augmenting medical imaging datasets to optimize the
performance of deep learning models in downstream tasks like classification and
segmentation. Materials and Methods: We utilized a latent diffusion model to
condition the generation of synthetic chest X-rays on text prompts and/or
segmentation masks. We explored methods like using a proxy model and using
radiologist feedback to improve the quality of synthetic data. These synthetic
images were then generated from relevant disease information or geometrically
transformed segmentation masks and added to ground truth training set images
from the CheXpert, CANDID-PTX, SIIM, and RSNA Pneumonia datasets to measure
improvements in classification and segmentation model performance on the test
sets. F1 and Dice scores were used to evaluate classification and segmentation
respectively. One-tailed t-tests with Bonferroni correction assessed the
statistical significance of performance improvements with synthetic data.
Results: Across all experiments, the synthetic data we generated resulted in a
maximum mean classification F1 score improvement of 0.150453 (CI:
0.099108-0.201798; P=0.0031) compared to using only real data. For
segmentation, the maximum Dice score improvement was 0.14575 (CI:
0.108267-0.183233; P=0.0064). Conclusion: Best practices for generating
synthetic chest X-ray images for downstream tasks include conditioning on
single-disease labels or geometrically transformed segmentation masks, as well
as potentially using proxy modeling for fine-tuning such generations.
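A short sketch of the evaluation statistics described above: Dice overlap plus a one-tailed paired t-test with Bonferroni correction. The scores below are made up for illustration; only the metric and test definitions mirror the abstract.
```python
# Dice score and one-tailed paired t-test with Bonferroni correction;
# the per-fold scores are invented for illustration only.
import numpy as np
from scipy import stats

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

# Hypothetical per-fold segmentation scores: with vs. without synthetic data.
with_synth    = np.array([0.71, 0.74, 0.69, 0.73, 0.72])
without_synth = np.array([0.62, 0.66, 0.61, 0.64, 0.63])

t, p = stats.ttest_rel(with_synth, without_synth, alternative="greater")
n_comparisons = 4                                        # e.g., one test per dataset
print("one-tailed p:", p, "Bonferroni-adjusted:", min(p * n_comparisons, 1.0))
```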
♻ ☆ Gestura: A LVLM-Powered System Bridging Motion and Semantics for Real-Time Free-Form Gesture Understanding
Free-form gesture understanding is highly appealing for human-computer
interaction, as it liberates users from the constraints of predefined gesture
categories. However, the sole existing solution, GestureGPT, suffers from limited
recognition accuracy and slow response times. In this paper, we propose
Gestura, an end-to-end system for free-form gesture understanding. Gestura
harnesses a pre-trained Large Vision-Language Model (LVLM) to align the highly
dynamic and diverse patterns of free-form gestures with high-level semantic
concepts. To better capture subtle hand movements across different styles, we
introduce a Landmark Processing Module that compensates for LVLMs' lack of
fine-grained domain knowledge by embedding anatomical hand priors. Further, a
Chain-of-Thought (CoT) reasoning strategy enables step-by-step semantic
inference, transforming shallow knowledge into deep semantic understanding and
significantly enhancing the model's ability to interpret ambiguous or
unconventional gestures. Together, these components allow Gestura to achieve
robust and adaptable free-form gesture comprehension. Additionally, we have
developed the first open-source dataset for free-form gesture intention
reasoning and understanding with over 300,000 annotated QA pairs.
comment: IMWUT2025
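A purely hypothetical sketch of how hand landmarks might be paired with a chain-of-thought prompt; the prompt wording and landmark format are assumptions, not Gestura's actual pipeline or dataset schema.
```python
# Hypothetical landmark-plus-CoT prompt construction; format and wording are
# illustrative assumptions.
def build_cot_prompt(landmarks_seq):
    """landmarks_seq: list of frames, each a list of (x, y, z) hand keypoints."""
    coords = "\n".join(
        f"frame {i}: " + ", ".join(f"({x:.2f},{y:.2f},{z:.2f})" for x, y, z in frame)
        for i, frame in enumerate(landmarks_seq)
    )
    return (
        "You are given 3D hand landmarks over time:\n"
        f"{coords}\n"
        "Step 1: describe the hand shape and motion.\n"
        "Step 2: infer the likely intention behind the gesture.\n"
        "Step 3: answer with one short phrase naming the intention."
    )

prompt = build_cot_prompt([[(0.10, 0.20, 0.0)] * 21, [(0.12, 0.18, 0.0)] * 21])
```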
♻ ☆ OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation
Heyu Guo, Shanmu Wang, Ruichun Ma, Shiqi Jiang, Yasaman Ghasempour, Omid Abari, Baining Guo, Lili Qiu
Vision-language-action (VLA) models have shown strong generalization for
robotic action prediction through large-scale vision-language pretraining.
However, most existing models rely solely on RGB cameras, limiting their
perception and, consequently, manipulation capabilities. We present OmniVLA, an
omni-modality VLA model that integrates novel sensing modalities for
physically-grounded spatial intelligence beyond RGB perception. The core of our
approach is the sensor-masked image, a unified representation that overlays
spatially grounded and physically meaningful masks onto the RGB images, derived
from sensors including an infrared camera, a mmWave radar, and a microphone
array. This image-native unification keeps sensor input close to RGB statistics
to facilitate training, provides a uniform interface across sensor hardware,
and enables data-efficient learning with lightweight per-sensor projectors.
Built on this, we present a multisensory vision-language-action model
architecture and train the model based on an RGB-pretrained VLA backbone. We
evaluate OmniVLA on challenging real-world tasks where sensor-modality
perception guides robotic manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming RGB-only and raw-sensor-input baselines by 59% and 28%, respectively, while showing higher learning efficiency and stronger generalization capability.
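One plausible reading of the sensor-masked image representation (an assumption, not the paper's exact recipe) is to normalize a sensor map, colorize it, and alpha-blend it onto the RGB frame so the result stays close to RGB statistics, as sketched below.
```python
# Hypothetical sensor-masked image construction: blend a normalized sensor map
# onto the RGB frame. Colors and weights are illustrative choices.
import numpy as np

def sensor_masked_image(rgb, sensor_map, color=(255, 0, 0), alpha=0.4):
    """rgb: (H, W, 3) uint8; sensor_map: (H, W) float (e.g., thermal or radar intensity)."""
    span = sensor_map.max() - sensor_map.min()
    m = (sensor_map - sensor_map.min()) / (span + 1e-8)  # normalize to [0, 1]
    overlay = np.empty_like(rgb, dtype=np.float32)
    overlay[...] = color
    w = alpha * m[..., None]
    return ((1 - w) * rgb + w * overlay).astype(np.uint8)

rgb = np.zeros((64, 64, 3), dtype=np.uint8)
heat = np.random.rand(64, 64)                            # stand-in sensor reading
masked = sensor_masked_image(rgb, heat)
```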
♻ ☆ What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images
Time becomes visible through illumination changes in what we see. Inspired by
this, in this paper we explore the potential to learn time awareness from
static images, trying to answer: *what time tells us?* To this end, we first
introduce a Time-Oriented Collection (TOC) dataset, which contains 130,906
images with reliable timestamps. Leveraging this dataset, we propose a
Time-Image Contrastive Learning (TICL) approach to jointly model timestamps and
related visual representations through cross-modal contrastive learning. We
found that the proposed TICL 1) achieves state-of-the-art performance on the timestamp estimation task across various benchmark metrics, and 2) interestingly, despite seeing only static images, learns time-aware embeddings that show strong capability in several time-aware downstream tasks such as time-based image retrieval, video scene classification, and time-aware image editing. Our findings suggest that time-related visual cues can be
learned from static images and are beneficial for various vision tasks, laying
a foundation for future research on understanding time-related visual context.
Project page: https://rathgrith.github.io/timetells_release/
comment: Accepted by TMLR 2025
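A minimal CLIP-style contrastive sketch between image and timestamp embeddings, illustrating the cross-modal objective in spirit; the cyclic timestamp featurization and all dimensions are assumptions rather than TICL's exact design.
```python
# CLIP-style symmetric contrastive loss between image and timestamp embeddings;
# the time encoder and dimensions are illustrative assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeEncoder(nn.Module):
    """Map hour-of-day to an embedding via a cyclic featurization plus an MLP."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, hours):                            # hours: (B,) in [0, 24)
        angles = hours / 24.0 * 2 * math.pi
        feats = torch.stack([torch.sin(angles), torch.cos(angles)], dim=-1)
        return self.mlp(feats)

def contrastive_loss(img_emb, time_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    tim = F.normalize(time_emb, dim=-1)
    logits = img @ tim.T / temperature
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

img_emb = torch.randn(8, 128)                            # placeholder image-encoder outputs
loss = contrastive_loss(img_emb, TimeEncoder()(torch.rand(8) * 24))
```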
♻ ☆ Practical solutions to the relative pose of three calibrated cameras CVPR 2025
Charalambos Tzamos, Viktor Kocur, Yaqing Ding, Daniel Barath, Zuzana Berger Haladova, Torsten Sattler, Zuzana Kukelova
We study the challenging problem of estimating the relative pose of three
calibrated cameras from four point correspondences. We propose novel efficient
solutions to this problem that are based on the simple idea of using four
correspondences to estimate an approximate geometry of the first two views. We
model this geometry either as an affine or a fully perspective geometry
estimated using one additional approximate correspondence. We generate such an
approximate correspondence using a very simple and efficient strategy, where
the new point is the mean point of three corresponding input points. The new
solvers are efficient and easy to implement, since they are based on existing
efficient minimal solvers, i.e., the 4-point affine fundamental matrix, the
well-known 5-point relative pose solver, and the P3P solver. Extensive
experiments on real data show that the proposed solvers, when properly coupled
with local optimization, achieve state-of-the-art results, with the novel
solver based on approximate mean-point correspondences being more robust and
accurate than the affine-based solver.
comment: Paper presented at CVPR 2025 (DOI: 10.1109/CVPR52734.2025.02041).
Code available at https://github.com/kocurvik/threeview and
https://doi.org/10.5281/zenodo.16599943. Data available at
https://doi.org/10.5281/zenodo.16603086
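The mean-point trick is simple enough to sketch directly: from four two-view correspondences, form a fifth approximate correspondence as the mean of three of them, then pass all five points to a standard 5-point solver. Which three points are averaged is a detail of the paper; the sketch below just uses the first three, and the solver call itself is left out.
```python
# Sketch of the approximate mean-point correspondence; index choice is an
# illustrative assumption, and no 5-point solver is implemented here.
import numpy as np

def add_mean_point(pts1: np.ndarray, pts2: np.ndarray, idx=(0, 1, 2)):
    """pts1, pts2: (4, 2) matched image points in views 1 and 2."""
    m1 = pts1[list(idx)].mean(axis=0)
    m2 = pts2[list(idx)].mean(axis=0)
    return np.vstack([pts1, m1]), np.vstack([pts2, m2])   # (5, 2) each

pts1 = np.array([[100., 120.], [310.,  95.], [205., 260.], [ 50., 300.]])
pts2 = np.array([[112., 118.], [322., 101.], [214., 255.], [ 61., 297.]])
p1_5, p2_5 = add_mean_point(pts1, pts2)
# p1_5, p2_5 can now be fed to a standard 5-point relative pose solver to obtain
# an approximate geometry of the first two views.
```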