Computer Vision and Pattern Recognition 162
☆ HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, Huamin Qu
State-of-the-art text-to-video models excel at generating isolated clips but
fall short of creating the coherent, multi-shot narratives that are the
essence of storytelling. We bridge this "narrative gap" with HoloCine, a model
that generates entire scenes holistically to ensure global consistency from the
first shot to the last. Our architecture achieves precise directorial control
through a Window Cross-Attention mechanism that localizes text prompts to
specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within
shots but sparse between them) ensures the efficiency required for minute-scale
generation. Beyond setting a new state-of-the-art in narrative coherence,
HoloCine develops remarkable emergent abilities: a persistent memory for
characters and scenes, and an intuitive grasp of cinematic techniques. Our work
marks a pivotal shift from clip synthesis towards automated filmmaking, making
end-to-end cinematic creation a tangible future. Our code is available at:
https://holo-cine.github.io/.
comment: Project page and code: https://holo-cine.github.io/
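The abstract does not give implementation details, but the two attention patterns can be illustrated with a minimal PyTorch sketch (the shot indexing, the stride for cross-shot sparsity, and all function names are assumptions, not the authors' code): each video token attends only to the text tokens of its own shot, and self-attention is dense within a shot but reaches only a strided subset of tokens from other shots.

# Illustrative sketch (not the authors' code): boolean attention masks for
# shot-localized text conditioning and sparse inter-shot self-attention.
import torch

def window_cross_attention_mask(video_shot_ids, prompt_shot_ids):
    """True where a video token (query) may attend to a text token (key)."""
    # video_shot_ids: (Nv,), prompt_shot_ids: (Nt,)
    return video_shot_ids[:, None] == prompt_shot_ids[None, :]

def sparse_inter_shot_self_attention_mask(shot_ids, stride=8):
    """Dense attention within a shot; only every `stride`-th token across shots."""
    same_shot = shot_ids[:, None] == shot_ids[None, :]          # (N, N)
    keep_sparse = (torch.arange(len(shot_ids)) % stride == 0)   # sparse key set
    cross_shot = (~same_shot) & keep_sparse[None, :]
    return same_shot | cross_shot

# Example: 3 shots of 4 video tokens each, one 2-token prompt per shot.
vid_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
txt_ids = torch.tensor([0, 0, 1, 1, 2, 2])
ca_mask = window_cross_attention_mask(vid_ids, txt_ids)        # (12, 6)
sa_mask = sparse_inter_shot_self_attention_mask(vid_ids, 4)    # (12, 12)

Such boolean masks could then be passed as attn_mask to torch.nn.functional.scaled_dot_product_attention inside the respective attention layers.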
☆ LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas
Guocheng Gordon Qian, Ruihang Zhang, Tsai-Shien Chen, Yusuf Dalva, Anujraaj Argo Goyal, Willi Menapace, Ivan Skorokhodov, Meng Dong, Arpit Sahni, Daniil Ostashev, Ju Hu, Sergey Tulyakov, Kuan-Chieh Jackson Wang
Despite their impressive visual fidelity, existing personalized generative
models lack interactive control over spatial composition and scale poorly to
multiple subjects. To address these limitations, we present LayerComposer, an
interactive framework for personalized, multi-subject text-to-image generation.
Our approach introduces two main contributions: (1) a layered canvas, a novel
representation in which each subject is placed on a distinct layer, enabling
occlusion-free composition; and (2) a locking mechanism that preserves selected
layers with high fidelity while allowing the remaining layers to adapt flexibly
to the surrounding context. Similar to professional image-editing software, the
proposed layered canvas allows users to place, resize, or lock input subjects
through intuitive layer manipulation. Our versatile locking mechanism requires
no architectural changes, relying instead on inherent positional embeddings
combined with a new complementary data sampling strategy. Extensive experiments
demonstrate that LayerComposer achieves superior spatial control and identity
preservation compared to the state-of-the-art methods in multi-subject
personalized image generation.
comment: 9 pages, preprint
☆ Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge
Recent advances in generative modeling have positioned diffusion models as
state-of-the-art tools for sampling from complex data distributions. While
these models have shown remarkable success across single-modality domains such
as images and audio, extending their capabilities to Modality Translation (MT),
i.e., translating information across different sensory modalities, remains an open
challenge. Existing approaches often rely on restrictive assumptions, including
shared dimensionality, Gaussian source priors, and modality-specific
architectures, which limit their generality and theoretical grounding. In this
work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a
general-purpose framework for modality translation based on a latent-variable
extension of Denoising Diffusion Bridge Models. By operating in a shared latent
space, our method learns a bridge between arbitrary modalities without
requiring aligned dimensions. We introduce a contrastive alignment loss to
enforce semantic consistency between paired samples and design a
domain-agnostic encoder-decoder architecture tailored for noise prediction in
latent space. Additionally, we propose a predictive loss to guide training
toward accurate cross-domain translation and explore several training
strategies to improve stability. Our approach supports arbitrary modality pairs
and performs strongly on diverse MT tasks, including multi-view to 3D shape
generation, image super-resolution, and multi-view scene synthesis.
Comprehensive experiments and ablations validate the effectiveness of our
framework, establishing a new strong baseline in general modality translation.
For more information, see our project page:
https://sites.google.com/view/lddbm/home.
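As an illustration of the contrastive alignment loss described above, here is a minimal InfoNCE-style sketch over paired latents in the shared space (the exact form used by LDDBM may differ; the temperature and the symmetric formulation are assumptions):

# Minimal sketch (assumed InfoNCE form) of a contrastive alignment loss that
# pulls paired source/target latents together in the shared latent space.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z_src, z_tgt, temperature=0.07):
    # z_src, z_tgt: (B, D) latents of paired samples from the two modalities.
    z_src = F.normalize(z_src, dim=-1)
    z_tgt = F.normalize(z_tgt, dim=-1)
    logits = z_src @ z_tgt.t() / temperature        # (B, B) similarity matrix
    labels = torch.arange(z_src.size(0), device=z_src.device)
    # Symmetric cross-entropy: each sample's positive is its own pair.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))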
☆ GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation
Guangqi Jiang, Haoran Chang, Ri-Zhao Qiu, Yutong Liang, Mazeyu Ji, Jiyue Zhu, Zhao Dong, Xueyan Zou, Xiaolong Wang
This paper presents GSWorld, a robust, photo-realistic simulator for robotics
manipulation that combines 3D Gaussian Splatting with physics engines. Our
framework advocates "closing the loop" of developing manipulation policies with
reproducible evaluation of policies learned from real-robot data and sim2real
policy training without using real robots. To enable photo-realistic rendering
of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian
Scene Description File), that infuses Gaussian-on-Mesh representation with
robot URDF and other objects. With a streamlined reconstruction pipeline, we
curate a database of GSDF assets that contains 3 robot embodiments for single-arm and
bimanual manipulation, as well as more than 40 objects. Combining GSDF with
physics engines, we demonstrate several immediate and interesting applications: (1)
learning zero-shot sim2real pixel-to-action manipulation policy with
photo-realistic rendering, (2) automated high-quality DAgger data collection
for adapting policies to deployment environments, (3) reproducible benchmarking
of real-robot manipulation policies in simulation, (4) simulation data
collection by virtual teleoperation, and (5) zero-shot sim2real visual
reinforcement learning. Website: https://3dgsworld.github.io/.
☆ SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution
Hyperspectral sensors capture dense spectra per pixel but suffer from low
spatial resolution, causing blurred boundaries and mixed-pixel effects.
Co-registered companion sensors such as multispectral, RGB, or panchromatic
cameras provide high-resolution spatial detail, motivating hyperspectral
super-resolution through the fusion of hyperspectral and multispectral images
(HSI-MSI). Existing deep learning based methods achieve strong performance but
rely on opaque regressors that lack interpretability and often fail when the
MSI has very few bands. We propose SpectraMorph, a physics-guided
self-supervised fusion framework with a structured latent space. Instead of
direct regression, SpectraMorph enforces an unmixing bottleneck: endmember
signatures are extracted from the low-resolution HSI, and a compact multilayer
perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed
by linear mixing, with training performed in a self-supervised manner via the
MSI sensor's spectral response function. SpectraMorph produces interpretable
intermediates, trains in under a minute, and remains robust even with a
single-band (panchromatic) MSI. Experiments on synthetic and real-world
datasets show SpectraMorph consistently outperforming state-of-the-art
unsupervised/self-supervised baselines while remaining very competitive against
supervised baselines.
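The unmixing bottleneck can be sketched as follows (a simplified illustration, not the released code; layer sizes, names, and the MSE objective are assumptions): endmembers come from the low-resolution HSI, an MLP maps MSI pixels to abundances, spectra are rebuilt by linear mixing, and supervision comes from projecting the reconstruction through the MSI sensor's spectral response function.

# Sketch of an unmixing-bottleneck fusion step with SRF-based self-supervision.
import torch
import torch.nn as nn

class AbundanceMLP(nn.Module):
    def __init__(self, msi_bands, k_endmembers, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(msi_bands, hidden), nn.ReLU(),
            nn.Linear(hidden, k_endmembers), nn.Softmax(dim=-1))  # sum-to-one

    def forward(self, msi_pixels):          # (N, msi_bands)
        return self.net(msi_pixels)         # (N, K) abundance-like maps

def self_supervised_loss(msi_pixels, endmembers, srf, model):
    # endmembers: (K, hsi_bands) extracted from the low-res HSI
    # srf: (hsi_bands, msi_bands) spectral response function of the MSI sensor
    abundances = model(msi_pixels)                      # (N, K)
    hsi_pred = abundances @ endmembers                  # linear mixing -> (N, hsi_bands)
    msi_pred = hsi_pred @ srf                           # project back to MSI space
    return nn.functional.mse_loss(msi_pred, msi_pixels)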
☆ Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Large Vision-Language Models (VLMs) have achieved remarkable progress in
multimodal understanding, yet they struggle when reasoning over
information-intensive images that densely interleave textual annotations with
fine-grained graphical elements. The main challenges lie in precisely
localizing critical cues in dense layouts and multi-hop reasoning to integrate
dispersed evidence. We propose Speculative Verdict (SV), a training-free
framework inspired by speculative decoding that combines multiple lightweight
draft experts with a large verdict model. In the draft stage, small VLMs act as
draft experts to generate reasoning paths that provide diverse localization
candidates; in the verdict stage, a strong VLM synthesizes these paths to
produce the final answer, minimizing computational cost while recovering
correct answers. To further improve efficiency and accuracy, SV introduces a
consensus expert selection mechanism that forwards only high-agreement
reasoning paths to the verdict. Empirically, SV achieves consistent gains on
challenging information-intensive and high-resolution visual question answering
benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K.
By synthesizing correct insights from multiple partially accurate reasoning
paths, SV achieves both error correction and cost-efficiency compared to large
proprietary models or training pipelines. Code is available at
https://github.com/Tinaliu0123/speculative-verdict
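A minimal sketch of the consensus expert selection step (the agreement threshold and data structures are assumptions, not the released implementation): only reasoning paths whose draft answers agree with the majority are forwarded to the verdict model.

# Sketch of consensus-based filtering of draft reasoning paths.
from collections import Counter

def select_consensus_paths(draft_outputs, min_agreement=2):
    """draft_outputs: list of (reasoning_path, answer) from small draft VLMs."""
    counts = Counter(ans for _, ans in draft_outputs)
    top_answer, top_count = counts.most_common(1)[0]
    if top_count < min_agreement:          # no consensus: keep everything
        return [path for path, _ in draft_outputs]
    return [path for path, ans in draft_outputs if ans == top_answer]

The selected paths would then be concatenated into the verdict model's prompt together with the image, and the strong VLM synthesizes the final answer from them.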
☆ Real Deep Research for AI, Robotics and Beyond
Xueyan Zou, Jianglong Ye, Hao Zhang, Xiaoyu Xiang, Mingyu Ding, Zhaojing Yang, Yong Jae Lee, Zhuowen Tu, Sifei Liu, Xiaolong Wang
With the rapid growth of research in AI and robotics, now producing over
10,000 papers annually, it has become increasingly difficult for researchers to
stay up to date. Fast evolving trends, the rise of interdisciplinary work, and
the need to explore domains beyond one's expertise all contribute to this
challenge. To address these issues, we propose a generalizable pipeline capable
of systematically analyzing any research area: identifying emerging trends,
uncovering cross domain opportunities, and offering concrete starting points
for new inquiry. In this work, we present Real Deep Research (RDR), a
comprehensive framework applied to the domains of AI and robotics, with a
particular focus on foundation models and robotics advancements. We also
briefly extend our analysis to other areas of science. The main paper details
the construction of the RDR pipeline, while the appendix provides extensive
results across each analyzed topic. We hope this work sheds light for
researchers working in the field of AI and beyond.
comment: website: https://realdeepresearch.github.io
☆ Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers
Inspired by the performance and scalability of autoregressive large language
models (LLMs), transformer-based models have seen recent success in the visual
domain. This study investigates a transformer adaptation for video prediction
with a simple end-to-end approach, comparing various spatiotemporal
self-attention layouts. Focusing on causal modeling of physical simulations
over time, a common shortcoming of existing video-generative approaches, we
attempt to isolate spatiotemporal reasoning via physical object tracking
metrics and unsupervised training on physical simulation datasets. We introduce
a simple yet effective pure transformer model for autoregressive video
prediction, utilizing continuous pixel-space representations for video
prediction. Without the need for complex training strategies or latent
feature-learning components, our approach significantly extends the time
horizon for physically accurate predictions by up to 50% when compared with
existing latent-space approaches, while maintaining comparable performance on
common video quality metrics. In addition, we conduct interpretability
experiments to identify network regions that encode information useful to
perform accurate estimations of PDE simulation parameters via probing models,
and find that this generalizes to the estimation of out-of-distribution
simulation parameters. This work serves as a platform for further
attention-based spatiotemporal modeling of videos via a simple, parameter-efficient, and interpretable approach.
comment: 14 pages, 14 figures
☆ ARGenSeg: Image Segmentation with Autoregressive Image Generation Model NeurIPS 2025
We propose a novel AutoRegressive Generation-based paradigm for image
Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level
perception within a unified framework. Prior works integrating image
segmentation into multimodal large language models (MLLMs) typically employ
either boundary points representation or dedicated segmentation heads. These
methods rely on discrete representations or semantic prompts fed into
task-specific decoders, which limits the ability of the MLLM to capture
fine-grained visual details. To address these challenges, we introduce a
segmentation framework for MLLM based on image generation, which naturally
produces dense masks for target objects. We leverage MLLM to output visual
tokens and detokenize them into images using a universal VQ-VAE, making the
segmentation fully dependent on the pixel-level understanding of the MLLM. To
reduce inference latency, we employ a next-scale-prediction strategy to
generate required visual tokens in parallel. Extensive experiments demonstrate
that our method surpasses prior state-of-the-art approaches on multiple
segmentation datasets with a remarkable boost in inference speed, while
maintaining strong understanding capabilities.
comment: Accepted to NeurIPS 2025, 18 pages
☆ Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples
Recently, Sharma et al. suggested a method called LAyer-SElective Rank
reduction (LASER), which demonstrated that pruning high-order components of
carefully chosen LLM weight matrices can boost downstream accuracy -- without
any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each
requiring full-dataset forward passes) makes it impractical for rapid
deployment. We demonstrate that this overhead can be removed and find that: (i)
Only a small, carefully chosen subset of matrices needs to be inspected --
eliminating the layer-by-layer sweep, (ii) The gradient of each matrix's
singular values pinpoints which matrices merit reduction, (iii) Increasing the
factorization search space by allowing matrix rows to cluster around multiple
subspaces and then decomposing each cluster separately further reduces
overfitting on the original training data and further lifts accuracy by up to
24.6 percentage points, and finally, (iv) we discover that evaluating on just
100 samples rather than the full training data -- both for computing the
indicative gradients and for measuring the final accuracy -- suffices to
further reduce the search time; we attribute this to the fact that adaptation to downstream
tasks is dominated by prompting style, not dataset size. As a result, we show
that combining these findings yields a fast and robust adaptation algorithm for
downstream tasks. Overall, with a single gradient step on 100 examples and a
quick scan of the top candidate layers and factorization techniques, we can
adapt LLMs to new datasets -- entirely without fine-tuning.
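A toy sketch of the two core operations, scoring a matrix by the gradient of a small-batch loss with respect to its singular values and then truncating its high-order components (the toy model, keep ratio, and scoring aggregation are assumptions, not the paper's exact procedure):

# Sketch: rank candidate matrices by singular-value gradients on ~100 samples,
# then prune high-order components of the best candidates -- no fine-tuning.
import torch

def singular_value_gradient_score(weight, forward_loss_fn):
    U, S, Vh = torch.linalg.svd(weight.detach(), full_matrices=False)
    S = S.clone().requires_grad_(True)
    loss = forward_loss_fn(U @ torch.diag(S) @ Vh)   # loss on a small batch
    loss.backward()
    return S.grad.abs().sum().item()                 # larger => more promising

def low_rank_truncate(weight, keep_ratio=0.25):
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    r = max(1, int(keep_ratio * S.numel()))
    return U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

# Toy usage: a single linear "layer" evaluated on 100 random samples.
W = torch.randn(64, 64)
x, y = torch.randn(100, 64), torch.randn(100, 64)
score = singular_value_gradient_score(W, lambda w: ((x @ w.t() - y) ** 2).mean())
W_pruned = low_rank_truncate(W, keep_ratio=0.25)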
☆ Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature IEEE
This paper presents a Multi-Object Tracking (MOT) framework that fuses radar
and camera data to enhance tracking efficiency while minimizing manual
interventions. Unlike many studies that underutilize radar, assigning it a
supplementary role despite its capability to provide accurate range/depth
information of targets in a 3D world coordinate system, our approach gives
radar a central role. Meanwhile, this paper utilizes common features to
enable online calibration to autonomously associate detections from radar and
camera. The main contributions of this work include: (1) the development of a
radar-camera fusion MOT framework that exploits online radar-camera calibration
to simplify the integration of detection results from these two sensors, (2)
the utilization of common features between radar and camera data to accurately
derive real-world positions of detected objects, and (3) the adoption of
feature matching and category-consistency checking to surpass the limitations
of mere position matching in enhancing sensor association accuracy. To the best
of our knowledge, we are the first to investigate the integration of
radar-camera common features and their use in online calibration for achieving
MOT. The efficacy of our framework is demonstrated by its ability to streamline
the radar-camera mapping process and improve tracking precision, as evidenced
by real-world experiments conducted in both controlled environments and actual
traffic scenarios. Code is available at
https://github.com/radar-lab/Radar_Camera_MOT
comment: accepted to IEEE Transactions on Intelligent Transportation Systems
(T-ITS)
☆ CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image
This work proposes a new generation-based 3D reconstruction method, named
Cupid, that accurately infers the camera pose, 3D shape, and texture of an
object from a single 2D image. Cupid casts 3D reconstruction as a conditional
sampling process from a learned distribution of 3D objects, and it jointly
generates voxels and pixel-voxel correspondences, enabling robust pose and
shape estimation under a unified generative framework. By representing both
input camera poses and 3D shape as a distribution in a shared 3D latent space,
Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that
produces initial 3D geometry with associated 2D projections for pose recovery;
and (2) a refinement stage that integrates pose-aligned image features to
enhance structural fidelity and appearance details. Extensive experiments
demonstrate that Cupid outperforms leading 3D reconstruction methods, with a PSNR
gain of over 3 dB and a Chamfer Distance reduction of over 10%, while matching
monocular estimators on pose accuracy and delivering superior visual fidelity
over baseline 3D generative models. For an immersive view of the 3D results
generated by Cupid, please visit cupid3d.github.io.
comment: project page at https://cupid3d.github.io
☆ AlphaFlow: Understanding and Improving MeanFlow Models
Huijie Zhang, Aliaksandr Siarohin, Willi Menapace, Michael Vasilkovsky, Sergey Tulyakov, Qing Qu, Ivan Skorokhodov
MeanFlow has recently emerged as a powerful framework for few-step generative
modeling trained from scratch, but its success is not yet fully understood. In
this work, we show that the MeanFlow objective naturally decomposes into two
parts: trajectory flow matching and trajectory consistency. Through gradient
analysis, we find that these terms are strongly negatively correlated, causing
optimization conflict and slow convergence. Motivated by these insights, we
introduce $\alpha$-Flow, a broad family of objectives that unifies trajectory
flow matching, Shortcut Model, and MeanFlow under one formulation. By adopting
a curriculum strategy that smoothly anneals from trajectory flow matching to
MeanFlow, $\alpha$-Flow disentangles the conflicting objectives, and achieves
better convergence. When trained from scratch on class-conditional ImageNet-1K
256x256 with vanilla DiT backbones, $\alpha$-Flow consistently outperforms
MeanFlow across scales and settings. Our largest $\alpha$-Flow-XL/2+ model
achieves new state-of-the-art results using vanilla DiT backbones, with FID
scores of 2.58 (1-NFE) and 2.15 (2-NFE).
☆ DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
Diffusion Transformer models can generate images with remarkable fidelity and
detail, yet training them at ultra-high resolutions remains extremely costly
due to the self-attention mechanism's quadratic scaling with the number of
image tokens. In this paper, we introduce Dynamic Position Extrapolation
(DyPE), a novel, training-free method that enables pre-trained diffusion
transformers to synthesize images at resolutions far beyond their training
data, with no additional sampling cost. DyPE takes advantage of the spectral
progression inherent to the diffusion process, where low-frequency structures
converge early, while high-frequencies take more steps to resolve.
Specifically, DyPE dynamically adjusts the model's positional encoding at each
diffusion step, matching its frequency spectrum to the current stage of the
generative process. This approach allows us to generate images at resolutions
that exceed the training resolution dramatically, e.g., 16 million pixels using
FLUX. On multiple benchmarks, DyPE consistently improves performance and
achieves state-of-the-art fidelity in ultra-high-resolution image generation,
with gains becoming even more pronounced at higher resolutions. Project page is
available at https://noamissachar.github.io/DyPE/.
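As a purely illustrative sketch (the actual DyPE schedule is defined in the paper and differs), one way to make positional encodings timestep-dependent is to interpolate their effective scale between training-resolution behaviour at early, low-frequency denoising steps and the full target-resolution grid at late steps; the function names and schedule below are assumptions.

# Illustrative only: timestep-dependent positions fed to a sinusoidal encoding.
import torch

def dynamic_positions(target_len, train_len, step, num_steps):
    """1D token positions whose effective scale changes with the diffusion step."""
    progress = step / max(num_steps - 1, 1)              # 0 (noisy) -> 1 (clean)
    # Early steps: compress positions into the trained range (low-freq focus);
    # late steps: use true positions so high-frequency detail can be resolved.
    scale = (1 - progress) * (train_len / target_len) + progress * 1.0
    return torch.arange(target_len, dtype=torch.float32) * scale

def sinusoidal_embedding(positions, dim=64):
    freqs = torch.exp(-torch.arange(0, dim, 2).float() / dim
                      * torch.log(torch.tensor(10000.0)))
    angles = positions[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

pe_early = sinusoidal_embedding(dynamic_positions(4096, 1024, step=0, num_steps=50))
pe_late = sinusoidal_embedding(dynamic_positions(4096, 1024, step=49, num_steps=50))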
☆ MEIcoder: Decoding Visual Stimuli from Neural Activity by Leveraging Most Exciting Inputs NeurIPS 2025
Decoding visual stimuli from neural population activity is crucial for
understanding the brain and for applications in brain-machine interfaces.
However, such biological data is often scarce, particularly in primates or
humans, where high-throughput recording techniques, such as two-photon imaging,
remain challenging or impossible to apply. This, in turn, poses a challenge for
deep learning decoding techniques. To overcome this, we introduce MEIcoder, a
biologically informed decoding method that leverages neuron-specific most
exciting inputs (MEIs), a structural similarity index measure loss, and
adversarial training. MEIcoder achieves state-of-the-art performance in
reconstructing visual stimuli from single-cell activity in primary visual
cortex (V1), especially excelling on small datasets with fewer recorded
neurons. Using ablation studies, we demonstrate that MEIs are the main drivers
of the performance, and in scaling experiments, we show that MEIcoder can
reconstruct high-fidelity natural-looking images from as few as 1,000-2,500
neurons and less than 1,000 training data points. We also propose a unified
benchmark with over 160,000 samples to foster future research. Our results
demonstrate the feasibility of reliable decoding in early visual system and
provide practical insights for neuroscience and neuroengineering applications.
comment: Accepted to NeurIPS 2025
☆ ACS-SegNet: An Attention-Based CNN-SegFormer Segmentation Network for Tissue Segmentation in Histopathology
Automated histopathological image analysis plays a vital role in
computer-aided diagnosis of various diseases. Among developed algorithms, deep
learning-based approaches have demonstrated excellent performance in multiple
tasks, including semantic tissue segmentation in histological images. In this
study, we propose a novel approach based on attention-driven feature fusion of
convolutional neural networks (CNNs) and vision transformers (ViTs) within a
unified dual-encoder model to improve semantic segmentation performance.
Evaluation on two publicly available datasets showed that our model achieved
µIoU/µDice scores of 76.79%/86.87% on the GCPS dataset and
64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline
benchmarks. The implementation of our method is publicly available in a GitHub
repository: https://github.com/NimaTorbati/ACS-SegNet
comment: 5 pages
☆ AutoScape: Geometry-Consistent Long-Horizon Scene Generation ICCV 2025
Jiacheng Chen, Ziyu Jiang, Mingfu Liang, Bingbing Zhuang, Jong-Chyi Su, Sparsh Garg, Ying Wu, Manmohan Chandraker
This paper proposes AutoScape, a long-horizon driving scene generation
framework. At its core is a novel RGB-D diffusion model that iteratively
generates sparse, geometrically consistent keyframes, serving as reliable
anchors for the scene's appearance and geometry. To maintain long-range
geometric consistency, the model 1) jointly handles image and depth in a shared
latent space, 2) explicitly conditions on the existing scene geometry (i.e.,
rendered point clouds) from previously generated keyframes, and 3) steers the
sampling process with a warp-consistent guidance. Given high-quality RGB-D
keyframes, a video diffusion model then interpolates between them to produce
dense and coherent video frames. AutoScape generates realistic and
geometrically consistent driving videos of over 20 seconds, improving the
long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and
43.0%, respectively.
comment: ICCV 2025. Project page: https://auto-scape.github.io
☆ ALICE-LRI: A General Method for Lossless Range Image Generation for Spinning LiDAR Sensors without Calibration Metadata
Samuel Soutullo, Miguel Yermo, David L. Vilariño, Óscar G. Lorenzo, José C. Cabaleiro, Francisco F. Rivera
3D LiDAR sensors are essential for autonomous navigation, environmental
monitoring, and precision mapping in remote sensing applications. To
efficiently process the massive point clouds generated by these sensors, LiDAR
data is often projected into 2D range images that organize points by their
angular positions and distances. While these range image representations enable
efficient processing, conventional projection methods suffer from fundamental
geometric inconsistencies that cause irreversible information loss,
compromising high-fidelity applications. We present ALICE-LRI (Automatic LiDAR
Intrinsic Calibration Estimation for Lossless Range Images), the first general,
sensor-agnostic method that achieves lossless range image generation from
spinning LiDAR point clouds without requiring manufacturer metadata or
calibration files. Our algorithm automatically reverse-engineers the intrinsic
geometry of any spinning LiDAR sensor by inferring critical parameters
including laser beam configuration, angular distributions, and per-beam
calibration corrections, enabling lossless projection and complete point cloud
reconstruction with zero point loss. Comprehensive evaluation across the
complete KITTI and DurLAR datasets demonstrates that ALICE-LRI achieves perfect
point preservation, with zero points lost across all point clouds. Geometric
accuracy is maintained well within sensor precision limits, establishing
geometric losslessness with real-time performance. We also present a
compression case study that validates substantial downstream benefits,
demonstrating significant quality improvements in practical applications. This
paradigm shift from approximate to lossless LiDAR projections opens new
possibilities for high-precision remote sensing applications requiring complete
geometric preservation.
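For context, the basic spherical projection that ALICE-LRI makes lossless can be sketched as below (a simplified NumPy illustration with assumed beam angles; the paper's contribution is estimating the per-beam geometry and corrections so that points do not collide or get lost in this mapping):

# Sketch of a naive spinning-LiDAR range-image projection.
import numpy as np

def project_to_range_image(points, beam_elevations_deg, width=2048):
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rng = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(y, x)                                   # [-pi, pi)
    elevation = np.degrees(np.arcsin(z / np.maximum(rng, 1e-9)))
    rows = np.abs(elevation[:, None] - beam_elevations_deg[None, :]).argmin(axis=1)
    cols = ((azimuth + np.pi) / (2 * np.pi) * width).astype(int) % width
    image = np.zeros((len(beam_elevations_deg), width), dtype=np.float32)
    # Colliding points keep only one range value here -- exactly the kind of
    # information loss that calibrated, lossless projection avoids.
    image[rows, cols] = rng
    return image, rows, cols

# Toy usage with a hypothetical 64-beam sensor and random points.
pts = np.random.randn(10000, 3) * 10
beams = np.linspace(-24.0, 2.0, 64)
range_img, rows, cols = project_to_range_image(pts, beams)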
☆ Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
Recent large vision-language models (LVLMs) demonstrate remarkable
capabilities in processing extended multi-modal sequences, yet the resulting
key-value (KV) cache expansion creates a critical memory bottleneck that
fundamentally limits deployment scalability. While existing KV cache
compression methods focus on retaining high-importance KV pairs to minimize
storage, they often overlook the modality-specific semantic redundancy patterns
that emerge distinctively in multi-modal KV caches. In this work, we first
analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying
levels of redundancy across attention heads. We show that relying solely on
importance can only cover a subset of the full KV cache information
distribution, leading to potential loss of semantic coverage. To address this,
we propose MixKV, a novel method that mixes importance with diversity
for optimized KV cache compression in LVLMs. MixKV adapts to head-wise
semantic redundancy, selectively balancing diversity and importance when
compressing KV pairs. Extensive experiments demonstrate that MixKV
consistently enhances existing methods across multiple LVLMs. Under extreme
compression (budget=64), MixKV improves baseline methods by an average
of 5.1% across five multi-modal understanding benchmarks and achieves
remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on
GUI grounding tasks, all while maintaining comparable inference efficiency.
Furthermore, MixKV extends seamlessly to LLMs with comparable
performance gains. Our code is available at
https://github.com/xuyang-liu16/MixKV.
comment: Our code is available at https://github.com/xuyang-liu16/MixKV
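A minimal sketch of mixing importance with diversity when selecting KV pairs under a budget (the greedy rule, the alpha weighting, and the use of cosine similarity are assumptions, not MixKV's exact head-wise formulation):

# Sketch: greedily keep KV pairs that score high on importance while being
# dissimilar to already-kept keys.
import torch
import torch.nn.functional as F

def mix_importance_diversity(keys, importance, budget, alpha=0.5):
    # keys: (N, D) key vectors of one head; importance: (N,) scores; budget: int
    keys_n = F.normalize(keys, dim=-1)
    kept = [int(importance.argmax())]                 # seed with the top token
    for _ in range(budget - 1):
        sim_to_kept = (keys_n @ keys_n[kept].t()).max(dim=1).values   # (N,)
        score = alpha * importance - (1 - alpha) * sim_to_kept
        score[kept] = float("-inf")                   # do not re-select
        kept.append(int(score.argmax()))
    return torch.tensor(sorted(kept))

# Example: compress 1024 cached tokens of one attention head down to 64.
K = torch.randn(1024, 128)
imp = torch.rand(1024)
idx = mix_importance_diversity(K, imp, budget=64)
compressed_K = K[idx]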
☆ Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward
Multimodal large language models (MLLMs) that integrate visual and textual
reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual
tasks, yet continue to exhibit visual hallucinations and an over-reliance on
textual priors. We present a systematic diagnosis of state-of-the-art
vision-language models using a three-stage evaluation framework, uncovering key
failure modes. To address these, we propose an agent-based architecture that
combines LLM reasoning with lightweight visual modules, enabling fine-grained
analysis and iterative refinement of reasoning chains. Our results highlight
that future visual reasoning models should focus on integrating a broader set of
specialized tools for analyzing visual content. Our system achieves significant
gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or
surpassing much larger models. We will release our framework and evaluation
suite to facilitate future research.
comment: 5 pages
☆ Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling
Multi-bit quantization networks enable flexible deployment of deep neural
networks by supporting multiple precision levels within a single model.
However, existing approaches suffer from significant training overhead as
full-dataset updates are repeated for each supported bit-width, resulting in a
cost that scales linearly with the number of precisions. Additionally, extra
fine-tuning stages are often required to support additional or intermediate
precision options, further compounding the overall training burden. To address
this issue, we propose two techniques that greatly reduce the training overhead
without compromising model utility: (i) Weight bias correction enables shared
batch normalization and eliminates the need for fine-tuning by neutralizing
quantization-induced bias across bit-widths and aligning activation
distributions; and (ii) Bit-wise coreset sampling strategy allows each child
model to train on a compact, informative subset selected via gradient-based
importance scores by exploiting the implicit knowledge transfer phenomenon.
Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and
ViT architectures demonstrate that our method achieves competitive or superior
accuracy while reducing training time up to 7.88x. Our code is released at
https://github.com/a2jinhee/EMQNet_jk.
☆ HybridSOMSpikeNet: A Deep Model with Differentiable Soft Self-Organizing Maps and Spiking Dynamics for Waste Classification
Accurate waste classification is vital for achieving sustainable waste
management and reducing the environmental footprint of urbanization.
Misclassification of recyclable materials contributes to landfill accumulation,
inefficient recycling, and increased greenhouse gas emissions. To address these
issues, this study introduces HybridSOMSpikeNet, a hybrid deep learning
framework that integrates convolutional feature extraction, differentiable
self-organization, and spiking-inspired temporal processing to enable
intelligent and energy-efficient waste classification. The proposed model
employs a pre-trained ResNet-152 backbone to extract deep spatial
representations, followed by a Differentiable Soft Self-Organizing Map
(Soft-SOM) that enhances topological clustering and interpretability. A spiking
neural head accumulates temporal activations over discrete time steps,
improving robustness and generalization. Trained on a ten-class waste dataset,
HybridSOMSpikeNet achieved a test accuracy of 97.39%, outperforming several
state-of-the-art architectures while maintaining a lightweight computational
profile suitable for real-world deployment. Beyond its technical innovations,
the framework provides tangible environmental benefits. By enabling precise and
automated waste segregation, it supports higher recycling efficiency, reduces
contamination in recyclable streams, and minimizes the ecological and
operational costs of waste processing. The approach aligns with global
sustainability priorities, particularly the United Nations Sustainable
Development Goals (SDG 11 and SDG 12), by contributing to cleaner cities,
circular economy initiatives, and intelligent environmental management systems.
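The two non-standard components can be sketched roughly as follows (the prototype count, temperature, and reset-on-fire spiking rule are assumptions, not the paper's exact design): a differentiable Soft-SOM that softly assigns backbone features to a set of prototypes, and a spiking-style head that accumulates activations over discrete time steps before classification.

# Rough sketch of a Soft-SOM layer followed by a spiking-style head.
import torch
import torch.nn as nn

class SoftSOM(nn.Module):
    def __init__(self, in_dim, n_prototypes=64, temperature=1.0):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, in_dim))
        self.temperature = temperature

    def forward(self, feats):                       # (B, in_dim)
        d2 = torch.cdist(feats, self.prototypes) ** 2
        return torch.softmax(-d2 / self.temperature, dim=-1)   # soft assignments

class SpikingHead(nn.Module):
    def __init__(self, in_dim, n_classes, time_steps=4, threshold=1.0):
        super().__init__()
        self.fc = nn.Linear(in_dim, n_classes)
        self.time_steps, self.threshold = time_steps, threshold

    def forward(self, x):
        membrane = torch.zeros(x.size(0), self.fc.out_features, device=x.device)
        spikes = 0.0
        for _ in range(self.time_steps):             # accumulate over time
            membrane = membrane + self.fc(x)
            fired = (membrane >= self.threshold).float()
            spikes = spikes + fired
            membrane = membrane * (1 - fired)         # reset where fired
        return spikes / self.time_steps               # spike-rate logits

logits = SpikingHead(64, 10)(SoftSOM(2048, 64)(torch.randn(8, 2048)))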
☆ UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset NeurIPS 2025
Ultra-high-resolution (UHR) text-to-image (T2I) generation has seen notable
progress. However, two key challenges remain: (1) the absence of a large-scale
high-quality UHR T2I dataset, and (2) the neglect of tailored training
strategies for fine-grained detail synthesis in UHR scenarios. To tackle the
first challenge, we introduce UltraHR-100K, a high-quality dataset of
100K UHR images with rich captions, offering diverse content and strong visual
fidelity. Each image exceeds 3K resolution and is rigorously curated based on
detail richness, content complexity, and aesthetic quality. To tackle the
second challenge, we propose a frequency-aware post-training method that
enhances fine-detail generation in T2I diffusion models. Specifically, we
design (i) Detail-Oriented Timestep Sampling (DOTS) to focus learning
on detail-critical denoising steps, and (ii) Soft-Weighting Frequency
Regularization (SWFR), which leverages the Discrete Fourier Transform (DFT) to
softly constrain frequency components, encouraging high-frequency detail
preservation. Extensive experiments on our proposed UltraHR-eval4K benchmarks
demonstrate that our approach significantly improves the fine-grained detail
quality and overall fidelity of UHR image generation. The code is available at
https://github.com/NJU-PCALab/UltraHR-100k.
comment: Accepted by NeurIPS 2025
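A rough sketch of a soft-weighted frequency regularization term in PyTorch (the radial sigmoid weighting and magnitude-based comparison are assumptions; SWFR's exact weighting is defined in the paper): DFT magnitudes of prediction and target are compared under a soft weight that emphasizes high-frequency components.

# Sketch of a soft high-frequency regularization loss on DFT magnitudes.
import torch

def swfr_loss(pred, target, sharpness=4.0):
    # pred, target: (B, C, H, W) images or denoised latents.
    Fp = torch.fft.fftshift(torch.fft.fft2(pred, norm="ortho"), dim=(-2, -1))
    Ft = torch.fft.fftshift(torch.fft.fft2(target, norm="ortho"), dim=(-2, -1))
    H, W = pred.shape[-2:]
    fy = torch.linspace(-1, 1, H, device=pred.device)[:, None]
    fx = torch.linspace(-1, 1, W, device=pred.device)[None, :]
    radius = torch.sqrt(fx ** 2 + fy ** 2)                    # 0 at DC
    soft_weight = torch.sigmoid(sharpness * (radius - 0.5))   # soft high-pass
    return (soft_weight * (Fp.abs() - Ft.abs()).abs()).mean()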
☆ Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging NeurIPS 2025
Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Dong Yang, Pengfei Guo, Marc Edgar, Daguang Xu, Bernhard Kainz, Bjoern Menze
Recent progress in vision-language modeling for 3D medical imaging has been
fueled by large-scale computed tomography (CT) corpora with paired free-text
reports, stronger architectures, and powerful pretrained models. This has
enabled applications such as automated report generation and text-conditioned
3D image synthesis. Yet, current approaches struggle with high-resolution,
long-sequence volumes: contrastive pretraining often yields vision encoders
that are misaligned with clinical language, and slice-wise tokenization blurs
fine anatomy, reducing diagnostic performance on downstream tasks. We introduce
BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder
that unifies 2D and 3D training and inference while producing compact,
frequency-aware volumetric tokens. A three-stage training curriculum enables
(i) local reconstruction, (ii) overlapping-window tiling, and (iii)
long-context decoder refinement, during which the model learns from short slice
excerpts yet generalizes to scans exceeding 300 slices without additional
memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it
improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and
Merlin for report generation; and it reduces FID by 75% and halves FVD compared
to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically
consistent 512×512×241 volumes. These results confirm that precise
three-dimensional tokenization, rather than larger language backbones alone, is
essential for scalable vision-language modeling in 3D medical imaging. The
codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D
comment: NeurIPS 2025
☆ Deep Learning in Dental Image Analysis: A Systematic Review of Datasets, Methodologies, and Emerging Challenges
Efficient analysis and processing of dental images are crucial for dentists
to achieve accurate diagnosis and optimal treatment planning. However, dental
imaging inherently poses several challenges, such as low contrast, metallic
artifacts, and variations in projection angles. Combined with the subjectivity
arising from differences in clinicians' expertise, manual interpretation often
proves time-consuming and prone to inconsistency. Artificial intelligence
(AI)-based automated dental image analysis (DIA) offers a promising solution to
these issues and has become an integral part of computer-aided dental diagnosis
and treatment. Among various AI technologies, deep learning (DL) stands out as
the most widely applied and influential approach due to its superior feature
extraction and representation capabilities. To comprehensively summarize recent
progress in this field, we focus on the two fundamental aspects of DL
research: datasets and models. In this paper, we systematically review 260
studies on DL applications in DIA, including 49 papers on publicly available
dental datasets and 211 papers on DL-based algorithms. We first introduce the
basic concepts of dental imaging and summarize the characteristics and
acquisition methods of existing datasets. Then, we present the foundational
techniques of DL and categorize relevant models and algorithms according to
different DIA tasks, analyzing their network architectures, optimization
strategies, training methods, and performance. Furthermore, we summarize
commonly used training and evaluation metrics in the DIA domain. Finally, we
discuss the current challenges of existing research and outline potential
future directions. We hope that this work provides a valuable and systematic
reference for researchers in this field. All supplementary materials and
detailed comparison tables will be made publicly available on GitHub.
comment: 52 pages, 24 figures. Under Review
☆ SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding
Long video understanding remains challenging due to its complex, diverse, and
temporally scattered content. Although video large language models (Video-LLMs)
can process videos lasting tens of minutes, applying them to truly long
sequences is computationally prohibitive and often leads to unfocused or
inconsistent reasoning. A promising solution is to select only the most
informative frames, yet existing approaches typically ignore temporal
dependencies or rely on unimodal evidence, limiting their ability to provide
complete and query-relevant context. We propose a Semantic-Visual Consensus
Evidence Selection (SeViCES) framework for effective and reliable long video
understanding. SeViCES is training-free and model-agnostic, and introduces two
key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module
selects frames through (1) a temporal-aware semantic branch that leverages LLM
reasoning over captions, and (2) a cluster-guided visual branch that aligns
embeddings with semantic scores via mutual information. The Answer Consensus
Refinement (ACR) module further resolves inconsistencies between semantic- and
visual-based predictions by fusing evidence and constraining the answer space.
Extensive experiments on long video understanding benchmarks show that SeViCES
consistently outperforms state-of-the-art methods in both accuracy and
robustness, demonstrating the importance of consensus-driven evidence selection
for Video-LLMs.
☆ OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects NeurIPS 2025
Free-moving object reconstruction from monocular video remains challenging,
particularly without reliable pose or depth cues and under arbitrary object
motion. We introduce OnlineSplatter, a novel online feed-forward framework
generating high-quality, object-centric 3D Gaussians directly from RGB frames
without requiring camera pose, depth priors, or bundle optimization. Our
approach anchors reconstruction using the first frame and progressively refines
the object representation through a dense Gaussian primitive field, maintaining
constant computational cost regardless of video sequence length. Our core
contribution is a dual-key memory module combining latent appearance-geometry
keys with explicit directional keys, robustly fusing current frame features
with temporally aggregated object states. This design enables effective
handling of free-moving objects via spatial-guided memory readout and an
efficient sparsification mechanism, ensuring comprehensive yet compact object
coverage. Evaluations on real-world datasets demonstrate that OnlineSplatter
significantly outperforms state-of-the-art pose-free reconstruction baselines,
consistently improving with more observations while maintaining constant memory
and runtime.
comment: NeurIPS 2025 (Spotlight)
☆ Unsupervised Domain Adaptation via Similarity-based Prototypes for Cross-Modality Segmentation MICCAI 2021
Deep learning models have achieved great success on various vision
challenges, but a well-trained model would face drastic performance degradation
when applied to unseen data. Since the model is sensitive to domain shift,
unsupervised domain adaptation attempts to reduce the domain gap and avoid
costly annotation of unseen domains. This paper proposes a novel framework for
cross-modality segmentation via similarity-based prototypes. In specific, we
learn class-wise prototypes within an embedding space, then introduce a
similarity constraint to make these prototypes representative for each semantic
class while separable from different classes. Moreover, we use dictionaries to
store prototypes extracted from different images, which prevents the
class-missing problem and enables the contrastive learning of prototypes, and
further improves performance. Extensive experiments show that our method
achieves better results than other state-of-the-art methods.
comment: MICCAI 2021
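A minimal sketch of class-wise prototypes with a dictionary-based contrastive constraint (the loss form, temperature, and dictionary shape are assumptions; it also assumes every class appears in the batch, whereas the paper's dictionaries of stored prototypes handle the class-missing problem):

# Sketch: class prototypes in an embedding space, contrasted against a
# dictionary of prototypes stored from previous images.
import torch
import torch.nn.functional as F

def class_prototypes(embeddings, labels, num_classes):
    # embeddings: (N, D) pixel/patch features; labels: (N,) semantic classes.
    protos = torch.stack([embeddings[labels == c].mean(dim=0)
                          for c in range(num_classes)])
    return F.normalize(protos, dim=-1)                       # (C, D)

def prototype_contrastive_loss(protos, dictionary, temperature=0.1):
    # dictionary: (C, M, D) prototypes of the same classes from past images.
    logits = protos @ F.normalize(dictionary.flatten(0, 1), dim=-1).t() / temperature
    C, M = dictionary.shape[:2]
    targets = torch.zeros_like(logits)
    for c in range(C):
        targets[c, c * M:(c + 1) * M] = 1.0 / M    # positives: same-class entries
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()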
☆ GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models
Muhammad Atif Butt, Alexandra Gomez-Villa, Tao Wu, Javier Vazquez-Corral, Joost Van De Weijer, Kai Wang
Recent years have seen impressive advances in text-to-image generation, with
image generative or unified models producing high-quality images from text. Yet
these models still struggle with fine-grained color controllability, often
failing to accurately match colors specified in text prompts. While existing
benchmarks evaluate compositional reasoning and prompt adherence, none
systematically assess color precision. Color is fundamental to human visual
perception and communication, critical for applications from art to design
workflows requiring brand consistency. However, current benchmarks either
neglect color or rely on coarse assessments, missing key capabilities such as
interpreting RGB values or aligning with human expectations. To this end, we
propose GenColorBench, the first comprehensive benchmark for text-to-image
color generation, grounded in color systems like ISCC-NBS and CSS3/X11,
including numerical colors which are absent elsewhere. With 44K color-focused
prompts covering 400+ colors, it reveals models' true capabilities via
perceptual and automated assessments. Evaluations of popular text-to-image
models using GenColorBench show performance variations, highlighting which
color conventions models understand best and identifying failure modes. Our
GenColorBench assessments will guide improvements in precise color generation.
The benchmark will be made public upon acceptance.
☆ Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang
Most video reasoning models only generate textual reasoning traces without
indicating when and where key evidence appears. Recent models such as OpenAI-o3
have sparked wide interest in evidence-centered reasoning for images, yet
extending this ability to videos is more challenging, as it requires joint
temporal tracking and spatial localization across dynamic scenes. We introduce
Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal
evidence into video reasoning, and carefully collect training data and design
training strategies to address the aforementioned challenges. The model
highlights key timestamps, objects, and bounding boxes alongside its answers,
allowing reasoning to be grounded in concrete visual observations. To enable
this functionality, we first curate and build two high-quality datasets,
STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed
temporal and spatial annotations, since most existing datasets offer either
temporal spans for videos or spatial boxes on images, lacking unified
spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start
reinforcement learning strategy with multiple specially designed rewards that
jointly encourage answer accuracy, temporal alignment, and spatial precision.
On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance,
raising mAM by 14.4% and mLGM by 24.2% on the Qwen2.5-VL baseline. Consistent
improvements are also observed on a broad range of video understanding
benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond
accuracy, the reasoning traces produced by Open-o3 Video also provide valuable
signals for test-time scaling, enabling confidence-aware verification and
improving answer reliability.
☆ EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence
Ding Zou, Feifan Wang, Mengyu Ge, Siyuan Fan, Zongbing Zhang, Wei Chen, Lingfeng Wang, Zhongyou Hu, Wenrui Yan, Zhengwei Gao, Hao Wang, Weizhao Jin, Yu Zhang, Hainan Zhao, Mingliang Zhang, Xianxian Xi, Yaru Zhang, Wenyuan Li, Zhengguang Gao, Yurui Zhu
The realization of Artificial General Intelligence (AGI) necessitates
Embodied AI agents capable of robust spatial perception, effective task
planning, and adaptive execution in physical environments. However, current
large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks
suffer from key limitations, including a significant gap between model design
and agent requirements, an unavoidable trade-off between real-time latency and
performance, and the use of unauthentic, offline evaluation metrics. To address
these challenges, we propose EmbodiedBrain, a novel vision-language foundation
model available in both 7B and 32B parameter sizes. Our framework features an
agent-aligned data structure and employs a powerful training methodology that
integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augmented Group
Relative Policy Optimization (Step-GRPO), which boosts long-horizon task
success by integrating preceding steps as Guided Precursors. Furthermore, we
incorporate a comprehensive reward system, including a Generative Reward Model
(GRM) accelerated at the infrastructure level, to improve training efficiency.
To enable thorough validation, we establish a three-part evaluation system
encompassing General, Planning, and End-to-End Simulation Benchmarks,
highlighted by the proposal and open-sourcing of a novel, challenging
simulation environment. Experimental results demonstrate that EmbodiedBrain
achieves superior performance across all metrics, establishing a new
state-of-the-art for embodied foundation models. Towards paving the way for the
next generation of generalist embodied agents, we open-source all of our data,
model weight, and evaluating methods, which are available at
https://zterobot.github.io/EmbodiedBrain.github.io.
☆ From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail
In this paper, we investigate how users perceive the visual quality of crowd
character representations at different levels of detail (LoD) and viewing
distances. Each representation (geometric meshes, image-based impostors, Neural
Radiance Fields (NeRFs), and 3D Gaussians) exhibits distinct trade-offs between
visual fidelity and computational performance. Our qualitative and quantitative
results provide insights to guide the design of perceptually optimized LoD
strategies for crowd rendering.
☆ From Cheap to Pro: A Learning-based Adaptive Camera Parameter Network for Professional-Style Imaging
Consumer-grade camera systems often struggle to maintain stable image quality
under complex illumination conditions such as low light, high dynamic range,
and backlighting, as well as spatial color temperature variation. These issues
lead to underexposure, color casts, and tonal inconsistency, which degrade the
performance of downstream vision tasks. To address this, we propose
ACamera-Net, a lightweight and scene-adaptive camera parameter adjustment
network that directly predicts optimal exposure and white balance from RAW
inputs. The framework consists of two modules: ACamera-Exposure, which
estimates ISO to alleviate underexposure and contrast loss, and ACamera-Color,
which predicts correlated color temperature and gain factors for improved color
consistency. Optimized for real-time inference on edge devices, ACamera-Net can
be seamlessly integrated into imaging pipelines. Trained on diverse real-world
data with annotated references, the model generalizes well across lighting
conditions. Extensive experiments demonstrate that ACamera-Net consistently
enhances image quality and stabilizes perception outputs, outperforming
conventional auto modes and lightweight baselines without relying on additional
image enhancement modules.
comment: 13 pages. Code and project page will be released
☆ Deep Learning-Powered Visual SLAM Aimed at Assisting Visually Impaired Navigation
Despite advancements in SLAM technologies, robust operation under challenging
conditions such as low texture, motion blur, or difficult lighting remains an
open problem. Such conditions are common in applications such as assistive
navigation for the visually impaired. These challenges undermine localization
accuracy and tracking stability, reducing navigation reliability and safety. To
overcome these limitations, we present SELM-SLAM3, a deep learning-enhanced
visual SLAM framework that integrates SuperPoint and LightGlue for robust
feature extraction and matching. We evaluated our framework using TUM RGB-D,
ICL-NUIM, and TartanAir datasets, which feature diverse and challenging
scenarios. SELM-SLAM3 outperforms conventional ORB-SLAM3 by an average of
87.84% and exceeds state-of-the-art RGB-D SLAM systems by 36.77%. Our framework
demonstrates enhanced performance under challenging conditions, such as
low-texture scenes and fast motion, providing a reliable platform for
developing navigation aids for the visually impaired.
comment: 8 pages, 7 figures, 4 tables
☆ Blur2seq: Blind Deblurring and Camera Trajectory Estimation from a Single Camera Motion-blurred Image
Motion blur caused by camera shake, particularly under large or rotational
movements, remains a major challenge in image restoration. We propose a deep
learning framework that jointly estimates the latent sharp image and the
underlying camera motion trajectory from a single blurry image. Our method
leverages the Projective Motion Blur Model (PMBM), implemented efficiently
using a differentiable blur creation module compatible with modern networks. A
neural network predicts a full 3D rotation trajectory, which guides a
model-based restoration network trained end-to-end. This modular architecture
provides interpretability by revealing the camera motion that produced the
blur. Moreover, this trajectory enables the reconstruction of the sequence of
sharp images that generated the observed blurry image. To further refine
results, we optimize the trajectory post-inference via a reblur loss, improving
consistency between the blurry input and the restored output. Extensive
experiments show that our method achieves state-of-the-art performance on both
synthetic and real datasets, particularly in cases with severe or spatially
variant blur, where end-to-end deblurring networks struggle.
Code and trained models are available at
https://github.com/GuillermoCarbajal/Blur2Seq/
☆ Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis
The advancement of Multimodal Large Language Models (MLLMs) has bridged the
gap between vision and language tasks, enabling the implementation of
Explainable DeepFake Analysis (XDFA). However, current methods suffer from a
lack of fine-grained awareness: the description of artifacts in data annotation
is unreliable and coarse-grained, and the models fail to support the output of
connections between textual forgery explanations and the visual evidence of
artifacts, as well as the input of queries for arbitrary facial regions. As a
result, their responses are not sufficiently grounded in Face Visual Context
(Facext). To address this limitation, we propose the Fake-in-Facext (FiFa)
framework, with contributions focusing on data annotation and model
construction. We first define a Facial Image Concept Tree (FICT) to divide
facial images into fine-grained regional concepts, thereby obtaining a more
reliable data annotation pipeline, FiFa-Annotator, for forgery explanation.
Based on this dedicated data annotation, we introduce a novel
Artifact-Grounding Explanation (AGE) task, which generates textual forgery
explanations interleaved with segmentation masks of manipulated artifacts. We
propose a unified multi-task learning architecture, FiFa-MLLM, to
simultaneously support abundant multimodal inputs and outputs for fine-grained
Explainable DeepFake Analysis. With multiple auxiliary supervision tasks,
FiFa-MLLM can outperform strong baselines on the AGE task and achieve SOTA
performance on existing XDFA datasets. The code and data will be made
open-source at https://github.com/lxq1000/Fake-in-Facext.
comment: 25 pages, 9 figures, 17 tables
☆ Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning
Inspired by recent advancements in LLM reasoning, the field of multimodal
reasoning has seen remarkable progress, achieving significant performance gains
on intricate tasks such as mathematical problem-solving. Despite this progress,
current multimodal large reasoning models exhibit two key limitations. They
tend to employ computationally expensive reasoning even for simple queries,
leading to inefficiency. Furthermore, this focus on specialized reasoning often
impairs their broader, more general understanding capabilities. In this paper,
we propose Metis-HOME: a Hybrid Optimized Mixture-of-Experts framework designed
to address this trade-off. Metis-HOME enables a "Hybrid Thinking" paradigm by
structuring the original dense model into two distinct expert branches: a
thinking branch tailored for complex, multi-step reasoning, and a non-thinking
branch optimized for rapid, direct inference on tasks like general VQA and OCR.
A lightweight, trainable router dynamically allocates queries to the most
suitable expert. We instantiate Metis-HOME by adapting the Qwen2.5-VL-7B into
an MoE architecture. Comprehensive evaluations reveal that our approach not
only substantially enhances complex reasoning abilities but also improves the
model's general capabilities, reversing the degradation trend observed in other
reasoning-specialized models. Our work establishes a new paradigm for building
powerful and versatile MLLMs, effectively resolving the prevalent
reasoning-vs-generalization dilemma.
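A minimal sketch of the lightweight routing idea (the pooled query features, two-way gate, and hard argmax routing are assumptions, not Metis-HOME's actual router): a small trainable gate sends each query either to the thinking branch for multi-step reasoning or to the non-thinking branch for direct answers.

# Sketch of a two-way query router over expert branches.
import torch
import torch.nn as nn

class HybridThinkingRouter(nn.Module):
    def __init__(self, hidden_dim, num_experts=2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)

    def forward(self, query_features):            # (B, hidden_dim) pooled features
        probs = torch.softmax(self.gate(query_features), dim=-1)
        return probs.argmax(dim=-1), probs        # 0 = non-thinking, 1 = thinking

def answer(query_features, thinking_branch, direct_branch, router):
    route, _ = router(query_features)
    outputs = []
    for feat, r in zip(query_features, route):
        branch = thinking_branch if r == 1 else direct_branch
        outputs.append(branch(feat.unsqueeze(0)))
    return outputs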
☆ EchoDistill: Bidirectional Concept Distillation for One-Step Diffusion Personalization
Recent advances in accelerating text-to-image (T2I) diffusion models have
enabled the synthesis of high-fidelity images even in a single step. However,
personalizing these models to incorporate novel concepts remains a challenge
due to the limited capacity of one-step models to capture new concept
distributions effectively. We propose a bidirectional concept distillation
framework, EchoDistill, to enable one-step diffusion personalization (1-SDP).
Our approach involves an end-to-end training process where a multi-step
diffusion model (teacher) and a one-step diffusion model (student) are trained
simultaneously. The concept is first distilled from the teacher model to the
student, and then echoed back from the student to the teacher. Throughout this
process, we share the text encoder between the two models to ensure
consistent semantic understanding. Following this, the student model is
optimized with adversarial losses to align with the real image distribution and
with alignment losses to maintain consistency with the teacher's output.
Furthermore, we introduce the bidirectional echoing refinement strategy,
wherein the student model leverages its faster generation capability to provide
feedback to the teacher model. This bidirectional concept distillation
mechanism not only enhances the student's ability to personalize novel concepts
but also improves the generative quality of the teacher model. Our experiments
demonstrate that this collaborative framework significantly outperforms
existing personalization methods over the 1-SDP setup, establishing a novel
paradigm for rapid and effective personalization in T2I diffusion models.
comment: Project page available at
https://liulisixin.github.io/EchoDistill-page/
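A minimal, hypothetical sketch of the bidirectional teacher-student loop described in the EchoDistill abstract above; the linear stand-ins, MSE objectives, and optimizer settings are placeholders, and the paper's adversarial and alignment losses and diffusion backbones are omitted.

import torch
import torch.nn.functional as F

# Linear stand-ins for the multi-step teacher and one-step student generators.
teacher = torch.nn.Linear(64, 64)
student = torch.nn.Linear(64, 64)
opt_t = torch.optim.Adam(teacher.parameters(), lr=1e-4)
opt_s = torch.optim.Adam(student.parameters(), lr=1e-4)

for step in range(100):
    z = torch.randn(8, 64)                                   # shared conditioning / latent
    # Distill: the student is pulled toward the teacher's output.
    loss_student = F.mse_loss(student(z), teacher(z).detach())
    opt_s.zero_grad(); loss_student.backward(); opt_s.step()
    # Echo back: the teacher is nudged toward the (faster) student's samples,
    # standing in for the feedback loop described in the abstract.
    loss_teacher = F.mse_loss(teacher(z), student(z).detach())
    opt_t.zero_grad(); loss_teacher.backward(); opt_t.step()
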
☆ Reliable and Reproducible Demographic Inference for Fairness in Face Analysis
Fairness evaluation in face analysis systems (FAS) typically depends on
automatic demographic attribute inference (DAI), which itself relies on
predefined demographic segmentation. However, the validity of fairness auditing
hinges on the reliability of the DAI process. We begin by providing a
theoretical motivation for this dependency, showing that improved DAI
reliability leads to less biased and lower-variance estimates of FAS fairness.
To address this, we propose a fully reproducible DAI pipeline that replaces
conventional end-to-end training with a modular transfer learning approach. Our
design integrates pretrained face recognition encoders with non-linear
classification heads. We audit this pipeline across three dimensions: accuracy,
fairness, and a newly introduced notion of robustness, defined via
intra-identity consistency. The proposed robustness metric is applicable to any
demographic segmentation scheme. We benchmark the pipeline on gender and
ethnicity inference across multiple datasets and training setups. Our results
show that the proposed method outperforms strong baselines, particularly on
ethnicity, which is the more challenging attribute. To promote transparency and
reproducibility, we will publicly release the training dataset metadata, full
codebase, pretrained models, and evaluation toolkit. This work contributes a
reliable foundation for demographic inference in fairness auditing.
☆ Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Video reasoning, which requires multi-step deduction across frames, remains a
major challenge for multimodal large language models (MLLMs). While
reinforcement learning (RL)-based methods enhance reasoning capabilities, they
often rely on text-only chains that yield ungrounded or hallucinated
conclusions. Conversely, frame-retrieval approaches introduce visual grounding
but still struggle with inaccurate evidence localization. To address these
challenges, we present Conan, a framework for evidence-grounded multi-step
video reasoning. Conan identifies contextual and evidence frames, reasons over
cross-frame clues, and adaptively decides when to conclude or explore further.
To achieve this, we (1) construct Conan-91K, a large-scale dataset of
automatically generated reasoning traces that includes frame identification,
evidence reasoning, and action decision, and (2) design a multi-stage
progressive cold-start strategy combined with an
Identification-Reasoning-Action (AIR) RLVR training framework to jointly
enhance multi-step visual reasoning. Extensive experiments on six multi-step
reasoning benchmarks demonstrate that Conan surpasses the baseline
Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving
state-of-the-art performance. Furthermore, Conan generalizes effectively to
long-video understanding tasks, validating its strong scalability and
robustness.
☆ Transferable Black-Box One-Shot Forging of Watermarks via Image Preference Models NeurIPS 2025
Tomáš Souček, Sylvestre-Alvise Rebuffi, Pierre Fernandez, Nikola Jovanović, Hady Elsahar, Valeriu Lacatusu, Tuan Tran, Alexandre Mourachko
Recent years have seen a surge in interest in digital content watermarking
techniques, driven by the proliferation of generative models and increased
legal pressure. With an ever-growing percentage of AI-generated content
available online, watermarking plays an increasingly important role in ensuring
content authenticity and attribution at scale. Many works have assessed the
robustness of watermarking to removal attacks, yet watermark forging, the
scenario in which a watermark is stolen from genuine content and applied to
malicious content, remains underexplored. In this work, we
investigate watermark forging in the context of widely used post-hoc image
watermarking. Our contributions are as follows. First, we introduce a
preference model to assess whether an image is watermarked. The model is
trained using a ranking loss on purely procedurally generated images without
any need for real watermarks. Second, we demonstrate the model's capability to
remove and forge watermarks by optimizing the input image through
backpropagation. This technique requires only a single watermarked image and
works without knowledge of the watermarking model, making our attack much
simpler and more practical than attacks introduced in related work. Third, we
evaluate our proposed method on a variety of post-hoc image watermarking
models, demonstrating that our approach can effectively forge watermarks,
questioning the security of current watermarking approaches. Our code and
further resources are publicly available.
comment: NeurIPS 2025
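A rough sketch of the attack idea described above, optimizing an image by backpropagation through a frozen watermark scorer; the scorer architecture, loss weights, and image size are stand-ins, not the authors' preference model.

import torch
import torch.nn.functional as F

# Frozen stand-in for the learned preference model: maps an image to a scalar
# "looks watermarked" score (the real scorer and its ranking-loss training are
# described in the paper; this module is only a placeholder).
scorer = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))
for p in scorer.parameters():
    p.requires_grad_(False)

target = torch.rand(1, 3, 64, 64)             # image onto which a watermark is forged
forged = target.clone().requires_grad_(True)
optimizer = torch.optim.Adam([forged], lr=1e-2)

for _ in range(200):
    # Push the watermark score up while staying close to the original pixels.
    loss = -scorer(forged).mean() + 10.0 * F.mse_loss(forged, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
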
☆ Dynamic Weight Adjustment for Knowledge Distillation: Leveraging Vision Transformer for High-Accuracy Lung Cancer Detection and Real-Time Deployment
This paper presents the FuzzyDistillViT-MobileNet model, a novel approach for
lung cancer (LC) classification, leveraging dynamic fuzzy logic-driven
knowledge distillation (KD) to address uncertainty and complexity in disease
diagnosis. Unlike traditional models that rely on static KD with fixed weights,
our method dynamically adjusts the distillation weight using fuzzy logic,
enabling the student model to focus on high-confidence regions while reducing
attention to ambiguous areas. This dynamic adjustment improves the model's
ability to handle varying uncertainty levels across different regions of LC
images. We employ the Vision Transformer (ViT-B32) as the instructor model,
which effectively transfers knowledge to the student model, MobileNet,
enhancing the student's generalization capabilities. The training process is
further optimized using a dynamic weight adjustment mechanism that adapts the
training procedure for improved convergence and performance. To enhance image
quality, we introduce pixel-level image fusion improvement techniques such as
Gamma correction and Histogram Equalization. The processed images (Pix1 and
Pix2) are fused using a wavelet-based fusion method to improve image resolution
and feature preservation. This fusion method uses the wavedec2 function to
standardize images to a 224x224 resolution, decompose them into multi-scale
frequency components, and recursively average coefficients at each level for
better feature representation. To address computational efficiency, Genetic
Algorithm (GA) is used to select the most suitable pre-trained student model
from a pool of 12 candidates, balancing model performance with computational
cost. The model is evaluated on two datasets: LC25000
histopathological images (99.16% accuracy) and IQOTH/NCCD CT-scan images
(99.54% accuracy), demonstrating robustness across different imaging domains.
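As an illustration of the dynamically weighted distillation described above, a toy loss that scales the KD term per sample by teacher confidence; this confidence-based weighting only stands in for the paper's fuzzy-logic rules and is an assumption.

import torch
import torch.nn.functional as F

def fuzzy_kd_loss(student_logits, teacher_logits, labels, T: float = 4.0):
    """Per-sample distillation weight derived from teacher confidence, standing
    in for fuzzy-logic rules; high-confidence samples lean on KD, ambiguous
    samples lean on the ground-truth cross-entropy."""
    teacher_prob = F.softmax(teacher_logits / T, dim=-1)
    alpha = teacher_prob.max(dim=-1).values.clamp(0.0, 1.0)   # confidence weight
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  teacher_prob, reduction="none").sum(-1) * (T * T)
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    return (alpha * kd + (1.0 - alpha) * ce).mean()

loss = fuzzy_kd_loss(torch.randn(8, 3), torch.randn(8, 3), torch.randint(0, 3, (8,)))
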
☆ Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval
Existing approaches for image-to-recipe retrieval have the implicit
assumption that a food image can fully capture the details textually documented
in its recipe. However, a food image only reflects the visual outcome of a
cooked dish and not the underlying cooking process. Consequently, learning
cross-modal representations to bridge the modality gap between images and
recipes tends to ignore subtle, recipe-specific details that are not visually
apparent but are crucial for recipe retrieval. Specifically, the
representations are biased to capture the dominant visual elements, resulting
in difficulty in ranking similar recipes with subtle differences in use of
ingredients and cooking methods. The bias in representation learning is
expected to be more severe when the training data mixes images and
recipes sourced from different cuisines. This paper proposes a novel causal
approach that predicts the culinary elements potentially overlooked in images,
while explicitly injecting these elements into cross-modal representation
learning to mitigate biases. Experiments are conducted on the standard
monolingual Recipe1M dataset and a newly curated multilingual multicultural
cuisine dataset. The results indicate that the proposed causal representation
learning is capable of uncovering subtle ingredients and cooking actions and
achieves impressive retrieval performance on both monolingual and multilingual
multicultural datasets.
comment: ACM Multimedia 2025
☆ Positional Encoding Field
Diffusion Transformers (DiTs) have emerged as the dominant architecture for
visual generation, powering state-of-the-art image and video models. By
representing images as patch tokens with positional encodings (PEs), DiTs
combine Transformer scalability with spatial and temporal inductive biases. In
this work, we revisit how DiTs organize visual content and discover that patch
tokens exhibit a surprising degree of independence: even when PEs are
perturbed, DiTs still produce globally coherent outputs, indicating that
spatial coherence is primarily governed by PEs. Motivated by this finding, we
introduce the Positional Encoding Field (PE-Field), which extends positional
encodings from the 2D plane to a structured 3D field. PE-Field incorporates
depth-aware encodings for volumetric reasoning and hierarchical encodings for
fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D
space. Our PE-Field-augmented DiT achieves state-of-the-art performance on
single-image novel view synthesis and generalizes to controllable spatial image
editing.
comment: 8 pages, 9 figures
☆ Synthetic Data for Robust Runway Detection
Deep vision models are now mature enough to be integrated in industrial and
possibly critical applications such as autonomous navigation. Yet, the data
collection and labeling required to train such models involve too much effort
and cost for a single company or product. This drawback is more significant in
critical applications, where training data must include all possible conditions
including rare scenarios. From this perspective, generating synthetic images is
an appealing solution, since it allows cheap yet reliable coverage of all
conditions and environments, provided the impact of the synthetic-to-real
distribution shift is mitigated. In this article, we consider the case of
runway detection, a critical component of the autonomous landing systems
developed by aircraft manufacturers. We propose an image generation approach
based on a commercial flight simulator that complements a few annotated real
images. By controlling the image generation and the integration of real and
synthetic data, we show that standard object detection models can achieve
accurate prediction. We also evaluate their robustness to adverse conditions,
in our case nighttime images that were not represented in the real data, and
show the benefit of using a customized domain adaptation strategy.
☆ AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models NeurIPS 2025
We present in this paper a novel post-training quantization (PTQ) method,
dubbed AccuQuant, for diffusion models. We show analytically and empirically
that quantization errors for diffusion models are accumulated over denoising
steps in a sampling process. To alleviate the error accumulation problem,
AccuQuant minimizes the discrepancies between outputs of a full-precision
diffusion model and its quantized version within a couple of denoising steps.
That is, it simulates multiple denoising steps of a diffusion sampling process
explicitly for quantization, accounting for the accumulated errors over multiple
denoising steps, in contrast to previous approaches that imitate the training
process of diffusion models by minimizing the discrepancies independently for
each step. We also present an efficient implementation
technique for AccuQuant, together with a novel objective, which reduces a
memory complexity significantly from $\mathcal{O}(n)$ to $\mathcal{O}(1)$,
where $n$ is the number of denoising steps. We demonstrate the efficacy and
efficiency of AccuQuant across various tasks and diffusion models on standard
benchmarks.
comment: Accepted to NeurIPS 2025
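A minimal sketch of the multi-step discrepancy objective described above: the quantized model is rolled out over several denoising steps and matched to the full-precision rollout, rather than step by step. The toy denoiser and update rule are placeholders, not a real diffusion sampler or the paper's memory-efficient implementation.

import torch
import torch.nn.functional as F

class TinyDenoiser(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16, 16)
    def forward(self, x, t):
        return self.net(x)                     # timestep ignored in this toy model

fp_model, q_model = TinyDenoiser(), TinyDenoiser()   # stand-ins for FP / quantized nets

def multi_step_discrepancy(x, timesteps):
    """Match the outputs after several denoising steps, rather than matching
    each step independently, so accumulated quantization error is penalized."""
    with torch.no_grad():
        target = x.clone()
        for t in timesteps:
            target = target - fp_model(target, t)
    pred = x.clone()
    for t in timesteps:                         # gradients flow through q_model only
        pred = pred - q_model(pred, t)
    return F.mse_loss(pred, target)

multi_step_discrepancy(torch.randn(4, 16), timesteps=range(3)).backward()
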
☆ Dino-Diffusion Modular Designs Bridge the Cross-Domain Gap in Autonomous Parking
Parking is a critical pillar of driving safety. While recent end-to-end (E2E)
approaches have achieved promising in-domain results, robustness under domain
shifts (e.g., weather and lighting changes) remains a key challenge. Rather
than relying on additional data, in this paper, we propose Dino-Diffusion
Parking (DDP), a domain-agnostic autonomous parking pipeline that integrates
visual foundation models with diffusion-based planning to enable generalized
perception and robust motion planning under distribution shifts. We train our
pipeline in CARLA under a regular setting and transfer it to more adversarial
settings in a zero-shot fashion. Our model consistently achieves a parking
success rate above 90% across all tested out-of-distribution (OOD) scenarios,
with ablation studies confirming that both the network architecture and
algorithmic design significantly enhance cross-domain performance over existing
baselines. Furthermore, testing in a 3D Gaussian splatting (3DGS) environment
reconstructed from a real-world parking lot demonstrates promising sim-to-real
transfer.
comment: Code is at
https://github.com/ChampagneAndfragrance/Dino_Diffusion_Parking_Official
☆ AnyPcc: Compressing Any Point Cloud with a Single Universal Model
Generalization remains a critical challenge for deep learning-based point
cloud geometry compression. We argue this stems from two key limitations: the
lack of robust context models and the inefficient handling of
out-of-distribution (OOD) data. To address both, we introduce AnyPcc, a
universal point cloud compression framework. AnyPcc first employs a Universal
Context Model that leverages priors from both spatial and channel-wise grouping
to capture robust contextual dependencies. Second, our novel Instance-Adaptive
Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and
implicit compression paradigms. It fine-tunes a small subset of network weights
for each instance and incorporates them into the bitstream, where the marginal
bit cost of the weights is dwarfed by the resulting savings in geometry
compression. Extensive experiments on a benchmark of 15 diverse datasets
confirm that AnyPcc sets a new state-of-the-art in point cloud compression. Our
code and datasets will be released to encourage reproducible research.
comment: 11 pages, 5 figures
☆ HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models NeurIPS2025
Multi-modal large language models (MLLMs) have emerged as a transformative
approach for aligning visual and textual understanding. They typically require
extremely high computational resources (e.g., thousands of GPUs) for training
to achieve cross-modal alignment at multi-granularity levels. We argue that a
key source of this inefficiency lies in the vision encoders they are commonly
equipped with, e.g., CLIP and SAM, which lack alignment with language at
multi-granularity levels. To address this issue, in this paper, we leverage
hyperbolic space, which inherently models hierarchical levels and thus provides
a principled framework for bridging the granularity gap between visual and
textual modalities at an arbitrary granularity level. Concretely, we propose an
efficient training paradigm for MLLMs, dubbed as HyperET, which can optimize
visual representations to align with their textual counterparts at an arbitrary
granularity level through dynamic hyperbolic radius adjustment in hyperbolic
space. HyperET employs learnable matrices with Möbius multiplication
operations, implemented via three effective configurations: diagonal scaling
matrices, block-diagonal matrices, and banded matrices, providing a flexible
yet efficient parametrization strategy. Comprehensive experiments across
multiple MLLM benchmarks demonstrate that HyperET consistently improves both
the pre-training and fine-tuning of existing MLLMs with less than 1\%
additional parameters.
comment: Accepted by NeurIPS2025
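For reference, the textbook Möbius matrix-vector multiplication on the Poincaré ball (curvature -1), which is the kind of hyperbolic operation the abstract above refers to; the paper's diagonal, block-diagonal, and banded parametrizations and its dynamic radius adjustment are not reproduced here.

import torch
import torch.nn.functional as F

def mobius_matvec(M: torch.Tensor, x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Möbius matrix-vector multiplication on the Poincaré ball (curvature -1):
    M ⊗ x = tanh(|Mx|/|x| * artanh(|x|)) * Mx / |Mx|."""
    Mx = x @ M.t()
    x_norm = x.norm(dim=-1, keepdim=True).clamp(min=eps, max=1 - eps)
    Mx_norm = Mx.norm(dim=-1, keepdim=True).clamp(min=eps)
    return torch.tanh(Mx_norm / x_norm * torch.atanh(x_norm)) * Mx / Mx_norm

x = 0.3 * F.normalize(torch.randn(5, 8), dim=-1)   # points strictly inside the unit ball
M = torch.diag(torch.rand(8))                      # e.g. a diagonal scaling matrix
print(mobius_matvec(M, x).norm(dim=-1))            # outputs remain inside the ball
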
☆ A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization
LinFeng Li, Jian Zhao, Zepeng Yang, Yuhang Song, Bojun Lin, Tianle Zhang, Yuchen Yuan, Chi Zhang, Xuelong Li
We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone
Navigation. The task retrieves the most relevant geo-referenced image from a
large multi-platform corpus (satellite/drone/ground) given a natural-language
query. Two obstacles are severe inter-platform heterogeneity and a domain gap
between generic training descriptions and platform-specific test queries. We
mitigate these with a domain-aligned preprocessing pipeline and a
Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite
augmentation, and removal of orientation words; (ii) an LLM-based caption
refinement pipeline to align textual semantics with the distinct visual
characteristics of each platform. Using BGE-M3 (text) and EVA-CLIP (image), we
train three platform experts using a progressive two-stage, hard-negative
mining strategy to enhance discriminative power, and fuse their scores at
inference. The system tops the official leaderboard, demonstrating robust
cross-modal geo-localization under heterogeneous viewpoints.
☆ Breakdance Video classification in the age of Generative AI
Large Vision Language models have recently seen wide application in several
sports use cases. Most of these works target a limited subset of popular sports
such as soccer, cricket, and basketball, focusing on generative tasks like
visual question answering and highlight generation. This work analyzes the
applicability of modern video foundation models (both encoder and decoder) to a
very niche but hugely popular dance sport: breakdance. Our results show that
Video Encoder models continue to outperform
state-of-the-art Video Language Models for prediction tasks. We provide
insights on how to choose the encoder model and provide a thorough analysis
into the workings of a finetuned decoder model for breakdance video
classification.
comment: 11 pages
☆ UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Quyu Kong, Xu Zhang, Chen Liu, Yuqi Liu, Wenxuan Wang, Yue Wang, Qin Jin, Steven Hoi
GUI grounding, which maps natural-language instructions to actionable UI
elements, is a core capability of GUI agents. Prior work largely treats
instructions as a static proxy for user intent, overlooking the impact of
instruction diversity and quality on grounding performance. Through a careful
investigation of existing grounding datasets, we find a 23.3% flaw rate in
their instructions and show that inference-time exploitation of instruction
diversity yields up to a substantial 76% relative performance improvement. In
this paper, we introduce the Instruction-as-Reasoning paradigm, treating
instructions as dynamic analytical pathways that offer distinct perspectives
and enable the model to select the most effective pathway during reasoning.
To achieve this, we propose a two-stage training framework: supervised
fine-tuning (SFT) on synthesized, diverse instructions to instill
multi-perspective reasoning, followed by reinforcement learning (RL) to
optimize pathway selection and composition. Our resulting models, UI-Ins-7B and
UI-Ins-32B, achieve state-of-the-art results on five challenging grounding
benchmarks and exhibit emergent reasoning, selectively composing and
synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B
attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on
ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model
demonstrates strong agentic potential, achieving a 74.1% success rate on
AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals
additional insights such as how reasoning can be formulated to enhance rather
than hinder grounding performance, and how our method mitigates policy collapse
in the SFT+RL framework. All code and model checkpoints will be publicly
released in https://github.com/alibaba/UI-Ins.
☆ DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering
Egocentric Video Question Answering (Egocentric VideoQA) plays an important
role in egocentric video understanding, which refers to answering questions
based on first-person videos. Although existing methods have made progress
through the paradigm of pre-training and fine-tuning, they ignore the unique
challenges posed by the first-person perspective, such as understanding
multiple events and recognizing hand-object interactions. To deal with these
challenges, we propose a Dual-Modal Counterfactual Contrastive Construction
(DMC$^3$) framework, which contains an egocentric videoqa baseline, a
counterfactual sample construction module and a counterfactual sample-involved
contrastive optimization. Specifically, we first develop a counterfactual
sample construction module to generate positive and negative samples for
textual and visual modalities through event description paraphrasing and core
interaction mining, respectively. Then, we feed these samples together with the
original samples into the baseline. Finally, in the counterfactual
sample-involved contrastive optimization module, we apply contrastive loss to
minimize the distance between the original sample features and the positive
sample features, while maximizing the distance from the negative samples.
Experiments show that our method achieves 52.51\% and 46.04\% on the
\textit{normal} and \textit{indirect} splits of EgoTaskQA, and 13.2\% on
QAEGO4D, reaching state-of-the-art performance on both benchmarks.
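An InfoNCE-style stand-in for the counterfactual contrastive objective sketched above, pulling original-sample features toward their positives and away from counterfactual negatives; the exact loss and feature shapes used by DMC$^3$ may differ.

import torch
import torch.nn.functional as F

def counterfactual_contrastive_loss(anchor, positive, negatives, tau: float = 0.07):
    """InfoNCE-style objective: pull each original-sample feature toward its
    positive counterpart and push it away from K counterfactual negatives."""
    anchor = F.normalize(anchor, dim=-1)                        # (B, D)
    positive = F.normalize(positive, dim=-1)                    # (B, D)
    negatives = F.normalize(negatives, dim=-1)                  # (B, K, D)
    pos = (anchor * positive).sum(-1, keepdim=True)             # (B, 1)
    neg = torch.einsum("bd,bkd->bk", anchor, negatives)         # (B, K)
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(anchor.size(0), dtype=torch.long)      # positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = counterfactual_contrastive_loss(torch.randn(4, 128), torch.randn(4, 128),
                                        torch.randn(4, 8, 128))
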
☆ Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition
Deep learning models for complex-valued Synthetic Aperture Radar (CV-SAR)
image recognition are fundamentally constrained by a representation trilemma
under data-limited and domain-shift scenarios: the concurrent, yet conflicting,
optimization of generalization, interpretability, and efficiency. Our work is
motivated by the premise that the rich electromagnetic scattering features
inherent in CV-SAR data hold the key to resolving this trilemma, yet they are
insufficiently harnessed by conventional data-driven models. To this end, we
introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework
built upon a novel "compression-aggregation-compression" architecture. The
first stage performs a physics-guided compression, wherein a novel dictionary
processor adaptively embeds physical priors, enabling a compact unfolding
network to efficiently extract sparse, physically-grounded signatures. A
subsequent aggregation module enriches these representations, followed by a
final semantic compression stage that utilizes a compact classification head
with self-distillation to learn maximally task-relevant and discriminative
embeddings. We instantiate KINN in both CNN (0.7M) and Vision Transformer
(0.95M) variants. Extensive evaluations on five SAR benchmarks confirm that
KINN establishes a state-of-the-art in parameter-efficient recognition,
offering exceptional generalization in data-scarce and out-of-distribution
scenarios and tangible interpretability, thereby providing an effective
solution to the representation trilemma and offering a new path for trustworthy
AI in SAR image analysis.
☆ Causal Debiasing for Visual Commonsense Reasoning
Visual Commonsense Reasoning (VCR) refers to answering questions and
providing explanations based on images. While existing methods achieve high
prediction accuracy, they often overlook bias in datasets and lack debiasing
strategies. In this paper, our analysis reveals co-occurrence and statistical
biases in both textual and visual data. We introduce the VCR-OOD datasets,
comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate
the generalization capabilities of models across two modalities. Furthermore,
we analyze the causal graphs and prediction shortcuts in VCR and adopt a
backdoor adjustment method to remove bias. Specifically, we create a dictionary
based on the set of correct answers to eliminate prediction shortcuts.
Experiments demonstrate the effectiveness of our debiasing method across
different datasets.
☆ GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection
Video anomaly detection (VAD) is a challenging task that detects anomalous
frames in continuous surveillance videos. Most previous work utilizes the
spatio-temporal correlation of visual features to distinguish whether there are
abnormalities in video snippets. Recently, some works attempt to introduce
multi-modal information, like text feature, to enhance the results of video
anomaly detection. However, these works merely incorporate text features into
video snippets in a coarse manner, overlooking the significant amount of
redundant information that may exist within the video snippets. Therefore, we
propose to leverage the diversity among multi-modal information to further
refine the extracted features, reducing the redundancy in visual features, and
we propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD).
Specifically, we generate a more fine-grained multi-modal feature based on the
video snippet, which summarizes its main content, and introduce text features
based on captions of the original video to further enhance the visual features
of highlighted portions. Experiments show that the proposed GMFVAD achieves
state-of-the-art performance on four mainstream datasets. Ablation
experiments also validate that the improvement of GMFVAD is due to the
reduction of redundant information.
☆ Real-Time Currency Detection and Voice Feedback for Visually Impaired Individuals
Technologies like smartphones have become essential in our daily lives and are
accessible to everyone, including visually impaired individuals. With
smartphone cameras, image capture and processing have become more convenient,
and combined with machine learning they can make the lives of visually impaired
people a little easier. Daily tasks such as handling money without relying on
someone else can be troublesome for them. For that purpose, this paper presents
a real-time currency detection system designed to assist
visually impaired individuals. The proposed model is trained on a dataset
containing 30 classes of notes and coins, representing 3 types of currency: US
dollar (USD), Euro (EUR), and Bangladeshi taka (BDT). Our approach uses a
YOLOv8 nano model with a custom detection head featuring deep convolutional
layers and Squeeze-and-Excitation blocks to enhance feature extraction and
detection accuracy. Our model achieves an accuracy of 97.73%, a recall of
95.23%, an F1-score of 95.85%, and a mean Average Precision at IoU=0.5
(mAP50(B)) of 97.21%. Voice feedback after detection helps visually impaired
users identify the currency. This paper aims to create a practical and
efficient currency detection system that empowers visually impaired individuals
to handle money independently.
comment: 20 pages, 5 tables, 8 figures
☆ GUSL-Dehaze: A Green U-Shaped Learning Approach to Image Dehazing
Image dehazing is a restoration task that aims to recover a clear image from
a single hazy input. Traditional approaches rely on statistical priors and the
physics-based atmospheric scattering model to reconstruct the haze-free image.
While recent state-of-the-art methods are predominantly based on deep learning
architectures, these models often involve high computational costs and large
parameter sizes, making them unsuitable for resource-constrained devices. In
this work, we propose GUSL-Dehaze, a Green U-Shaped Learning approach to image
dehazing. Our method integrates a physics-based model with a green learning
(GL) framework, offering a lightweight, transparent alternative to conventional
deep learning techniques. Unlike neural network-based solutions, GUSL-Dehaze
completely avoids deep learning. Instead, we begin with an initial dehazing
step using a modified Dark Channel Prior (DCP), which is followed by a green
learning pipeline implemented through a U-shaped architecture. This
architecture employs unsupervised representation learning for effective feature
extraction, together with feature-engineering techniques such as the Relevant
Feature Test (RFT) and the Least-Squares Normal Transform (LNT) to maintain a
compact model size. Finally, the dehazed image is obtained via a transparent
supervised learning strategy. GUSL-Dehaze significantly reduces parameter count
while ensuring mathematical interpretability and achieving performance on par
with state-of-the-art deep learning models.
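For context, the textbook dark channel prior and transmission estimate that the initial dehazing step described above builds on; the paper uses a modified DCP, so this is only the standard formulation, with toy inputs.

import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(image: np.ndarray, patch: int = 15) -> np.ndarray:
    """Textbook dark channel: per-pixel minimum over RGB, then a local minimum filter."""
    return minimum_filter(image.min(axis=2), size=patch)

def estimate_transmission(hazy: np.ndarray, atmosphere: np.ndarray,
                          omega: float = 0.95, patch: int = 15) -> np.ndarray:
    """t(x) = 1 - omega * dark_channel(I(x) / A), as in the standard DCP formulation."""
    return 1.0 - omega * dark_channel(hazy / atmosphere[None, None, :], patch)

hazy = np.random.rand(64, 64, 3).astype(np.float32)
A = np.array([0.9, 0.9, 0.9], dtype=np.float32)     # toy atmospheric light estimate
transmission = estimate_transmission(hazy, A)
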
☆ Kinaema: a recurrent sequence model for memory and pose in motion
One key aspect of spatially aware robots is the ability to "find their
bearings", i.e., to correctly situate themselves in previously seen spaces. In
this work, we focus on this particular scenario of continuous robotics
operations, where information observed before an actual episode start is
exploited to optimize efficiency. We introduce a new model, Kinaema, and an
agent capable of integrating a stream of visual observations while moving in a
potentially large scene, and upon request, processing a query image and
predicting the relative position of the shown space with respect to its current
position. Our model does not explicitly store an observation history, therefore
does not have hard constraints on context length. It maintains an implicit
latent memory, which is updated by a transformer in a recurrent way,
compressing the history of sensor readings into a compact representation. We
evaluate the impact of this model in a new downstream task we call "Mem-Nav".
We show that our large-capacity recurrent model maintains a useful
representation of the scene, navigates to goals observed before the actual
episode start, and is computationally efficient, in particular compared to
classical transformers with attention over an observation history.
comment: 10 pages + references + checklist + appendix, 29 pages total
☆ Calibrating Multimodal Consensus for Emotion Recognition
In recent years, Multimodal Emotion Recognition (MER) has made substantial
progress. Nevertheless, most existing approaches neglect the semantic
inconsistencies that may arise across modalities, such as conflicting emotional
cues between text and visual inputs. Besides, current methods are often
dominated by the text modality due to its strong representational capacity,
which can compromise recognition accuracy. To address these challenges, we
propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a
Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels,
enabling unimodal pretraining in a self-supervised fashion. It then employs a
Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for
multimodal finetuning, thereby mitigating text dominance and guiding the fusion
process toward a more reliable consensus. Experimental results demonstrate that
CMC achieves performance on par with or superior to state-of-the-art methods
across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and
exhibits notable advantages in scenarios with semantic inconsistencies on
CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible
at https://github.com/gw-zhong/CMC.
☆ Seeing the Unseen: Mask-Driven Positional Encoding and Strip-Convolution Context Modeling for Cross-View Object Geo-Localization
Cross-view object geo-localization enables high-precision object localization
through cross-view matching, with critical applications in autonomous driving,
urban management, and disaster response. However, existing methods rely on
keypoint-based positional encoding, which captures only 2D coordinates while
neglecting object shape information, resulting in sensitivity to annotation
shifts and limited cross-view matching capability. To address these
limitations, we propose a Mask-based Positional Encoding (MPE) scheme that leverages
segmentation masks to capture both spatial coordinates and object silhouettes,
thereby upgrading the model from "location-aware" to "object-aware."
Furthermore, to tackle the challenge of large-span objects (e.g., elongated
buildings) in satellite imagery, we design a Context Enhancement Module (CEM). This
module employs horizontal and vertical strip convolutional kernels to extract
long-range contextual features, enhancing feature discrimination among
strip-like objects. Integrating MPE and CEM, we present EDGeo, an end-to-end
framework for robust cross-view object geo-localization. Extensive experiments
on two public datasets (CVOGL and VIGOR-Building) demonstrate that our method
achieves state-of-the-art performance, with a 3.39% improvement in localization
accuracy under challenging ground-to-satellite scenarios. This work provides a
robust positional encoding paradigm and a contextual modeling framework for
advancing cross-view geo-localization research.
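A toy sketch of a strip-convolution context block in the spirit of the CEM described above, using horizontal and vertical kernels to aggregate long-range context for elongated objects; the kernel sizes, depthwise design, and residual fusion are illustrative assumptions.

import torch
import torch.nn as nn

class StripContext(nn.Module):
    """Toy context block with horizontal and vertical strip convolutions for
    long-range context along elongated (strip-like) objects."""

    def __init__(self, channels: int = 64, k: int = 11):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, k),
                                    padding=(0, k // 2), groups=channels)
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(k, 1),
                                  padding=(k // 2, 0), groups=channels)
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fuse(self.horizontal(x) + self.vertical(x))   # residual fusion

print(StripContext()(torch.randn(1, 64, 32, 48)).shape)   # torch.Size([1, 64, 32, 48])
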
☆ Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding NeurIPS 2025
Video Temporal Grounding (VTG) aims to localize temporal segments in long,
untrimmed videos that align with a given natural language query. This task
typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection
(HD). While recent advances have been driven by powerful pretrained
vision-language models such as CLIP and InternVideo2, existing approaches
commonly treat all text tokens uniformly during crossmodal attention,
disregarding their distinct semantic roles. To validate the limitations of this
approach, we conduct controlled experiments demonstrating that VTG models
overly rely on [EOS]-driven global semantics while failing to effectively
utilize word-level signals, which limits their ability to achieve fine-grained
temporal alignment. Motivated by this limitation, we propose DualGround, a
dual-branch architecture that explicitly separates global and local semantics
by routing the [EOS] token through a sentence-level path and clustering word
tokens into phrase-level units for localized grounding. Our method introduces
(1) token-role-aware cross-modal interaction strategies that align video
features with sentence-level and phrase-level semantics in a structurally
disentangled manner, and (2) a joint modeling framework that not only improves
global sentence-level alignment but also enhances fine-grained temporal
grounding by leveraging structured phrase-aware context. This design allows the
model to capture both coarse and localized semantics, enabling more expressive
and context-aware video grounding. DualGround achieves state-of-the-art
performance on both Moment Retrieval and Highlight Detection tasks across
QVHighlights and Charades-STA benchmarks, demonstrating the effectiveness of
disentangled semantic modeling in video-language alignment.
comment: Comments: 28 pages, including appendix. 5 figures. Full version of
the NeurIPS 2025 paper
☆ COS3D: Collaborative Open-Vocabulary 3D Segmentation NeurIPS 2025
Runsong Zhu, Ka-Hei Hui, Zhengzhe Liu, Qianyi Wu, Weiliang Tang, Shi Qiu, Pheng-Ann Heng, Chi-Wing Fu
Open-vocabulary 3D segmentation is a fundamental yet challenging task,
requiring a mutual understanding of both segmentation and language. However,
existing Gaussian-splatting-based methods rely either on a single 3D language
field, leading to inferior segmentation, or on pre-computed class-agnostic
segmentations, suffering from error accumulation. To address these limitations,
we present COS3D, a new collaborative prompt-segmentation framework that
contributes to effectively integrating complementary language and segmentation
cues throughout its entire pipeline. We first introduce the new concept of
collaborative field, comprising an instance field and a language field, as the
cornerstone for collaboration. During training, to effectively construct the
collaborative field, our key idea is to capture the intrinsic relationship
between the instance field and language field, through a novel
instance-to-language feature mapping and designing an efficient two-stage
training strategy. During inference, to bridge distinct characteristics of the
two fields, we further design an adaptive language-to-instance prompt
refinement, promoting high-quality prompt-segmentation inference. Extensive
experiments not only demonstrate COS3D's leading performance over existing
methods on two widely-used benchmarks but also show its high potential for
various applications, i.e., novel image-based 3D segmentation, hierarchical
segmentation, and robotics. The code is publicly available at
\href{https://github.com/Runsong123/COS3D}{https://github.com/Runsong123/COS3D}.
comment: NeurIPS 2025. The code is publicly available at
\href{https://github.com/Runsong123/COS3D}{https://github.com/Runsong123/COS3D}
☆ Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
Large Vision-Language Models (LVLMs) have made significant progress in recent
years but are also prone to hallucination issues. They exhibit more
hallucinations in longer, free-form responses, often attributed to accumulated
uncertainties. In this paper, we ask: Does increased hallucination result
solely from length-induced errors, or is there a deeper underlying mechanism?
After a series of preliminary experiments and findings, we suggest that the
risk of hallucinations is not caused by length itself but by the increased
reliance on context for coherence and completeness in longer responses.
Building on these insights, we propose a novel "induce-detect-suppress"
framework that actively induces hallucinations through deliberately designed
contexts, leverages induced instances for early detection of high-risk cases,
and ultimately suppresses potential object-level hallucinations during actual
decoding. Our approach achieves consistent, significant improvements across all
benchmarks, demonstrating its efficacy. The strong detection and improved
hallucination mitigation not only validate our framework but, more importantly,
re-validate our hypothesis on context. Rather than solely pursuing performance
gains, this study aims to provide new insights and serves as a first step
toward a deeper exploration of hallucinations in LVLMs' longer responses.
☆ EditInfinity: Image Editing with Binary-Quantized Generative Models NeurIPS 2025
Adapting pretrained diffusion-based generative models for text-driven image
editing with negligible tuning overhead has demonstrated remarkable potential.
A classical adaptation paradigm, as followed by these methods, first infers the
generative trajectory inversely for a given source image by image inversion,
then performs image editing along the inferred trajectory guided by the target
text prompts. However, the performance of image editing is heavily limited by
the approximation errors introduced during image inversion by diffusion models,
which arise from the absence of exact supervision in the intermediate
generative steps. To circumvent this issue, we investigate the
parameter-efficient adaptation of VQ-based generative models for image editing,
and leverage their inherent characteristic that the exact intermediate
quantized representations of a source image are attainable, enabling more
effective supervision for precise image inversion. Specifically, we propose
\emph{EditInfinity}, which adapts \emph{Infinity}, a binary-quantized
generative model, for image editing. We propose an efficient yet effective
image inversion mechanism that integrates text prompting rectification and
image style preservation, enabling precise image inversion. Furthermore, we
devise a holistic smoothing strategy which allows our \emph{EditInfinity} to
perform image editing with high fidelity to source images and precise semantic
alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark
across "add", "change", and "delete" editing operations, demonstrate the
superior performance of our model compared to state-of-the-art diffusion-based
baselines. Code available at: https://github.com/yx-chen-ust/EditInfinity.
comment: 28 pages, 13 figures, accepted by The Thirty-ninth Annual Conference
on Neural Information Processing Systems (NeurIPS 2025)
☆ Towards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection
Talha Ilyas, Duong Nhu, Allison Thomas, Arie Levin, Lim Wei Yap, Shu Gong, David Vera Anaya, Yiwen Jiang, Deval Mehta, Ritesh Warty, Vinayak Smith, Maya Reddy, Euan Wallace, Wenlong Cheng, Zongyuan Ge, Faezeh Marzbanrad
Accurate fetal movement (FM) detection is essential for assessing prenatal
health, as abnormal movement patterns can indicate underlying complications
such as placental dysfunction or fetal distress. Traditional methods, including
maternal perception and cardiotocography (CTG), suffer from subjectivity and
limited accuracy. To address these challenges, we propose Contrastive
Ultrasound Video Representation Learning (CURL), a novel self-supervised
learning framework for FM detection from extended fetal ultrasound video
recordings. Our approach leverages a dual-contrastive loss, incorporating both
spatial and temporal contrastive learning, to learn robust motion
representations. Additionally, we introduce a task-specific sampling strategy,
ensuring the effective separation of movement and non-movement segments during
self-supervised training, while enabling flexible inference on arbitrarily long
ultrasound recordings through a probabilistic fine-tuning approach. Evaluated
on an in-house dataset of 92 subjects, each with 30-minute ultrasound sessions,
CURL achieves a sensitivity of 78.01% and an AUROC of 81.60%, demonstrating its
potential for reliable and objective FM analysis. These results highlight the
potential of self-supervised contrastive learning for fetal movement analysis,
paving the way for improved prenatal monitoring and clinical decision-making.
comment: This is the preprint version of the manuscript submitted to IEEE
Journal of Biomedical and Health Informatics (JBHI) for review
☆ FlowCycle: Pursuing Cycle-Consistent Flows for Text-based Editing
Recent advances in pre-trained text-to-image flow models have enabled
remarkable progress in text-based image editing. Mainstream approaches always
adopt a corruption-then-restoration paradigm, where the source image is first
corrupted into an ``intermediate state'' and then restored to the target image
under the prompt guidance. However, current methods construct this intermediate
state in a target-agnostic manner, i.e., they primarily focus on realizing
source image reconstruction while neglecting the semantic gaps towards the
specific editing target. This design inherently results in limited editability
or inconsistency when the desired modifications substantially deviate from the
source. In this paper, we argue that the intermediate state should be
target-aware, i.e., selectively corrupting editing-relevant contents while
preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel
inversion-free and flow-based editing framework that parameterizes corruption
with learnable noises and optimizes them through a cycle-consistent process. By
iteratively editing the source to the target and recovering back to the source
with dual consistency constraints, FlowCycle learns to produce a target-aware
intermediate state, enabling faithful modifications while preserving source
consistency. Extensive ablations have demonstrated that FlowCycle achieves
superior editing quality and consistency over state-of-the-art methods.
☆ RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Qingyang Liu, Yu Qiao, Xinyuan Chen, Yaohui Wang, Li Niu
Prompt design plays a crucial role in text-to-video (T2V) generation, yet
user-provided prompts are often short, unstructured, and misaligned with
training data, limiting the generative potential of diffusion-based T2V models.
We present \textbf{RAPO++}, a cross-stage prompt optimization framework that
unifies training-data-aligned refinement, test-time iterative scaling, and
large language model (LLM) fine-tuning to substantially improve T2V generation
without modifying the underlying generative backbone. In \textbf{Stage 1},
Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with
semantically relevant modifiers retrieved from a relation graph and refactors
them to match training distributions, enhancing compositionality and
multi-object fidelity. \textbf{Stage 2} introduces Sample-Specific Prompt
Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts
using multi-source feedback -- including semantic alignment, spatial fidelity,
temporal coherence, and task-specific signals such as optical flow -- yielding
progressively improved video generation quality. \textbf{Stage 3} leverages
optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing
task-specific optimization patterns and enabling efficient, high-quality prompt
generation even before inference. Extensive experiments across five
state-of-the-art T2V models and five benchmarks demonstrate that RAPO++
achieves significant gains in semantic alignment, compositional reasoning,
temporal stability, and physical plausibility, outperforming existing methods
by large margins. Our results highlight RAPO++ as a model-agnostic,
cost-efficient, and scalable solution that sets a new standard for prompt
optimization in T2V generation. The code is available at
https://github.com/Vchitect/RAPO.
☆ A Structured Review and Quantitative Profiling of Public Brain MRI Datasets for Foundation Model Development
The development of foundation models for brain MRI depends critically on the
scale, diversity, and consistency of available data, yet systematic assessments
of these factors remain scarce. In this study, we analyze 54 publicly
accessible brain MRI datasets encompassing over 538,031 to provide a
structured, multi-level overview tailored to foundation model development. At
the dataset level, we characterize modality composition, disease coverage, and
dataset scale, revealing strong imbalances between large healthy cohorts and
smaller clinical populations. At the image level, we quantify voxel spacing,
orientation, and intensity distributions across 15 representative datasets,
demonstrating substantial heterogeneity that can influence representation
learning. We then perform a quantitative evaluation of preprocessing
variability, examining how intensity normalization, bias field correction,
skull stripping, spatial registration, and interpolation alter voxel statistics
and geometry. While these steps improve within-dataset consistency, residual
differences persist between datasets. Finally, a feature-space case study using a
3D DenseNet121 shows measurable residual covariate shift after standardized
preprocessing, confirming that harmonization alone cannot eliminate
inter-dataset bias. Together, these analyses provide a unified characterization
of variability in public brain MRI resources and emphasize the need for
preprocessing-aware and domain-adaptive strategies in the design of
generalizable brain MRI foundation models.
☆ Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
Question Answering (QA) systems have traditionally relied on structured text
data, but the rapid growth of multimedia content (images, audio, video, and
structured metadata) has introduced new challenges and opportunities for
retrieval-augmented QA. In this survey, we review recent advancements in QA
systems that integrate multimedia retrieval pipelines, focusing on
architectures that align vision, language, and audio modalities with user
queries. We categorize approaches based on retrieval methods, fusion
techniques, and answer generation strategies, and analyze benchmark datasets,
evaluation protocols, and performance tradeoffs. Furthermore, we highlight key
challenges such as cross-modal alignment, latency-accuracy tradeoffs, and
semantic grounding, and outline open problems and future research directions
for building more robust and context-aware QA systems leveraging multimedia
data.
comment: In Proceedings of the 2nd ACM Workshop in AI-powered Question and
Answering Systems (AIQAM '25), October 27-28, 2025, Dublin, Ireland. ACM, New
York, NY, USA, 8 pages. https://doi.org/10.1145/3746274.3760393
☆ SPAN: Continuous Modeling of Suspicion Progression for Temporal Intention Localization
Temporal Intention Localization (TIL) is crucial for video surveillance,
focusing on identifying varying levels of suspicious intentions to improve
security monitoring. However, existing discrete classification methods fail to
capture the continuous nature of suspicious intentions, limiting early
intervention and explainability. In this paper, we propose the Suspicion
Progression Analysis Network (SPAN), which shifts from discrete classification
to continuous regression, enabling the capture of fluctuating and evolving
suspicious intentions. We reveal that suspicion exhibits long-term dependencies
and cumulative effects, similar to Temporal Point Process (TPP) theory. Based
on these insights, we define a suspicion score formula that models continuous
changes while accounting for temporal characteristics. We also introduce
Suspicion Coefficient Modulation, which adjusts suspicion coefficients using
multimodal information to reflect the varying impacts of suspicious actions.
Additionally, the Concept-Anchored Mapping method is proposed to link
suspicious actions to predefined intention concepts, offering insights into
both the actions and their potential underlying intentions. Extensive
experiments on the HAI dataset show that SPAN significantly outperforms
existing methods, reducing MSE by 19.8% and improving average mAP by 1.78%.
Notably, SPAN achieves a 2.74% mAP gain in low-frequency cases, demonstrating
its superior ability to capture subtle behavioral changes. Compared to discrete
classification systems, our continuous suspicion modeling approach enables
earlier detection and proactive intervention, greatly enhancing system
explainability and practical utility in security applications.
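To make the Temporal Point Process analogy above concrete, a generic Hawkes-style accumulation in which each past suspicious action contributes a modulated coefficient that decays over time; this is a stand-in under stated assumptions, not the exact suspicion score formula defined in the paper.

import numpy as np

def suspicion_score(event_times, coefficients, t, tau: float = 10.0):
    """Hawkes-style accumulation: every past suspicious action contributes a
    modulated coefficient that decays exponentially with elapsed time, giving
    long-term dependence and cumulative effects."""
    event_times = np.asarray(event_times, dtype=float)
    coefficients = np.asarray(coefficients, dtype=float)
    past = event_times <= t
    return float(np.sum(coefficients[past] * np.exp(-(t - event_times[past]) / tau)))

# Two suspicious actions at t=2 and t=8 with modulated coefficients 0.4 and 0.9.
print(suspicion_score([2.0, 8.0], [0.4, 0.9], t=10.0))
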
☆ Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories
Large-scale video generation models have demonstrated high visual realism in
diverse contexts, spurring interest in their potential as general-purpose world
simulators. Existing benchmarks focus on individual subjects rather than scenes
with multiple interacting people. However, the plausibility of multi-agent
dynamics in generated videos remains unverified. We propose a rigorous
evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V)
models as implicit simulators of pedestrian dynamics. For I2V, we leverage
start frames from established datasets to enable comparison with a ground truth
video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian
densities and interactions. A key component is a method to reconstruct 2D
bird's-eye view trajectories from pixel-space without known camera parameters.
Our analysis reveals that leading models have learned surprisingly effective
priors for plausible multi-agent behavior. However, failure modes like merging
and disappearing people highlight areas for future improvement.
comment: Preprint, under review
☆ PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching
Temporally consistent depth estimation from stereo video is critical for
real-world applications such as augmented reality, where inconsistent depth
estimation disrupts the immersion of users. Despite its importance, this task
remains challenging due to the difficulty in modeling long-term temporal
consistency in a computationally efficient manner. Previous methods attempt to
address this by aggregating spatio-temporal information but face a fundamental
trade-off: limited temporal modeling provides only modest gains, whereas
capturing long-range dependencies significantly increases computational cost.
To address this limitation, we introduce a memory buffer for modeling
long-range spatio-temporal consistency while achieving efficient dynamic stereo
matching. Inspired by the two-stage decision-making process in humans, we
propose a \textbf{P}ick-and-\textbf{P}lay \textbf{M}emory (PPM) construction
module for dynamic \textbf{Stereo} matching, dubbed as \textbf{PPMStereo}. PPM
consists of a `pick' process that identifies the most relevant frames and a
`play' process that weights the selected frames adaptively for spatio-temporal
aggregation. This two-stage collaborative process maintains a compact yet
highly informative memory buffer while achieving temporally consistent
information aggregation. Extensive experiments validate the effectiveness of
PPMStereo, demonstrating state-of-the-art performance in both accuracy and
temporal consistency. Notably, PPMStereo achieves 0.62/1.11 TEPE on the
Sintel clean/final passes (17.3\% and 9.02\% improvements over BiDAStereo) with
lower computational cost. Code is available at
https://github.com/cocowy1/PPMStereo.
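A toy rendering of the pick-and-play idea described above: 'pick' the most relevant memory frames by similarity to the current frame, then 'play' them with adaptive softmax weights; the feature shapes and the cosine-similarity measure are assumptions, not the paper's exact design.

import torch
import torch.nn.functional as F

def pick_and_play(query, memory, k: int = 4):
    """'Pick' the k memory frames most similar to the current frame feature,
    then 'play' them with softmax weights for spatio-temporal aggregation.
    query: (D,), memory: (T, D) buffer of past frame features."""
    sim = F.cosine_similarity(memory, query.unsqueeze(0), dim=-1)    # (T,)
    topk = torch.topk(sim, k=min(k, memory.size(0)))                 # pick
    weights = torch.softmax(topk.values, dim=0)                      # play
    return (weights.unsqueeze(-1) * memory[topk.indices]).sum(dim=0)

aggregated = pick_and_play(torch.randn(256), torch.randn(20, 256))
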
☆ IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks AAAI
We propose a new GAN-based unsupervised model for disentangled representation
learning. The new model is discovered in an attempt to utilize the Information
Bottleneck (IB) framework to the optimization of GAN, thereby named IB-GAN. The
architecture of IB-GAN is partially similar to that of InfoGAN but has a
critical difference; an intermediate layer of the generator is leveraged to
constrain the mutual information between the input and the generated output.
The intermediate stochastic layer can serve as a learnable latent distribution
that is trained with the generator jointly in an end-to-end fashion. As a
result, the generator of IB-GAN can harness the latent space in a disentangled
and interpretable manner. With the experiments on dSprites and Color-dSprites
dataset, we demonstrate that IB-GAN achieves competitive disentanglement scores
to those of state-of-the-art $\beta$-VAEs and outperforms InfoGAN. Moreover,
the visual quality and the diversity of samples generated by IB-GAN are often
better than those by $\beta$-VAEs and InfoGAN in terms of FID score on the
CelebA and 3D Chairs datasets.
comment: Published in the Proceedings of the Thirty Fifth AAAI Conference on
Artificial Intelligence (AAAI 2021), paper number 7926
☆ TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning NeurIPS 2025
Compositional Zero-Shot Learning (CZSL) aims to recognize novel
attribute-object compositions based on the knowledge learned from seen ones.
Existing methods suffer from performance degradation caused by the distribution
shift of label space at test time, which stems from the inclusion of unseen
compositions recombined from attributes and objects. To overcome the challenge,
we propose a novel approach that accumulates comprehensive knowledge in both
textual and visual modalities from unsupervised data to update multimodal
prototypes at test time. Building on this, we further design an adaptive update
weight to control the degree of prototype adjustment, enabling the model to
flexibly adapt to distribution shift during testing. Moreover, a dynamic
priority queue is introduced that stores high-confidence images to acquire
visual knowledge from historical images for inference. Considering the semantic
consistency of multimodal knowledge, we align textual and visual prototypes by
multimodal collaborative representation learning. Extensive experiments
indicate that our approach achieves state-of-the-art performance on four
benchmark datasets under both closed-world and open-world settings. Code will
be available at https://github.com/xud-yan/TOMCAT .
comment: Accepted to NeurIPS 2025
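As a rough illustration of the test-time prototype update described above, the sketch below accumulates knowledge from unlabeled test images into composition prototypes, with an adaptive weight derived from prediction confidence controlling the degree of adjustment. The exponential-moving-average form, the confidence-based weight, and the tensor shapes are assumptions; the paper's actual update and priority-queue logic differ in detail.

```python
import torch
import torch.nn.functional as F

def update_prototypes(prototypes, feats, probs, base_lr=0.1):
    """Hedged sketch of a test-time multimodal prototype update.

    prototypes: (K, D) current (textual or visual) composition prototypes
    feats:      (B, D) features of unlabeled test images
    probs:      (B, K) softmax similarities of each image to each composition
    """
    feats = F.normalize(feats, dim=-1)
    # Soft assignment of images to compositions.
    assign = probs / probs.sum(dim=0, keepdim=True).clamp_min(1e-8)   # (B, K)
    batch_proto = assign.t() @ feats                                  # (K, D)
    # Adaptive update weight from mean confidence (an assumption of this sketch):
    # uncertain batches move the prototypes less.
    confidence = probs.max(dim=-1).values.mean()
    lr = base_lr * confidence
    return F.normalize((1 - lr) * prototypes + lr * batch_proto, dim=-1)
```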
☆ Monocular Visual 8D Pose Estimation for Articulated Bicycles and Cyclists
In Autonomous Driving, cyclists belong to the safety-critical class of
Vulnerable Road Users (VRU), and accurate estimation of their pose is critical
for cyclist crossing intention classification, behavior prediction, and
collision avoidance. Unlike rigid objects, articulated bicycles are composed of
movable rigid parts linked by joints and constrained by a kinematic structure.
6D pose methods can estimate the 3D rotation and translation of rigid bicycles,
but 6D becomes insufficient when the steering/pedal angles of the bicycle
vary. This is because: 1) varying the articulated pose of the bicycle causes
its 3D bounding box to vary as well, and 2) the 3D box orientation is not
necessarily aligned with the orientation of the steering, which determines the
actual intended travel direction. In this work, we introduce a method for
category-level 8D pose estimation for articulated bicycles and cyclists from a
single RGB image. Besides being able to estimate the 3D translation and
rotation of a bicycle from a single image, our method also estimates the
rotations of its steering handles and pedals with respect to the bicycle body
frame. These two new parameters enable the estimation of a more fine-grained
bicycle pose state and travel direction. Our proposed model jointly estimates
the 8D pose and the 3D keypoints of articulated bicycles, and trains with a mix
of synthetic and real image data to generalize to real images. We include an
evaluation section where we evaluate the accuracy of our estimated 8D pose
parameters, and our method shows promising results by achieving competitive
scores when compared against state-of-the-art category-level 6D pose estimators
that use rigid canonical object templates for matching.
☆ PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding NeurIPS 2025
Understanding objects at the level of their constituent parts is fundamental
to advancing computer vision, graphics, and robotics. While datasets like
PartNet have driven progress in 3D part understanding, their reliance on
untextured geometries and expert-dependent annotation limits scalability and
usability. We introduce PartNeXt, a next-generation dataset addressing these
gaps with over 23,000 high-quality, textured 3D models annotated with
fine-grained, hierarchical part labels across 50 categories. We benchmark
PartNeXt on two tasks: (1) class-agnostic part segmentation, where
state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with
fine-grained and leaf-level parts, and (2) 3D part-centric question answering,
a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary
part grounding. Additionally, training Point-SAM on PartNeXt yields substantial
gains over PartNet, underscoring the dataset's superior quality and diversity.
By combining scalable annotation, texture-aware labels, and multi-task
evaluation, PartNeXt opens new avenues for research in structured 3D
understanding.
comment: NeurIPS 2025 DB Track. Project page:
https://authoritywang.github.io/partnext
☆ Revisiting Logit Distributions for Reliable Out-of-Distribution Detection NeurIPS 2025
Out-of-distribution (OOD) detection is critical for ensuring the reliability
of deep learning models in open-world applications. While post-hoc methods are
favored for their efficiency and ease of deployment, existing approaches often
underexploit the rich information embedded in the model's logits space. In this
paper, we propose LogitGap, a novel post-hoc OOD detection method that
explicitly exploits the relationship between the maximum logit and the
remaining logits to enhance the separability between in-distribution (ID) and
OOD samples. To further improve its effectiveness, we refine LogitGap by
focusing on a more compact and informative subset of the logit space.
Specifically, we introduce a training-free strategy that automatically
identifies the most informative logits for scoring. We provide both theoretical
analysis and empirical evidence to validate the effectiveness of our approach.
Extensive experiments on both vision-language and vision-only models
demonstrate that LogitGap consistently achieves state-of-the-art performance
across diverse OOD detection scenarios and benchmarks. Code is available at
https://github.com/GIT-LJc/LogitGap.
comment: Accepted by NeurIPS 2025
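A minimal version of the logit-gap idea can be written as a post-hoc score over a compact subset of logits; the top-m truncation and the mean over the remaining logits below are illustrative assumptions rather than the paper's exact scoring rule.

```python
import torch

def logit_gap_score(logits, top_m=10):
    """Illustrative OOD score in the spirit of LogitGap (details are assumptions).

    logits: (B, C) classifier or vision-language similarity logits.
    Returns a score that is larger for in-distribution samples: the gap between
    the maximum logit and the mean of the next `top_m` most informative logits.
    """
    top = logits.topk(top_m + 1, dim=-1).values   # (B, top_m + 1)
    max_logit = top[:, 0]
    rest_mean = top[:, 1:].mean(dim=-1)
    return max_logit - rest_mean                  # threshold this to flag OOD
```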
☆ Inverse Image-Based Rendering for Light Field Generation from Single Images
Light fields computed from multi-view images on regular grids have proven
beneficial for scene representation, supporting realistic rendering of novel
views and photographic effects such as refocusing and shallow depth of field.
Despite their effectiveness for modeling light flow, obtaining light fields
requires either substantial computation or specialized devices such as a bulky
multi-camera setup or a microlens array. To broaden their benefit and
applicability, in this paper we propose a novel view synthesis method for
light field generation from only single images, named inverse image-based
rendering. Unlike previous attempts that implicitly rebuild 3D geometry or
explicitly represent the scene, our method reconstructs light flows in space
from image pixels, operating in the opposite direction to image-based
rendering. To accomplish this, we design a neural rendering pipeline that
renders a target ray at an arbitrary viewpoint. Our
neural renderer first stores the light flow of source rays from the input
image, then computes the relationships among them through cross-attention, and
finally predicts the color of the target ray based on these relationships.
After the rendering pipeline generates the first novel view from a single input
image, the generated out-of-view contents are updated to the set of source
rays. This procedure is iteratively performed while ensuring the consistent
generation of occluded contents. We demonstrate that our inverse image-based
rendering works well on various challenging datasets without any retraining
or fine-tuning once trained on a synthetic dataset, and outperforms relevant
state-of-the-art novel view synthesis methods.
☆ Physics-Guided Fusion for Robust 3D Tracking of Fast Moving Small Objects
While computer vision has advanced considerably for general object detection
and tracking, the specific problem of fast-moving tiny objects remains
underexplored. This paper addresses the significant challenge of detecting and
tracking rapidly moving small objects using an RGB-D camera. Our novel system
combines deep learning-based detection with physics-based tracking to overcome
the limitations of existing approaches. Our contributions include: (1) a
comprehensive system design for object detection and tracking of fast-moving
small objects in 3D space, (2) an innovative physics-based tracking algorithm
that integrates kinematics motion equations to handle outliers and missed
detections, and (3) an outlier detection and correction module that
significantly improves tracking performance in challenging scenarios such as
occlusions and rapid direction changes. We evaluated our proposed system on a
custom racquetball dataset. Our evaluation shows that our system surpasses
Kalman-filter-based trackers with up to 70\% lower Average Displacement Error. Our
system has significant applications for improving robot perception on
autonomous platforms and demonstrates the effectiveness of combining
physics-based models with deep learning approaches for real-time 3D detection
and tracking of challenging small objects.
comment: 13 pages, 6 figures
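A minimal sketch of the physics-guided idea: between detections, a constant-acceleration (gravity) kinematics model predicts the ball's state, and detections that deviate too far from the prediction are treated as outliers. The threshold and the simple rejection rule are placeholders; the paper's tracker and correction module are more elaborate.

```python
import numpy as np

def predict_ballistic(p, v, dt, g=np.array([0.0, 0.0, -9.81])):
    """Constant-acceleration (gravity) prediction used to bridge missed detections.

    p, v: 3D position and velocity estimates; dt: time since the last detection.
    This is a generic kinematics sketch, not the paper's exact tracker.
    """
    p_next = p + v * dt + 0.5 * g * dt ** 2
    v_next = v + g * dt
    return p_next, v_next

def correct_outlier(measurement, prediction, max_jump=0.5):
    """Reject a detection that deviates too far from the physics prediction."""
    if np.linalg.norm(measurement - prediction) > max_jump:
        return prediction   # treat as an outlier; fall back to the model
    return measurement
```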
☆ Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning
Gabriel Y. Arteaga, Marius Aasan, Rwiddhi Chakraborty, Martine Hjelkrem-Tan, Thalles Silva, Michael Kampffmeyer, Adín Ramírez Rivera
Prototypical self-supervised learning methods consistently suffer from
partial prototype collapse, where multiple prototypes converge to nearly
identical representations. This undermines their central purpose -- providing
diverse and informative targets to guide encoders toward rich representations
-- and has led practitioners to over-parameterize prototype sets or add ad-hoc
regularizers, which mitigate symptoms rather than address the root cause. We
empirically trace the collapse to the joint optimization of encoders and
prototypes, which encourages a type of shortcut learning: early in training
prototypes drift toward redundant representations that minimize loss without
necessarily enhancing representation diversity. To break the joint
optimization, we introduce a fully decoupled training strategy that learns
prototypes and encoders under separate objectives. Concretely, we model
prototypes as a Gaussian mixture updated with an online EM-style procedure,
independent of the encoder's loss. This simple yet principled decoupling
eliminates prototype collapse without explicit regularization and yields
consistently diverse prototypes and stronger downstream performance.
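The decoupling described above can be sketched as an online EM-style update of Gaussian-mixture prototypes that depends only on the current embeddings, not on the encoder's loss. Spherical components, a temperature-scaled softmax E-step, and EMA smoothing are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def online_em_step(z, means, momentum=0.99, tau=0.1):
    """One EM-style prototype update, decoupled from the encoder objective.

    z:     (B, D) embeddings from the encoder (detached; no gradient flows back)
    means: (K, D) mixture-component means used as prototypes
    """
    z = F.normalize(z.detach(), dim=-1)
    # E-step: soft responsibilities from cosine similarity (spherical components assumed).
    resp = F.softmax(z @ means.t() / tau, dim=-1)            # (B, K)
    # M-step: responsibility-weighted means of the assigned embeddings.
    counts = resp.sum(dim=0).clamp_min(1e-6)                 # (K,)
    batch_means = (resp.t() @ z) / counts.unsqueeze(-1)      # (K, D)
    # Smooth with an exponential moving average so prototypes evolve slowly.
    means = momentum * means + (1 - momentum) * batch_means
    return F.normalize(means, dim=-1)
```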
☆ BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G. Campolongo, Matthew J. Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
This work investigates descriptive captions as an additional source of
supervision for biological multimodal foundation models. Images and captions
can be viewed as complementary samples from the latent morphospace of a
species, each capturing certain biological traits. Incorporating captions
during training encourages alignment with this shared latent structure,
emphasizing potentially diagnostic characters while suppressing spurious
correlations. The main challenge, however, lies in obtaining faithful,
instance-specific captions at scale. This requirement has limited the
utilization of natural language supervision in organismal biology compared with
many other scientific domains. We fill this gap by generating synthetic
captions with multimodal large language models (MLLMs), guided by
Wikipedia-derived visual information and taxon-tailored format examples. These
domain-specific contexts help reduce hallucination and yield accurate,
instance-based descriptive captions. Using these captions, we train BIOCAP
(i.e., BIOCLIP with Captions), a biological foundation model that captures rich
semantics and achieves strong performance in species classification and
text-image retrieval. These results demonstrate the value of descriptive
captions beyond labels in bridging biological images with multimodal foundation
models.
comment: Project page: https://imageomics.github.io/biocap/
☆ StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback IEEE
Although recent advancements in diffusion models have significantly enriched
the quality of generated images, challenges remain in synthesizing pixel-based
human-drawn sketches, a representative example of abstract expression. To
combat these challenges, we propose StableSketcher, a novel framework that
empowers diffusion models to generate hand-drawn sketches with high prompt
fidelity. Within this framework, we fine-tune the variational autoencoder to
optimize latent decoding, enabling it to better capture the characteristics of
sketches. In parallel, we integrate a new reward function for reinforcement
learning based on visual question answering, which improves text-image
alignment and semantic consistency. Extensive experiments demonstrate that
StableSketcher generates sketches with improved stylistic fidelity, achieving
better alignment with prompts compared to the Stable Diffusion baseline.
Additionally, we introduce SketchDUO, to the best of our knowledge, the first
dataset comprising instance-level sketches paired with captions and
question-answer pairs, thereby addressing the limitations of existing datasets
that rely on image-label pairs. Our code and dataset will be made publicly
available upon acceptance.
comment: Under review at IEEE Access. Author-submitted preprint. Not the
IEEE-published version
☆ Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency
Self-attention (SA) has become the cornerstone of modern vision backbones for
its powerful expressivity over traditional Convolutions (Conv). However, its
quadratic complexity remains a critical bottleneck for practical applications.
Given that Conv offers linear complexity and strong visual priors, continued
efforts have been made to promote the renaissance of Conv. However, a
persistent performance chasm remains, highlighting that these modernizations
have not yet captured the intrinsic expressivity that defines SA. In this
paper, we re-examine the design of CNNs, guided by a key question: what
principles give SA its edge over Conv? As a result, we reveal two fundamental
insights that challenge the long-standing design intuitions in prior research
(e.g., receptive field). The two findings are: (1) \textit{Adaptive routing}:
SA dynamically regulates positional information flow according to semantic
content, whereas Conv employs static kernels uniformly across all positions.
(2) \textit{Lateral inhibition}: SA induces score competition among token
weighting, effectively suppressing redundancy and sharpening representations,
whereas Conv filters lack such inhibitory dynamics and exhibit considerable
redundancy. Based on this, we propose \textit{Attentive Convolution} (ATConv),
a principled reformulation of the convolutional operator that intrinsically
injects these principles. Interestingly, with only $3\times3$ kernels, ATConv
consistently outperforms various SA mechanisms in fundamental vision tasks.
Building on ATConv, we introduce AttNet, a CNN family that can attain
\textbf{84.4\%} ImageNet-1K Top-1 accuracy with only 27M parameters. In
diffusion-based image generation, replacing all SA with the proposed $3\times
3$ ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 in 400k steps with faster
sampling. Code is available at: github.com/price112/Attentive-Convolution.
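The two principles, adaptive routing and lateral inhibition, can be illustrated with a toy content-adaptive 3x3 convolution in which per-position tap weights are predicted from the input and compete through a softmax. This sketch only mirrors the idea at a high level and is not the ATConv operator itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveConvSketch(nn.Module):
    """Toy 3x3 convolution with content-adaptive, softmax-competing tap weights."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        # Predict one weight per spatial tap at every position (adaptive routing).
        self.route = nn.Conv2d(channels, k * k, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Softmax across the k*k taps introduces competition (lateral inhibition).
        weights = F.softmax(self.route(x), dim=1)             # (B, k*k, H, W)
        patches = F.unfold(x, self.k, padding=self.k // 2)    # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h, w)
        out = (patches * weights.unsqueeze(1)).sum(dim=2)     # (B, C, H, W)
        return self.proj(out)

y = AttentiveConvSketch(32)(torch.randn(2, 32, 16, 16))
```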
☆ Endoshare: A Source Available Solution to De-Identify and Manage Surgical Videos
Video-based assessment and surgical data science can advance surgical
training, research, and quality improvement. However, widespread use remains
limited by heterogeneous recording formats and privacy concerns associated with
video sharing. We present Endoshare, a source-available, cross-platform
application for merging, standardizing, and de-identifying endoscopic videos in
minimally invasive surgery. Development followed the software development life
cycle with iterative, user-centered feedback. During the analysis phase, an
internal survey of clinicians and computer scientists based on ten usability
heuristics identified key requirements that guided a privacy-by-design
architecture. In the testing phase, an external clinician survey combined the
same heuristics with Technology Acceptance Model constructs to assess usability
and adoption, complemented by benchmarking across different hardware
configurations. Four clinicians and four computer scientists initially tested
the prototype, reporting high usability (4.68 +/- 0.40/5 and 4.03 +/- 0.51/5),
with the lowest score (4.00 +/- 0.93/5) relating to label clarity. After
refinement, the testing phase surveyed ten surgeons who reported high perceived
usefulness (5.07 +/- 1.75/7), ease of use (5.15 +/- 1.71/7), heuristic
usability (4.38 +/- 0.48/5), and strong recommendation (9.20 +/- 0.79/10).
Processing time varied with processing mode, video duration (both p <= 0.001),
and machine computational power (p = 0.041). Endoshare provides a transparent,
user-friendly pipeline for standardized, privacy-preserving surgical video
management. Compliance certification and broader interoperability validation
are needed to establish it as a deployable alternative to proprietary systems.
The software is available at https://camma-public.github.io/Endoshare/
comment: 13 pages, 6 figures. Source-available software:
https://camma-public.github.io/Endoshare/
♻ ☆ DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
Drag-based image editing has long suffered from distortions in the target
region, largely because the priors of earlier base models, Stable Diffusion,
are insufficient to project optimized latents back onto the natural image
manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow
matching (e.g., SD3.5, FLUX), generative priors have become significantly
stronger, enabling advances across diverse editing tasks. However, drag-based
editing has yet to benefit from these stronger priors. This work proposes the
first framework to effectively harness FLUX's rich prior for drag-based
editing, dubbed DragFlow, achieving substantial gains over baselines. We first
show that directly applying point-based drag editing to DiTs performs poorly:
unlike the highly compressed features of UNets, DiT features are insufficiently
structured to provide reliable guidance for point-wise motion supervision. To
overcome this limitation, DragFlow introduces a region-based editing paradigm,
where affine transformations enable richer and more consistent feature
supervision. Additionally, we integrate pretrained open-domain personalization
adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving
background fidelity through gradient mask-based hard constraints. Multimodal
large language models (MLLMs) are further employed to resolve task ambiguities.
For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench)
featuring region-level dragging instructions. Extensive experiments on
DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and
region-based baselines, setting a new state-of-the-art in drag-based image
editing. Code and datasets will be publicly available upon publication.
comment: Preprint
♻ ☆ Watermarking Autoregressive Image Generation NeurIPS 2025
Watermarking the outputs of generative models has emerged as a promising
approach for tracking their provenance. Despite significant interest in
autoregressive image generation models and their potential for misuse, no prior
work has attempted to watermark their outputs at the token level. In this work,
we present the first such approach by adapting language model watermarking
techniques to this setting. We identify a key challenge: the lack of reverse
cycle-consistency (RCC), wherein re-tokenizing generated image tokens
significantly alters the token sequence, effectively erasing the watermark. To
address this and to make our method robust to common image transformations,
neural compression, and removal attacks, we introduce (i) a custom
tokenizer-detokenizer finetuning procedure that improves RCC, and (ii) a
complementary watermark synchronization layer. As our experiments demonstrate,
our approach enables reliable and robust watermark detection with theoretically
grounded p-values. Code and models are available at
https://github.com/facebookresearch/wmar.
comment: NeurIPS 2025
♻ ☆ Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector
Deepfakes, which employ GANs to produce highly realistic facial modifications,
have become the prevailing method of facial manipulation. Traditional CNNs can
identify bogus media, but they struggle to generalize across datasets
and are vulnerable to adversarial attacks due to their lack of robustness.
Vision transformers have demonstrated potential for image classification
problems, but they require large amounts of training data. Motivated by
these limitations, this publication introduces Tex-ViT (Texture-Vision
Transformer), which enhances CNN features by combining ResNet with a vision
transformer. The model combines traditional ResNet features with a texture
module that operates in parallel on sections of ResNet before each
down-sampling operation. The texture module then serves as an input to the dual
branch of the cross-attention vision transformer. It specifically focuses on
improving the global texture module, which extracts feature map correlation.
Empirical analysis reveals that manipulated images exhibit smooth textures that
do not remain consistent over long distances. Experiments were
performed on different categories of FF++, such as DF, F2F, FS, and NT,
together with other GAN-generated datasets in cross-domain scenarios.
Further experiments on the FF++, DFDCPreview, and Celeb-DF
datasets covered several post-processing conditions, such as blurring,
compression, and noise. The model surpassed the most advanced models in terms
of generalization, achieving 98% accuracy in cross-domain scenarios. This
demonstrates its ability to learn the shared distinguishing textural
characteristics in the manipulated samples. These experiments provide evidence
that the proposed model is capable of being applied to various situations and
is resistant to many post-processing procedures.
♻ ☆ GenLit: Reformulating Single-Image Relighting as Video Generation
Manipulating the illumination of a 3D scene within a single image represents
a fundamental challenge in computer vision and graphics. This problem has
traditionally been addressed using inverse rendering techniques, which involve
explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile,
recent advancements in visual foundation models suggest that a new paradigm
could soon be possible -- one that replaces explicit physical models with
networks that are trained on large amounts of image and video data. In this
paper, we exploit the implicit scene understanding of a video diffusion model,
particularly Stable Video Diffusion, to relight a single image. We introduce
GenLit, a framework that distills the ability of a graphics engine to perform
light manipulation into a video-generation model, enabling users to directly
insert and manipulate a point light in the 3D world within a given image and
generate results directly as a video sequence. We find that a model fine-tuned
on only a small synthetic dataset generalizes to real-world scenes, enabling
single-image relighting with plausible and convincing shadows and
inter-reflections. Our results highlight the ability of video foundation models
to capture rich information about lighting, material, and shape, and our
findings indicate that such models, with minimal training, can be used to
perform relighting without explicit asset reconstruction or ray-tracing.
Project page: https://genlit.is.tue.mpg.de/.
♻ ☆ mmWalk: Towards Multi-modal Multi-view Walking Assistance NeurIPS 2025
Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
Walking assistance in extreme or complex environments remains a significant
challenge for people with blindness or low vision (BLV), largely due to the
lack of a holistic scene understanding. Motivated by the real-world needs of
the BLV community, we build mmWalk, a simulated multi-modal dataset that
integrates multi-view sensor and accessibility-oriented features for outdoor
safe navigation. Our dataset comprises 120 manually controlled,
scenario-categorized walking trajectories with 62k synchronized frames. It
contains over 559k panoramic images across RGB, depth, and semantic modalities.
Furthermore, to emphasize real-world relevance, each trajectory involves
outdoor corner cases and accessibility-specific landmarks for BLV users.
Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual
question-answer triplets across 9 categories tailored for safe and informed
walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs)
in zero- and few-shot settings and find that they struggle with our risk
assessment and navigational tasks. We validate our mmWalk-finetuned model on
real-world datasets and show the effectiveness of our dataset for advancing
multi-modal walking assistance.
comment: Accepted by NeurIPS 2025 Datasets and Benchmarks Track. Data and
Code: https://github.com/KediYing/mmWalk
♻ ☆ Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning
When reinforcement learning--typically GRPO--is applied to large
vision-language model reasoning, it struggles to effectively scale reasoning
length or generates verbose outputs across all tasks with only marginal gains
in accuracy. To address this issue, we present FAST-GRPO, a variant of GRPO that
dynamically adapts reasoning depth based on question characteristics. Through
empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs
by investigating how response length and data distribution affect performance.
Inspired by these observations, we introduce two complementary metrics to
estimate the difficulty of the questions, guiding the model to determine when
fast or slow thinking is more appropriate. Next, we incorporate adaptive
length-based rewards and difficulty-aware KL divergence into the GRPO
algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST
achieves state-of-the-art accuracy with over 10\% relative improvement compared
to the base model, while reducing token usage by 32.7-67.3\% compared to
previous slow-thinking approaches, effectively balancing reasoning length and
accuracy.
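One way to picture the adaptive length-based reward and difficulty-aware KL weighting is the toy functions below, where an estimated difficulty in [0, 1] sets a token budget and relaxes the KL constraint for harder questions. The specific functional forms and constants are assumptions of this sketch, not the paper's formulation.

```python
def fast_reward(correct, n_tokens, difficulty,
                target_easy=128, target_hard=1024, alpha=0.1):
    """Hedged sketch of an adaptive length-based reward (constants are assumed).

    correct:    1.0 if the final answer is right, else 0.0
    n_tokens:   length of the generated reasoning
    difficulty: estimated question difficulty in [0, 1] (0 = easy, 1 = hard)
    Easy questions are nudged toward short (fast) answers, hard questions are
    allowed long (slow) reasoning, on top of the usual correctness reward.
    """
    target_len = target_easy + difficulty * (target_hard - target_easy)
    length_penalty = alpha * max(0.0, n_tokens - target_len) / target_len
    return correct - length_penalty

def kl_weight(difficulty, beta=0.05):
    """A difficulty-aware KL weight could shrink for hard questions, letting the
    policy deviate more from the reference model when slow thinking is needed."""
    return beta * (1.0 - difficulty)
```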
♻ ☆ Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans
With the growing volume of CT examinations, there is an increasing demand for
automated tools such as organ segmentation, abnormality detection, and report
generation to support radiologists in managing their clinical workload.
Multi-label classification of 3D Chest CT scans remains a critical yet
challenging problem due to the complex spatial relationships inherent in
volumetric data and the wide variability of abnormalities. Existing methods
based on 3D convolutional neural networks struggle to capture long-range
dependencies, while Vision Transformers often require extensive pre-training on
large-scale, domain-specific datasets to perform competitively. In this work,
we propose a 2.5D alternative by introducing a new
graph-based framework that represents 3D CT volumes as structured graphs, where
axial slice triplets serve as nodes processed through spectral graph
convolution, enabling the model to reason over inter-slice dependencies while
maintaining complexity compatible with clinical deployment. Our method, trained
and evaluated on 3 datasets from independent institutions, achieves strong
cross-dataset generalization, and shows competitive performance compared to
state-of-the-art visual encoders. We further conduct comprehensive ablation
studies to evaluate the impact of various aggregation strategies,
edge-weighting schemes, and graph connectivity patterns. Additionally, we
demonstrate the broader applicability of our approach through transfer
experiments on automated radiology report generation and abdominal CT data.
comment: 24 pages, 15 figures
♻ ☆ FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation
Subject-driven image generation aims to synthesize novel scenes that
faithfully preserve subject identity from reference images while adhering to
textual guidance. However, existing methods struggle with a critical trade-off
between fidelity and efficiency. Tuning-based approaches rely on time-consuming
and resource-intensive, subject-specific optimization, while zero-shot methods
often fail to maintain adequate subject consistency. In this work, we propose
FreeGraftor, a training-free framework that addresses these limitations through
cross-image feature grafting. Specifically, FreeGraftor leverages semantic
matching and position-constrained attention fusion to transfer visual details
from reference subjects to the generated images. Additionally, our framework
introduces a novel noise initialization strategy to preserve the geometry
priors of reference subjects, facilitating robust feature matching. Extensive
qualitative and quantitative experiments demonstrate that our method enables
precise subject identity transfer while maintaining text-aligned scene
synthesis. Without requiring model fine-tuning or additional training,
FreeGraftor significantly outperforms existing zero-shot and training-free
approaches in both subject fidelity and text alignment. Furthermore, our
framework can seamlessly extend to multi-subject generation, making it
practical for real-world deployment. Our code is available at
https://github.com/Nihukat/FreeGraftor.
comment: Code: https://github.com/Nihukat/FreeGraftor
♻ ☆ Uncovering Anomalous Events for Marine Environmental Monitoring via Visual Anomaly Detection
Underwater video monitoring is a promising strategy for assessing marine
biodiversity, but the vast volume of uneventful footage makes manual inspection
highly impractical. In this work, we explore the use of visual anomaly
detection (VAD) based on deep neural networks to automatically identify
interesting or anomalous events. We introduce AURA, the first multi-annotator
benchmark dataset for underwater VAD, and evaluate four VAD models across two
marine scenes. We demonstrate the importance of robust frame selection
strategies to extract meaningful video segments. Our comparison against
multiple annotators reveals that VAD performance of current models varies
dramatically and is highly sensitive to both the amount of training data and
the variability in visual content that defines "normal" scenes. Our results
highlight the value of soft and consensus labels and offer a practical approach
for supporting scientific exploration and scalable biodiversity monitoring.
♻ ☆ X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation
Large Language Models (LLMs) have been shown to enhance the effectiveness of
enriching item descriptions, thereby improving the accuracy of recommendation
systems. However, most existing approaches either rely on text-only prompting
or employ basic multimodal strategies that do not fully exploit the
complementary information available from both textual and visual modalities.
This paper introduces a novel framework, Cross-Reflection Prompting, termed
X-Reflect, designed to address these limitations by prompting Multimodal Large
Language Models (MLLMs) to explicitly identify and reconcile supportive and
conflicting information between text and images. By capturing nuanced insights
from both modalities, this approach generates more comprehensive and
contextually rich item representations. Extensive experiments conducted on two
widely used benchmarks demonstrate that our method outperforms existing
prompting baselines in downstream recommendation accuracy. Furthermore, we
identify a U-shaped relationship between text-image dissimilarity and
recommendation performance, suggesting the benefit of applying multimodal
prompting selectively. To support efficient real-time inference, we also
introduce X-Reflect-keyword, a lightweight variant that summarizes image
content using keywords and replaces the base model with a smaller backbone,
achieving nearly 50% reduction in input length while maintaining competitive
performance. This work underscores the importance of integrating multimodal
information and presents an effective solution for improving item understanding
in multimodal recommendation systems.
♻ ☆ CALM-PDE: Continuous and Adaptive Convolutions for Latent Space Modeling of Time-dependent PDEs NeurIPS
Solving time-dependent Partial Differential Equations (PDEs) using a densely
discretized spatial domain is a fundamental problem in various scientific and
engineering disciplines, including modeling climate phenomena and fluid
dynamics. However, performing these computations directly in the physical space
often incurs significant computational costs. To address this issue, several
neural surrogate models have been developed that operate in a compressed latent
space to solve the PDE. While these approaches reduce computational complexity,
they often use Transformer-based attention mechanisms to handle irregularly
sampled domains, resulting in increased memory consumption. In contrast,
convolutional neural networks allow memory-efficient encoding and decoding but
are limited to regular discretizations. Motivated by these considerations, we
propose CALM-PDE, a model class that efficiently solves arbitrarily discretized
PDEs in a compressed latent space. We introduce a novel continuous
convolution-based encoder-decoder architecture that uses an
epsilon-neighborhood-constrained kernel and learns to apply the convolution
operator to adaptive and optimized query points. We demonstrate the
effectiveness of CALM-PDE on a diverse set of PDEs with both regularly and
irregularly sampled spatial domains. CALM-PDE is competitive with or
outperforms existing baseline methods while offering significant improvements
in memory and inference time efficiency compared to Transformer-based methods.
comment: Accepted for publication at the 39th Conference on Neural Information
Processing Systems (NeurIPS) 2025, San Diego, California, USA
♻ ☆ REOBench: Benchmarking Robustness of Earth Observation Foundation Models
Xiang Li, Yong Tao, Siyuan Zhang, Siwei Liu, Zhitong Xiong, Chunbo Luo, Lu Liu, Mykola Pechenizkiy, Xiao Xiang Zhu, Tianjin Huang
Earth observation foundation models have shown strong generalization across
multiple Earth observation tasks, but their robustness under real-world
perturbations remains underexplored. To bridge this gap, we introduce REOBench,
the first comprehensive benchmark for evaluating the robustness of Earth
observation foundation models across six tasks and twelve types of image
corruptions, including both appearance-based and geometric perturbations. To
ensure realistic and fine-grained evaluation, our benchmark focuses on
high-resolution optical remote sensing images, which are widely used in
critical applications such as urban planning and disaster response. We conduct
a systematic evaluation of a broad range of models trained using masked image
modeling, contrastive learning, and vision-language pre-training paradigms. Our
results reveal that (1) existing Earth observation foundation models experience
significant performance degradation when exposed to input corruptions. (2) The
severity of degradation varies across tasks, model architectures, backbone
sizes, and types of corruption, with performance drop varying from less than 1%
to over 20%. (3) Vision-language models show enhanced robustness, particularly
in multimodal tasks. REOBench underscores the vulnerability of current Earth
observation foundation models to real-world corruptions and provides actionable
insights for developing more robust and reliable models. Code and data are
publicly available at https://github.com/lx709/REOBench.
comment: Accepted to NeurIPS 2025 D&B Track
♻ ☆ BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning NeurIPS 2025
Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
Foundation models trained at scale exhibit remarkable emergent behaviors,
learning new capabilities beyond their initial training objectives. We find
such emergent behaviors in biological vision models via large-scale contrastive
vision-language training. To achieve this, we first curate TreeOfLife-200M,
comprising 214 million images of living organisms, the largest and most diverse
biological organism image dataset to date. We then train BioCLIP 2 on
TreeOfLife-200M to distinguish different species. Despite the narrow training
objective, BioCLIP 2 yields extraordinary accuracy when applied to various
biological visual tasks such as habitat classification and trait prediction. We
identify emergent properties in the learned embedding space of BioCLIP 2. At
the inter-species level, the embedding distribution of different species aligns
closely with functional and ecological meanings (e.g., beak sizes and
habitats). At the intra-species level, instead of being diminished, the
intra-species variations (e.g., life stages and sexes) are preserved and better
separated in subspaces orthogonal to inter-species distinctions. We provide
formal proof and analyses to explain why hierarchical supervision and
contrastive objectives encourage these emergent properties. Crucially, our
results reveal that these properties become increasingly significant with
larger-scale training data, leading to a biologically meaningful embedding
space.
comment: NeurIPS 2025 Spotlight; Project page:
https://imageomics.github.io/bioclip-2/
♻ ☆ Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Learning with Verifiable Rewards (RLVR) has recently
demonstrated notable success in enhancing the reasoning performance of large
language models (LLMs), particularly on mathematics and programming tasks.
Similar to how traditional RL helps agents explore and learn new strategies,
RLVR is believed to enable LLMs to continuously self-improve, thus acquiring
novel reasoning abilities beyond those of the corresponding base models. In
this study we critically examine the current state of RLVR by systematically
probing the reasoning capability boundaries of RLVR-trained LLMs across various
model families, RL algorithms, and math, coding, and visual reasoning
benchmarks, using pass@k at large k values as the evaluation metric.
Surprisingly, we find that the current training setup does not elicit
fundamentally new reasoning patterns. While RLVR-trained models outperform
their base models at small k (e.g., k = 1), the base models achieve a higher
pass@k score when k is large. Coverage and perplexity analyses show that the
observed reasoning abilities originate from and are bounded by the base model.
Treating the base model as an upper bound, our quantitative analysis shows that
six popular RLVR algorithms perform similarly and remain far from optimal in
leveraging the potential of the base model. By contrast, we find that
distillation can introduce new reasoning patterns from the teacher and
genuinely expand the model's reasoning capabilities. Overall, our findings
suggest that current RLVR methods have not yet realized the potential of RL to
elicit truly novel reasoning abilities in LLMs. This highlights the need for
improved RL paradigms, such as continual scaling and multi-turn
agent-environment interaction, to unlock this potential.
comment: 30 pages, 27 figures
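For reference, pass@k at large k is typically computed with the unbiased estimator of Chen et al. (2021) from n samples per problem, c of which are correct; a small sketch is given below (the study's exact evaluation setup may differ).

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator, used to probe the reasoning boundary at large k.

    n: samples drawn per problem, c: number of correct samples, k <= n.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: even 4 correct answers out of 256 samples give a high pass@128.
print(pass_at_k(256, 4, 128))
```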
♻ ☆ Residual Kolmogorov-Arnold Network for Enhanced Deep Learning
Despite their immense success, deep convolutional neural networks (CNNs) can
be difficult to optimize and costly to train due to hundreds of layers within
the network depth. Conventional convolutional operations are fundamentally
limited by their linear nature along with fixed activations, where many layers
are needed to learn meaningful patterns in data. Because of the sheer size of
these networks, this approach is simply computationally inefficient, and poses
overfitting or gradient explosion risks, especially in small datasets. As a
result, we introduce a "plug-in" module, called Residual Kolmogorov-Arnold
Network (RKAN). Our module is highly compact, so it can be easily added into
any stage (level) of traditional deep networks, where it learns to integrate
supportive polynomial feature transformations to existing convolutional
frameworks. RKAN offers consistent improvements over baseline models in
different vision tasks and widely tested benchmarks, accomplishing cutting-edge
performance on them.
comment: Code is available at https://github.com/withray/residualKAN.git
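As a rough sketch of the plug-in idea, the block below adds Chebyshev-polynomial feature transformations to a convolutional stage through a residual connection. The choice of basis, the tanh squashing, and the pointwise mixing are assumptions; the actual RKAN module may be wired differently.

```python
import torch
import torch.nn as nn

class ResidualChebyshevBlock(nn.Module):
    """Residual 'plug-in' block adding polynomial feature transformations
    (a simplification of the RKAN idea, not the authors' exact design)."""

    def __init__(self, channels, degree=3):
        super().__init__()
        self.degree = degree
        # One pointwise conv mixes the stacked Chebyshev features back to C channels.
        self.mix = nn.Conv2d(channels * (degree + 1), channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x):
        t = torch.tanh(x)                 # map features into [-1, 1] for the basis
        basis = [torch.ones_like(t), t]
        for _ in range(2, self.degree + 1):
            basis.append(2 * t * basis[-1] - basis[-2])   # Chebyshev recurrence
        poly = torch.cat(basis, dim=1)
        return x + self.norm(self.mix(poly))              # residual connection
```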
♻ ☆ A novel attention mechanism for noise-adaptive and robust segmentation of microtubules in microscopy images
Segmenting cytoskeletal filaments in microscopy images is essential for
understanding their cellular roles but remains challenging, especially in
dense, complex networks and under noisy or low-contrast image conditions. While
deep learning has advanced image segmentation, performance often degrades in
these adverse scenarios. Additional challenges include the difficulty of
obtaining accurate annotations and managing severe class imbalance. We propose
a novel noise-adaptive attention mechanism, extending the
Squeeze-and-Excitation (SE) module, to dynamically adjust to varying noise
levels. This Adaptive SE (ASE) mechanism is integrated into a U-Net decoder,
with residual encoder blocks, forming a lightweight yet powerful model:
ASE_Res_U-Net. We also developed a synthetic-dataset strategy and employed
tailored loss functions and evaluation metrics to mitigate class imbalance and
ensure fair assessment. ASE_Res_U-Net effectively segmented microtubules in
both synthetic and real noisy images, outperforming its ablated variants and
state-of-the-art curvilinear-structure segmentation methods. It achieved this
while using fewer parameters, making it suitable for resource-constrained
environments. Importantly, ASE_Res_U-Net generalised well to other curvilinear
structures (blood vessels and nerves) under diverse imaging conditions.
Availability and implementation: Original microtubule datasets (synthetic and
real noisy images) are available on Zenodo (DOIs: 10.5281/zenodo.14696279 and
10.5281/zenodo.15852660). The ASE_Res_U-Net model will be shared upon publication.
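A minimal guess at what a noise-adaptive SE gate could look like: alongside the usual channel means, a per-channel standard deviation acts as a crude noise proxy, and both drive the channel re-weighting. This is an illustrative sketch, not the authors' ASE design.

```python
import torch
import torch.nn as nn

class AdaptiveSE(nn.Module):
    """Illustrative noise-adaptive Squeeze-and-Excitation block (assumed design)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        # Channel mean (content) and channel std (a crude per-image noise proxy)
        # jointly drive the gating, so re-weighting adapts to the noise level.
        mean = x.mean(dim=(2, 3))
        std = x.std(dim=(2, 3))
        gate = self.fc(torch.cat([mean, std], dim=1))
        return x * gate.unsqueeze(-1).unsqueeze(-1)

out = AdaptiveSE(32)(torch.randn(2, 32, 64, 64))
```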
♻ ☆ Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models
Xinmiao Huang, Qisong He, Zhenglin Huang, Boxuan Wang, Zhuoyun Li, Guangliang Cheng, Yi Dong, Xiaowei Huang
Spatial reasoning ability is crucial for Vision Language Models (VLMs) to
support real-world applications in diverse domains including robotics,
augmented reality, and autonomous navigation. Unfortunately, existing
benchmarks are inadequate in assessing spatial reasoning ability, especially
the \emph{intrinsic-dynamic} spatial reasoning which is a fundamental aspect of
human spatial cognition. In this paper, we propose a unified benchmark,
\textbf{Spatial-DISE}, based on a cognitively grounded taxonomy that
categorizes tasks into four fundamental quadrants:
\textbf{I}ntrinsic-\textbf{S}tatic, Intrinsic-\textbf{D}ynamic,
\textbf{E}xtrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover,
to address the issue of data scarcity, we develop a scalable and automated
pipeline to generate diverse and verifiable spatial reasoning questions,
resulting in a new \textbf{Spatial-DISE} dataset that includes Spatial-DISE
Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA
pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals
that current VLMs exhibit a large and consistent gap to human competence,
especially on multi-step multi-view spatial reasoning. Spatial-DISE offers a
robust framework, valuable dataset, and clear direction for future research
toward human-like spatial intelligence. Benchmark, dataset, and code will be
publicly released.
comment: Project Page: https://shinmohuang.github.io/spatialdise_page/
♻ ☆ BevSplat: Resolving Height Ambiguity via Feature-Based Gaussian Primitives for Weakly-Supervised Cross-View Localization
This paper addresses the problem of weakly supervised cross-view
localization, where the goal is to estimate the pose of a ground camera
relative to a satellite image with noisy ground truth annotations. A common
approach to bridge the cross-view domain gap for pose estimation is Bird's-Eye
View (BEV) synthesis. However, existing methods struggle with height ambiguity
due to the lack of depth information in ground images and satellite height
maps. Previous solutions either assume a flat ground plane or rely on complex
models, such as cross-view transformers. We propose BevSplat, a novel method
that resolves height ambiguity by using feature-based Gaussian primitives. Each
pixel in the ground image is represented by a 3D Gaussian with semantic and
spatial features, which are synthesized into a BEV feature map for relative
pose estimation. Additionally, to address challenges with panoramic query
images, we introduce an icosphere-based supervision strategy for the Gaussian
primitives. We validate our method on the widely used KITTI and VIGOR datasets,
which include both pinhole and panoramic query images. Experimental results
show that BevSplat significantly improves localization accuracy over prior
approaches.
♻ ☆ PolyPose: Deformable 2D/3D Registration via Polyrigid Transformations NeurIPS 2025
Determining the 3D pose of a patient from a limited set of 2D X-ray images is
a critical task in interventional settings. While preoperative volumetric
imaging (e.g., CT and MRI) provides precise 3D localization and visualization
of anatomical targets, these modalities cannot be acquired during procedures,
where fast 2D imaging (X-ray) is used instead. To integrate volumetric guidance
into intraoperative procedures, we present PolyPose, a simple and robust method
for deformable 2D/3D registration. PolyPose parameterizes complex 3D
deformation fields as a composition of rigid transforms, leveraging the
biological constraint that individual bones do not bend in typical motion.
Unlike existing methods that either assume no inter-joint movement or fail
outright in this under-determined setting, our polyrigid formulation enforces
anatomically plausible priors that respect the piecewise-rigid nature of human
movement. This approach eliminates the need for expensive deformation
regularizers that require patient- and procedure-specific hyperparameter
optimization. Across extensive experiments on diverse datasets from orthopedic
surgery and radiotherapy, we show that this strong inductive bias enables
PolyPose to successfully align the patient's preoperative volume to as few as
two X-rays, thereby providing crucial 3D guidance in challenging sparse-view
and limited-angle settings where current registration methods fail. Additional
visualizations, tutorials, and code are available at
https://polypose.csail.mit.edu.
comment: NeurIPS 2025. Code available at
https://github.com/eigenvivek/polypose
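The polyrigid idea can be sketched as blending per-bone rigid transforms into a dense deformation with soft skinning-style weights, so each point moves almost rigidly with its nearest bones. Blending transformed points linearly (rather than composing transforms in log-space) is a simplification made for clarity, not the paper's exact parameterization.

```python
import torch

def polyrigid_warp(points, rigid_T, weights):
    """Blend per-bone rigid transforms into a dense deformation (toy sketch).

    points:  (N, 3)    voxel or surface coordinates
    rigid_T: (B, 4, 4) one rigid (SE(3)) transform per bone
    weights: (N, B)    soft assignment of each point to each bone (rows sum to 1)
    """
    homog = torch.cat([points, torch.ones_like(points[:, :1])], dim=1)  # (N, 4)
    # Apply every bone transform to every point, then take the convex combination.
    per_bone = torch.einsum('bij,nj->nbi', rigid_T, homog)[..., :3]     # (N, B, 3)
    return (weights.unsqueeze(-1) * per_bone).sum(dim=1)                # (N, 3)
```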
♻ ☆ Frequency-Dynamic Attention Modulation for Dense Prediction ICCV 2025
Vision Transformers (ViTs) have significantly advanced computer vision,
demonstrating strong performance across various tasks. However, the attention
mechanism in ViTs makes each layer function as a low-pass filter, and the
stacked-layer architecture in existing transformers suffers from frequency
vanishing. This leads to the loss of critical details and textures. We propose
a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention
Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly
modulates the overall frequency response of ViTs and consists of two
techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling
(FreqScale). Since circuit theory uses low-pass filters as fundamental
elements, we introduce AttInv, a method that generates complementary high-pass
filtering by inverting the low-pass filter in the attention matrix, and
dynamically combining the two. We further design FreqScale to weight different
frequency components for fine-grained adjustments to the target response
function. Through feature similarity analysis and effective rank evaluation, we
demonstrate that our approach avoids representation collapse, leading to
consistent performance improvements across various models, including SegFormer,
DeiT, and MaskDINO. These improvements are evident in tasks such as semantic
segmentation, object detection, and instance segmentation. Additionally, we
apply our method to remote sensing detection, achieving state-of-the-art
results in single-scale settings. The code is available at
https://github.com/Linwei-Chen/FDAM.
comment: Accepted by ICCV 2025
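The attention-inversion idea can be written compactly: since a row-stochastic attention matrix acts as a low-pass filter over tokens, its complement with respect to the identity passes the high-frequency residual, and the two responses are mixed. The single scalar mix below stands in for the finer-grained FreqScale weighting and is only a sketch.

```python
import torch

def attention_inversion(attn, x, alpha=1.0):
    """Sketch of the AttInv idea (mixing scalar is a placeholder).

    attn: (B, H, N, N) row-stochastic attention matrix (acts as a low-pass filter)
    x:    (B, H, N, C) value tokens
    """
    identity = torch.eye(attn.size(-1), device=attn.device)
    high_pass = identity - attn        # passes what the attention smooths away
    low = attn @ x
    high = high_pass @ x
    return low + alpha * high
```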
♻ ☆ A primal-dual algorithm for image reconstruction with input-convex neural network regularizers
We address the optimization problem in a data-driven variational
reconstruction framework, where the regularizer is parameterized by an
input-convex neural network (ICNN). While gradient-based methods are commonly
used to solve such problems, they struggle to effectively handle non-smooth
problems, which often leads to slow convergence. Moreover, the nested structure
of the neural network complicates the application of standard non-smooth
optimization techniques, such as proximal algorithms. To overcome these
challenges, we reformulate the problem and eliminate the network's nested
structure. By relating this reformulation to epigraphical projections of the
activation functions, we transform the problem into a convex optimization
problem that can be efficiently solved using a primal-dual algorithm. We also
prove that this reformulation is equivalent to the original variational
problem. Through experiments on several imaging tasks, we show that the
proposed approach not only outperforms subgradient methods and even accelerated
methods in the smooth setting, but also facilitates the training of the
regularizer itself.
♻ ☆ MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues
Recent advances in large language models have catalyzed the development of
multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified
frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to
general-purpose instruction-following models, a key frontier lies in evaluating
their multilingual and multimodal capabilities over both long and short
contexts. However, existing benchmarks fall short in evaluating these
dimensions jointly: they are often limited to English, mostly focus on one
single modality at a time, rely on short-form contexts, or lack human
annotations -- hindering comprehensive assessment of model performance across
languages, modalities, and task complexity. To address these gaps, we introduce
MCIF (Multimodal Crosslingual Instruction Following), the first multilingual
human-annotated benchmark based on scientific talks that is designed to
evaluate instruction-following in crosslingual, multimodal settings over both
short- and long-form inputs. MCIF spans three core modalities -- speech,
vision, and text -- and four diverse languages (English, German, Italian, and
Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret
instructions across languages and combine them with multimodal contextual
information. MCIF is released under a CC-BY 4.0 license to encourage open
research and progress in MLLMs development.
comment: Data available at https://huggingface.co/datasets/FBK-MT/MCIF |
Evaluation and baselines available at https://github.com/hlt-mt/mcif
♻ ☆ Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants NeurIPS 2025
Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, Weiran Xu
Faces and humans are crucial elements in social interaction and are widely
included in everyday photos and videos. Therefore, a deep understanding of
faces and humans will enable multi-modal assistants to achieve improved
response quality and broadened application scope. Currently, the multi-modal
assistant community lacks a comprehensive and scientific evaluation of face and
human understanding abilities. In this paper, we first propose a hierarchical
ability taxonomy that includes three levels of abilities. Then, based on this
taxonomy, we collect images and annotations from publicly available datasets in
the face and human community and build a semi-automatic data pipeline to
produce problems for the new benchmark. Finally, the obtained Face-Human-Bench
includes a development set and a test set, each with 1800 problems, supporting
both English and Chinese. We conduct evaluations over 25 mainstream multi-modal
large language models (MLLMs) with our Face-Human-Bench, focusing on the
correlation between abilities, the impact of the relative position of targets
on performance, and the impact of Chain of Thought (CoT) prompting on
performance. We also explore which abilities of MLLMs need to be supplemented
by specialist models. The dataset and evaluation code have been made publicly
available at https://face-human-bench.github.io.
comment: 50 pages, 14 figures, 42 tables. NeurIPS 2025 Datasets and Benchmarks
Track
♻ ☆ Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where
generated responses seem semantically plausible yet exhibit little or no
relevance to the input image. Previous studies reveal that this issue primarily
stems from LVLMs' over-reliance on language priors while disregarding the
visual information during decoding. To alleviate this issue, we introduce a
novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding
strategy, which adaptively strengthens the mutual dependency between generated
texts and input images to mitigate hallucinations. Unlike existing methods
solely focusing on text token sampling, we propose to jointly model the
contributions of visual and textual tokens to C-PMI, formulating hallucination
mitigation as a bi-level optimization problem aimed at maximizing mutual
information. To solve it, we design a token purification mechanism that
dynamically regulates the decoding process by sampling text tokens remaining
maximally relevant to the given image, while simultaneously refining image
tokens most pertinent to the generated response. Extensive experiments across
various benchmarks reveal that the proposed method significantly reduces
hallucinations in LVLMs while preserving decoding efficiency.
♻ ☆ Learning Dense Hand Contact Estimation from Imbalanced Data NeurIPS 2025
Hands are essential to human interaction, and exploring contact between hands
and the world can promote comprehensive understanding of their function.
Recently, there have been growing number of hand interaction datasets that
cover interaction with object, other hand, scene, and body. Despite the
significance of the task and increasing high-quality data, how to effectively
learn dense hand contact estimation remains largely underexplored. There are
two major challenges for learning dense hand contact estimation. First, there
is a class imbalance issue in hand contact datasets, where the majority of
regions are not in contact. Second, hand contact datasets contain a spatial
imbalance issue, with most hand contact exhibited at the fingertips, resulting
in challenges for generalization toward contacts in other hand regions. To
tackle these issues, we present a framework that learns dense HAnd COntact
estimation (HACO) from imbalanced data. To resolve the class imbalance issue,
we introduce balanced contact sampling, which builds and samples from multiple
sampling groups that fairly represent diverse contact statistics for both
contact and non-contact vertices. Moreover, to address the spatial imbalance
issue, we propose vertex-level class-balanced (VCB) loss, which incorporates
spatially varying contact distribution by separately reweighting loss
contribution of each vertex based on its contact frequency across dataset. As a
result, we effectively learn to predict dense hand contact estimation with
large-scale hand contact data without suffering from class and spatial
imbalance issues. The code is available at
https://github.com/dqj5182/HACO_RELEASE.
comment: Accepted at NeurIPS 2025. Project page: http://haco-release.github.io
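The vertex-level class-balanced loss can be pictured as a per-vertex reweighted binary cross-entropy, where the weight of each vertex depends on how often it is in contact across the dataset. The inverse-frequency weighting below is an illustrative choice, not necessarily the paper's exact scheme.

```python
import torch

def vcb_loss(pred, target, contact_freq, eps=1e-6):
    """Sketch of a vertex-level class-balanced contact loss (weighting assumed).

    pred:         (B, V) predicted contact probabilities per mesh vertex
    target:       (B, V) binary contact labels
    contact_freq: (V,)   fraction of training samples where each vertex is in contact
    """
    # Rarely contacted vertices receive larger positive weights, so fingertip-
    # dominated statistics do not wash out the rest of the hand.
    pos_w = 1.0 / (contact_freq + eps)
    neg_w = 1.0 / (1.0 - contact_freq + eps)
    weights = target * pos_w + (1.0 - target) * neg_w
    bce = -(target * torch.log(pred.clamp_min(eps))
            + (1.0 - target) * torch.log((1.0 - pred).clamp_min(eps)))
    return (weights * bce).mean()
```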
♻ ☆ HumanCM: One Step Human Motion Prediction
We present HumanCM, a one-step human motion prediction framework built upon
consistency models. Instead of relying on multi-step denoising as in
diffusion-based methods, HumanCM performs efficient single-step generation by
learning a self-consistent mapping between noisy and clean motion states. The
framework adopts a Transformer-based spatiotemporal architecture with temporal
embeddings to model long-range dependencies and preserve motion coherence.
Experiments on Human3.6M and HumanEva-I demonstrate that HumanCM achieves
comparable or superior accuracy to state-of-the-art diffusion models while
reducing inference steps by up to two orders of magnitude.
comment: 6 pages, 3 figures, 2 tables
♻ ☆ Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
Large Vision-Language Models (LVLMs) have shown impressive performance across
multi-modal tasks by encoding images into thousands of tokens. However, the
large number of image tokens results in significant computational overhead, and
the use of dynamic high-resolution inputs further increases this burden.
Previous approaches have attempted to reduce the number of image tokens through
token pruning, typically by selecting tokens based on attention scores or image
token diversity. Through empirical studies, we observe that existing methods
often overlook the joint impact of pruning on both the current layer's output
(local) and the outputs of subsequent layers (global), leading to suboptimal
pruning decisions. To address this challenge, we propose Balanced Token Pruning
(BTP), a plug-and-play method for pruning vision tokens. Specifically, our
method utilizes a small calibration set to divide the pruning process into
multiple stages. In the early stages, our method emphasizes the impact of
pruning on subsequent layers, whereas in the deeper stages, the focus shifts
toward preserving the consistency of local outputs. Extensive experiments
across various LVLMs demonstrate the broad effectiveness of our approach on
multiple benchmarks. Our method achieves a 78% compression rate while
preserving 96.7% of the original models' performance on average. Our code is
available at
https://github.com/EmbodiedCity/NeurIPS2025-Balanced-Token-Pruning.
comment: Accepted by NeurIPS 2025
♻ ☆ Frequency Cam: Imaging Periodic Signals in Real-Time
Due to their high temporal resolution and large dynamic range, event cameras
are uniquely suited for the analysis of time-periodic signals in an image. In
this work we present an efficient and fully asynchronous event camera algorithm
for detecting the fundamental frequency at which image pixels flicker. The
algorithm employs a second-order digital infinite impulse response (IIR) filter
to perform an approximate per-pixel brightness reconstruction and is more
robust to high-frequency noise than the baseline method we compare to. We
further demonstrate that using the falling edge of the signal leads to more
accurate period estimates than the rising edge, and that for certain signals
interpolating the zero-level crossings can further increase accuracy. Our
experiments find that the outstanding capabilities of the camera in detecting
frequencies up to 64kHz for a single pixel do not carry over to full sensor
imaging as readout bandwidth limitations become a serious obstacle. This
suggests that a hardware implementation closer to the sensor will allow for
greatly improved frequency imaging. We discuss the important design parameters
for full-sensor frequency imaging and present Frequency Cam, an open-source
implementation as a ROS node that can run on a single core of a laptop CPU at
more than 50 million events per second. It produces results that are
qualitatively very similar to those obtained from the closed source vibration
analysis module in Prophesee's Metavision Toolkit. The code for Frequency Cam
and a demonstration video can be found at
https://github.com/ros-event-camera/frequency_cam
comment: 13 pages, 16 figures, one table
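The per-pixel pipeline can be illustrated with a toy estimator: integrate event
polarities into an approximate brightness signal, smooth it with a cascaded IIR
filter (the paper uses a second-order IIR; the two-stage form and coefficient
`a` below are assumptions), and time successive falling-edge zero crossings.

```python
import numpy as np

def pixel_frequency_from_events(t, p, a=0.2):
    """Estimate the flicker frequency of a single pixel from its event stream.

    t : (N,) sorted event timestamps in seconds for one pixel.
    p : (N,) event polarities in {+1, -1}.
    """
    b = np.cumsum(p).astype(float)          # approximate (log-)brightness
    s1 = np.zeros_like(b)
    s2 = np.zeros_like(b)
    for i in range(1, len(b)):              # two cascaded first-order IIR stages
        s1[i] = s1[i - 1] + a * (b[i] - s1[i - 1])
        s2[i] = s2[i - 1] + a * (s1[i] - s2[i - 1])
    x = s2 - s2.mean()                      # detrend the reconstruction
    # Falling-edge zero crossings: positive -> non-positive transitions.
    idx = np.where((x[:-1] > 0) & (x[1:] <= 0))[0]
    if len(idx) < 2:
        return 0.0
    periods = np.diff(t[idx])
    return 1.0 / periods.mean()             # fundamental frequency in Hz
```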
♻ ☆ Mesh-RFT: Enhancing Mesh Generation via Fine-grained Reinforcement Fine-Tuning NeurIPS 2025
Jian Liu, Jing Xu, Song Guo, Jing Li, Jingfeng Guo, Jiaao Yu, Haohan Weng, Biwen Lei, Xianghui Yang, Zhuo Chen, Fangqi Zhu, Tao Han, Chunchao Guo
Existing pretrained models for 3D mesh generation often suffer from data
biases and produce low-quality results, while global reinforcement learning
(RL) methods rely on object-level rewards that struggle to capture local
structure details. To address these challenges, we present Mesh-RFT, a novel
fine-grained reinforcement fine-tuning framework that employs Masked Direct
Preference Optimization (M-DPO) to enable localized refinement via
quality-aware face masking. To facilitate efficient quality evaluation, we
introduce an objective topology-aware scoring system to evaluate geometric
integrity and topological regularity at both object and face levels through two
metrics: Boundary Edge Ratio (BER) and Topology Score (TS). By integrating
these metrics into a fine-grained RL strategy, Mesh-RFT becomes the first
method to optimize mesh quality at the granularity of individual faces,
resolving localized errors while preserving global coherence. Experimental
results show that our M-DPO approach reduces Hausdorff Distance (HD) by 24.6%
and improves Topology Score (TS) by 3.8% over pre-trained models, while
outperforming global DPO methods with a 17.4% HD reduction and 4.9% TS gain.
These results demonstrate Mesh-RFT's ability to improve geometric integrity and
topological regularity, achieving new state-of-the-art performance in
production-ready mesh generation. Project Page:
https://hitcslj.github.io/mesh-rft/.
comment: NeurIPS 2025, Spotlight
♻ ☆ Occluded nuScenes: A Multi-Sensor Dataset for Evaluating Perception Robustness in Automated Driving
Sanjay Kumar, Tim Brophy, Reenu Mohandas, Eoin Martino Grua, Ganesh Sistu, Valentina Donzella, Ciaran Eising
Robust perception in automated driving requires reliable performance under
adverse conditions, where sensors may be affected by partial failures or
environmental occlusions. Although existing autonomous driving datasets
inherently contain sensor noise and environmental variability, very few enable
controlled, parameterised, and reproducible degradations across multiple
sensing modalities. This gap limits the ability to systematically evaluate how
perception and fusion architectures perform under well-defined adverse
conditions. To address this limitation, we introduce the Occluded nuScenes
Dataset, a novel extension of the widely used nuScenes benchmark. For the
camera modality, we release both the full and mini versions with four types of
occlusions, two adapted from public implementations and two newly designed. For
radar and LiDAR, we provide parameterised occlusion scripts that implement
three types of degradations each, enabling flexible and repeatable generation
of occluded data. This resource supports consistent, reproducible evaluation of
perception models under partial sensor failures and environmental interference.
By releasing the first multi-sensor occlusion dataset with controlled and
reproducible degradations, we aim to advance research on robust sensor fusion,
resilience analysis, and safety-critical perception in automated driving.
♻ ☆ Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
Recent advances in image-to-video (I2V) generation have achieved remarkable
progress in synthesizing high-quality, temporally coherent videos from static
images. Among I2V applications, human-centric video generation accounts for a
large portion. However, existing I2V models encounter difficulties
in maintaining identity consistency between the input human image and the
generated video, especially when the person in the video exhibits significant
expression changes and movements. This issue becomes critical when the human
face occupies merely a small fraction of the image. Since humans are highly
sensitive to identity variations, this poses a critical yet under-explored
challenge in I2V generation. In this paper, we propose Identity-Preserving
Reward-guided Optimization (IPRO), a novel video diffusion framework based on
reinforcement learning to enhance identity preservation. Instead of introducing
auxiliary modules or altering model architectures, our approach introduces a
direct and effective tuning algorithm that optimizes diffusion models using a
face identity scorer. To improve performance and accelerate convergence, our
method backpropagates the reward signal through the last steps of the sampling
chain, enabling richer gradient feedback. We also propose a novel facial
scoring mechanism that treats faces in ground-truth videos as facial feature
pools, providing multi-angle facial information to enhance generalization. A
KL-divergence regularization is further incorporated to stabilize training and
prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V
model and our in-house I2V model demonstrate the effectiveness of our method.
Our project and code are available at https://ipro-alimama.github.io/.
♻ ☆ OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection
The growing reliance on Artificial Intelligence (AI) in critical domains such
as healthcare demands robust mechanisms to ensure the trustworthiness of these
systems, especially when faced with unexpected or anomalous inputs. This paper
introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution
Detection (OpenMIBOOD), a comprehensive framework for evaluating
out-of-distribution (OOD) detection methods specifically in medical imaging
contexts. OpenMIBOOD includes three benchmarks from diverse medical domains,
encompassing 14 datasets divided into covariate-shifted in-distribution,
near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these
benchmarks, providing a standardized reference to advance the development and
fair comparison of OOD detection methods. Results reveal that findings from
broad-scale OOD benchmarks in natural image domains do not translate to medical
applications, underscoring the critical need for such benchmarks in the medical
field. By mitigating the risk of exposing AI models to inputs outside their
training distribution, OpenMIBOOD aims to support the advancement of reliable
and trustworthy AI systems in healthcare. The repository is available at
https://github.com/remic-othr/OpenMIBOOD.
comment: Updated results for NNGuide and ViM
♻ ☆ ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding NeurIPS 2025
Speculative decoding is a widely adopted technique for accelerating inference
in large language models (LLMs), yet its application to vision-language models
(VLMs) remains underexplored, with existing methods achieving only modest
speedups (<1.5x). This gap is increasingly significant as multimodal
capabilities become central to large-scale models. We hypothesize that large
VLMs can effectively filter redundant image information layer by layer without
compromising textual comprehension, whereas smaller draft models struggle to do
so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a
novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor
module to compress image tokens into a compact representation, which is
seamlessly integrated into the draft model's attention mechanism while
preserving original image positional information. Additionally, we extract a
global feature vector for each input image and augment all subsequent text
tokens with this feature to enhance multimodal coherence. To overcome the
scarcity of multimodal datasets with long assistant responses, we curate a
specialized training dataset by repurposing existing datasets and generating
extended outputs using the target VLM with modified prompts. Our training
strategy mitigates the risk of the draft model exploiting direct access to the
target model's hidden states, which could otherwise lead to shortcut learning
when training solely on target model outputs. Extensive experiments validate
ViSpec, achieving, to our knowledge, the first substantial speedup in VLM
speculative decoding. Code is available at
https://github.com/KangJialiang/ViSpec.
comment: NeurIPS 2025
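A minimal sketch of what a lightweight vision adaptor of this kind can look
like: a handful of learned query slots cross-attend to the full image-token
sequence and return a compact set of tokens for the draft model. Module names,
sizes, and the output projection are assumptions, not ViSpec's actual
architecture.

```python
import torch
import torch.nn as nn

class VisionTokenCompressor(nn.Module):
    """Compress a long sequence of image tokens into a few learned slots."""

    def __init__(self, dim=1024, num_slots=16, num_heads=8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_tokens):                 # (B, N, D), N can be thousands
        q = self.slots.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, image_tokens, image_tokens)
        return self.proj(compressed)                 # (B, num_slots, D) compact tokens
```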
♻ ☆ MODEM: A Morton-Order Degradation Estimation Mechanism for Adverse Weather Image Recovery NeurIPS 2025
Restoring images degraded by adverse weather remains a significant challenge
due to the highly non-uniform and spatially heterogeneous nature of
weather-induced artifacts, e.g., fine-grained rain streaks versus widespread
haze. Accurately estimating the underlying degradation can intuitively provide
restoration models with more targeted and effective guidance, enabling adaptive
processing strategies. To this end, we propose a Morton-Order Degradation
Estimation Mechanism (MODEM) for adverse weather image restoration. Central to
MODEM is the Morton-Order 2D-Selective-Scan Module (MOS2D), which integrates
Morton-coded spatial ordering with selective state-space models to capture
long-range dependencies while preserving local structural coherence.
Complementing MOS2D, we introduce a Dual Degradation Estimation Module (DDEM)
that disentangles and estimates both global and local degradation priors. These
priors dynamically condition the MOS2D modules, facilitating adaptive and
context-aware restoration. Extensive experiments and ablation studies
demonstrate that MODEM achieves state-of-the-art results across multiple
benchmarks and weather types, highlighting its effectiveness in modeling
complex degradation dynamics. Our code will be released at
https://github.com/hainuo-wang/MODEM.git.
comment: Accepted by NeurIPS 2025
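Morton (Z-order) ordering, the spatial ordering MOS2D builds on, can be
computed by interleaving the bits of each pixel's coordinates; the sketch below
only shows how a flattened feature map could be reordered before a 1D selective
scan (the scan itself and all model specifics are omitted).

```python
def morton_code(x, y, bits=16):
    """Interleave the bits of (x, y) into a single Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def morton_order_indices(height, width):
    """Raster indices of an H x W grid sorted by Morton code.

    Sorting tokens by Morton code keeps spatially adjacent pixels close
    together in the 1D sequence consumed by the state-space model.
    """
    coords = [(y, x) for y in range(height) for x in range(width)]
    return sorted(range(len(coords)),
                  key=lambda i: morton_code(coords[i][1], coords[i][0]))

# Usage sketch: reorder a flattened feature map, scan it, then undo the permutation.
# order = morton_order_indices(H, W); tokens_z = tokens[order]
```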
♻ ☆ FairGen: Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions
While Diffusion Models (DM) exhibit remarkable performance across various
image generative tasks, they nonetheless reflect the inherent bias present in
the training set. As DMs are now widely used in real-world applications, these
biases could perpetuate a distorted worldview and hinder opportunities for
minority groups. Existing methods for debiasing DMs usually require model
retraining with a human-crafted reference dataset or additional classifiers,
which suffer from two major limitations: (1) collecting reference datasets
incurs expensive annotation costs; (2) the debiasing performance is heavily
constrained by the quality of the reference dataset or the additional
classifier. To address the above limitations, we propose FairGen, a
plug-and-play method that learns attribute latent directions in a
self-discovering manner, thus eliminating the reliance on such reference
dataset. Specifically, FairGen consists of two parts: a set of attribute
adapters and a distribution indicator. Each adapter in the set aims to learn an
attribute latent direction, and is optimized via noise composition through a
self-discovering process. Then, the distribution indicator is multiplied by the
set of adapters to guide the generation process towards the prescribed
distribution. Our method enables debiasing multiple attributes in DMs
simultaneously, while remaining lightweight and easily integrable with other
DMs, eliminating the need for retraining. Extensive experiments on debiasing
gender, racial, and their intersectional biases show that our method
outperforms previous SOTA by a large margin.
♻ ☆ Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval
Recent progress in text-video retrieval has been largely driven by
contrastive learning. However, existing methods often overlook the effect of
the modality gap, which causes anchor representations to undergo in-place
optimization (i.e., optimization tension) that limits their alignment capacity.
Moreover, noisy hard negatives further distort the semantics of anchors. To
address these issues, we propose GARE, a Gap-Aware Retrieval framework that
introduces a learnable, pair-specific increment $\Delta_{ij}$ between text
$t_i$ and video $v_j$, redistributing gradients to relieve optimization tension
and absorb noise. We derive $\Delta_{ij}$ via a multivariate first-order Taylor
expansion of the InfoNCE loss under a trust-region constraint, showing that it
guides updates along locally consistent descent directions. A lightweight
neural module conditioned on the semantic gap couples increments across batches
for structure-aware correction. Furthermore, we regularize $\Delta$ through a
variational information bottleneck with relaxed compression, enhancing
stability and semantic consistency. Experiments on four benchmarks demonstrate
that GARE consistently improves alignment accuracy and robustness, validating
the effectiveness of gap-aware tension mitigation. Code is available at
https://github.com/musicman217/GARE-text-video-retrieval.
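A minimal sketch of the gap-aware idea: a small network predicts a
pair-specific increment from the text-video gap and shifts the text anchor
before the InfoNCE similarity is computed. The conditioning, network shape, and
temperature here are assumptions; the paper derives the increment from a Taylor
expansion under a trust-region constraint and adds an information-bottleneck
regularizer, which this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GapAwareInfoNCE(nn.Module):
    """InfoNCE with a learnable pair-specific increment on the text anchor."""

    def __init__(self, dim, tau=0.05):
        super().__init__()
        self.delta_net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))
        self.tau = tau

    def forward(self, text, video):                 # text: (B, D), video: (B, D)
        t = F.normalize(text, dim=-1)
        v = F.normalize(video, dim=-1)
        gap = v.unsqueeze(0) - t.unsqueeze(1)       # (B, B, D) pairwise gaps
        delta = self.delta_net(gap)                 # learnable increment per pair
        t_shift = F.normalize(t.unsqueeze(1) + delta, dim=-1)
        logits = (t_shift * v.unsqueeze(0)).sum(-1) / self.tau   # (B, B)
        labels = torch.arange(t.size(0), device=t.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))
```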
♻ ☆ Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation NeurIPS 2025
Huu Tien Nguyen, Dac Thai Nguyen, The Minh Duc Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Huy Hieu Pham, Johan Barthelemy, Minh Quan Tran, Thanh Tam Nguyen, Quoc Viet Hung Nguyen, Quynh Anh Chau, Hong Son Mai, Thanh Trung Nguyen, Phi Le Nguyen
Vision-Language Foundation Models (VLMs), trained on large-scale multimodal
datasets, have driven significant advances in Artificial Intelligence (AI) by
enabling rich cross-modal reasoning. Despite their success in general domains,
applying these models to medical imaging remains challenging due to the limited
availability of diverse imaging modalities and multilingual clinical data. Most
existing medical VLMs are trained on a subset of imaging modalities and focus
primarily on high-resource languages, thus limiting their generalizability and
clinical utility. To address these limitations, we introduce a novel
Vietnamese-language multimodal medical dataset consisting of 2,757 whole-body
PET/CT volumes from independent patients and their corresponding full-length
clinical reports. This dataset is designed to fill two pressing gaps in medical
AI development: (1) the lack of PET/CT imaging data in existing VLM training
corpora, which hinders the development of models capable of handling functional
imaging tasks; and (2) the underrepresentation of low-resource languages,
particularly the Vietnamese language, in medical vision-language research. To
the best of our knowledge, this is the first dataset to provide comprehensive
PET/CT-report pairs in Vietnamese. We further introduce a training framework to
enhance VLMs' learning, including data augmentation and expert-validated test
sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs
on downstream tasks. The experimental results show that incorporating our
dataset significantly improves the performance of existing VLMs. We believe
this dataset and benchmark will serve as a pivotal step in advancing the
development of more robust VLMs for medical imaging, especially for
low-resource languages and clinical use in Vietnamese healthcare. The source
code is available at https://github.com/AIoT-Lab-BKAI/ViPET-ReportGen.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS
2025)
♻ ☆ ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts NeurIPS 2025
Current image fusion methods struggle to address the composite degradations
encountered in real-world imaging scenarios and lack the flexibility to
accommodate user-specific requirements. In response to these challenges, we
propose a controllable image fusion framework with language-vision prompts,
termed ControlFusion, which adaptively neutralizes composite degradations. On
the one hand, we develop a degraded imaging model that integrates physical
imaging mechanisms, including the Retinex theory and atmospheric scattering
principle, to simulate composite degradations, thereby providing potential for
addressing real-world complex degradations from the data level. On the other
hand, we devise a prompt-modulated restoration and fusion network that
dynamically enhances features with degradation prompts, enabling our method to
accommodate composite degradation of varying levels. Specifically, considering
individual variations in quality perception of users, we incorporate a text
encoder to embed user-specified degradation types and severity levels as
degradation prompts. We also design a spatial-frequency collaborative visual
adapter that autonomously perceives degradations in source images, thus
eliminating the complete dependence on user instructions. Extensive experiments
demonstrate that ControlFusion outperforms SOTA fusion methods in fusion
quality and degradation handling, particularly in countering real-world and
compound degradations of various levels. The source code is publicly
available at https://github.com/Linfeng-Tang/ControlFusion.
comment: Accepted to NeurIPS 2025. The code are available at
https://github.com/Linfeng-Tang/ControlFusion
♻ ☆ VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
Zehao Ni, Yonghao He, Lingfeng Qian, Jilei Mao, Fa Fu, Wei Sui, Hu Su, Junran Peng, Zhipeng Wang, Bin He
In the context of imitation learning, visuomotor-based diffusion policy
learning is one of the main directions in robotic manipulation. Most of these
approaches rely on point clouds as observation inputs and construct scene
representations through point cloud feature learning, which enables them to
achieve remarkable accuracy. However, the existing literature lacks an in-depth
exploration of vision-only solutions that have significant potential. In this
paper, we propose a Vision-Only and single-view Diffusion Policy learning
method (VO-DP) that leverages pretrained visual foundation models to achieve
effective fusion of semantic and geometric features. We utilize intermediate
features from VGGT, which incorporate semantic features from DINOv2 and
geometric features from Alternating Attention blocks. Features are fused via
cross-attention and spatially compressed with a CNN to form the input to the
policy head. Extensive experiments demonstrate that VO-DP not only outperforms
the vision-only baseline DP significantly but also exhibits distinct
performance trends against the point cloud-based method DP3: in simulation
tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3 (64.0%)
and far higher than DP (34.8%), while in real-world tasks it reaches 87.9%,
outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin. Further
robustness evaluations confirm that VO-DP remains highly stable under varying
conditions including color, size, background, and lighting. Lastly, we
open-source a training library for robotic manipulation. Built on Accelerate,
this library supports multi-machine and multi-GPU parallel training, as well as
mixed precision training. It is compatible with visuomotor policies such as DP,
DP3 and VO-DP, and also supports the RoboTwin simulator.
♻ ☆ SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation
With the increasing ubiquity of AR/VR devices, the deployment of deep
learning models on edge devices has become a critical challenge. These devices
require real-time inference, low power consumption, and minimal latency. Many
framework designers face the conundrum of balancing efficiency and performance.
We design a light framework that adopts an encoder-decoder architecture and
introduces several key contributions aimed at improving both efficiency and
accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the
inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency
improvement. Moreover, we propose our SPLite decoder. This new architecture
significantly boosts the decoding process's frame rate by 3.1x on the Raspberry
Pi 5, while maintaining comparable accuracy. To further optimize performance, we
apply quantization-aware training, reducing memory usage while preserving
accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on
FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5
CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on
compound benchmark datasets, demonstrating comparable accuracy to
state-of-the-art approaches while significantly enhancing computational
efficiency.
comment: Accepted to AICCC 2025
♻ ☆ Rebellious Student: A Complementary Learning Framework for Background Feature Enhancement in Hyperspectral Anomaly Detection
A recent class of hyperspectral anomaly detection methods that can be trained
once on background datasets and then universally deployed -- without per-scene
retraining or parameter tuning -- has demonstrated remarkable efficiency and
robustness. Building upon this paradigm, we focus on the integration of
spectral and spatial cues and introduce a novel "Rebellious Student" framework
for complementary feature learning. Unlike conventional teacher-student
paradigms driven by imitation, our method intentionally trains the spatial
branch to diverge from the spectral teacher, thereby learning complementary
spatial patterns that the teacher fails to capture. A two-stage learning
strategy is adopted: (1) a spectral enhancement network is first trained via
reverse distillation to obtain robust background spectral representations; and
(2) a spatial network -- the rebellious student -- is subsequently optimized
using decorrelation losses that enforce feature orthogonality while maintaining
reconstruction fidelity to avoid irrelevant noise. Once trained, the framework
enhances both spectral and spatial background features, enabling parameter-free
and training-free anomaly detection when paired with conventional detectors.
Experiments on the HAD100 benchmark show substantial improvements over several
established baselines with modest computational overhead, confirming the
effectiveness of the proposed complementary learning paradigm. Our code is
publicly available at https://github.com/xjpp2016/FERS.
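The decorrelation idea can be sketched as a cross-correlation penalty between
the spatial student's features and the frozen spectral teacher's features,
paired with a reconstruction term that keeps the student informative; the exact
losses and normalization in the paper may differ.

```python
import torch

def decorrelation_loss(student_feat, teacher_feat, eps=1e-6):
    """Penalize cross-correlation between student and teacher features.

    student_feat, teacher_feat : (N, D) per-pixel feature matrices.
    """
    s = (student_feat - student_feat.mean(0)) / (student_feat.std(0) + eps)
    t = (teacher_feat - teacher_feat.mean(0)) / (teacher_feat.std(0) + eps)
    cross_corr = (s.t() @ t) / s.size(0)          # (D, D) cross-correlation matrix
    return cross_corr.pow(2).mean()               # drive every entry toward zero

# Total objective (sketch): a reconstruction term keeps the spatial features
# faithful to the background, while the decorrelation term keeps them
# complementary to the spectral teacher.
# loss = recon_loss(student_out, background) + lam * decorrelation_loss(s_feat, t_feat)
```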
♻ ☆ SnapMoGen: Human Motion Generation from Expressive Texts
Text-to-motion generation has experienced remarkable progress in recent
years. However, current approaches remain limited to synthesizing motion from
short or general text prompts, primarily due to dataset constraints. This
limitation undermines fine-grained controllability and generalization to unseen
prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset
featuring high-quality motion capture data paired with accurate, expressive
textual annotations. The dataset comprises 20K motion clips totaling 44 hours,
accompanied by 122K detailed textual descriptions averaging 48 words per
description (vs. 12 words in HumanML3D). Importantly, these motion clips
preserve their original temporal continuity as segments of longer sequences,
facilitating research in long-term motion generation and blending. We also
improve upon previous generative masked modeling approaches. Our model,
MoMask++, transforms motion into multi-scale token sequences that better
exploit the token capacity, and learns to generate all tokens using a single
generative masked transformer. MoMask++ achieves state-of-the-art performance
on both HumanML3D and SnapMoGen benchmarks. Additionally, we demonstrate the
ability to process casual user prompts by employing an LLM to reformat inputs
to align with the expressivity and narration style of SnapMoGen. Project
webpage: https://snap-research.github.io/SnapMoGen/
comment: Project Webpage: https://snap-research.github.io/SnapMoGen/
♻ ☆ The Faiss library
Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, Hervé Jégou
Vector databases typically manage large collections of embedding vectors.
Currently, AI applications are growing rapidly, and so is the number of
embeddings that need to be stored and indexed. The Faiss library is dedicated
to vector similarity search, a core functionality of vector databases. Faiss is
a toolkit of indexing methods and related primitives used to search, cluster,
compress and transform vectors. This paper describes the trade-off space of
vector search and the design principles of Faiss in terms of structure,
approach to optimization and interfacing. We benchmark key features of the
library and discuss a few selected applications to highlight its broad
applicability.
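Typical Faiss usage pairs an exact flat index with a compressed, inverted-file
index to navigate the speed/recall trade-off discussed above; the parameter
values below (dimension, number of lists, PQ size, nprobe) are illustrative
choices, not recommendations from the paper.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                              # embedding dimensionality
xb = np.random.rand(100_000, d).astype("float32")    # database vectors
xq = np.random.rand(5, d).astype("float32")          # query vectors

# Exact baseline: a flat L2 index.
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# Common speed/accuracy trade-off: IVF coarse quantizer + product quantization.
nlist, m = 256, 16                                   # coarse cells, PQ subquantizers
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per PQ code
ivfpq.train(xb)                                      # learn centroids and codebooks
ivfpq.add(xb)
ivfpq.nprobe = 16                                    # cells visited per query

D, I = ivfpq.search(xq, 5)                           # distances and ids of 5 nearest
```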
♻ ☆ Sign-In to the Lottery: Reparameterizing Sparse Training From Scratch NeurIPS 2025
The performance gap between training sparse neural networks from scratch
(PaI) and dense-to-sparse training presents a major roadblock for efficient
deep learning. According to the Lottery Ticket Hypothesis, PaI hinges on
finding a problem-specific parameter initialization. As we show, to this end,
determining correct parameter signs is sufficient. Yet, they remain elusive to
PaI. To address this issue, we propose Sign-In, which employs a dynamic
reparameterization that provably induces sign flips. Such sign flips are
complementary to the ones that dense-to-sparse training can accomplish,
rendering Sign-In as an orthogonal method. While our experiments and theory
suggest performance improvements of PaI, they also carve out the main open
challenge to close the gap between PaI and dense-to-sparse training.
comment: Accepted at NeurIPS 2025
♻ ☆ A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Jiacheng Liu, Xinyu Wang, Yuqi Lin, Zhikai Wang, Peiru Wang, Peiliang Cai, Qinming Zhou, Zhengan Yan, Zexuan Yan, Zhengyi Shi, Chang Zou, Yue Ma, Linfeng Zhang
Diffusion Models have become a cornerstone of modern generative AI for their
exceptional generation quality and controllability. However, their inherent
multi-step iterations and complex backbone networks lead to
prohibitive computational overhead and generation latency, forming a major
bottleneck for real-time applications. Although existing acceleration
techniques have made progress, they still face challenges such as limited
applicability, high training costs, or quality degradation.
Against this backdrop, Diffusion Caching offers a promising
training-free, architecture-agnostic, and efficient inference paradigm. Its
core mechanism identifies and reuses intrinsic computational redundancies in
the diffusion process. By enabling feature-level cross-step reuse and
inter-layer scheduling, it reduces computation without modifying model
parameters. This paper systematically reviews the theoretical foundations and
evolution of Diffusion Caching and proposes a unified framework for its
classification and analysis.
Through comparative analysis of representative methods, we show that
Diffusion Caching evolves from static reuse to dynamic prediction. This trend
enhances caching flexibility across diverse tasks and
enables integration with other acceleration techniques such as sampling
optimization and model distillation, paving the way for a unified, efficient
inference framework for future multimodal and interactive applications. We
argue that this paradigm will become a key enabler of real-time and efficient
generative AI, injecting new vitality into both theory and practice of
Efficient Generative Intelligence.
comment: 22 pages,2 figures
♻ ☆ Quantization-Aware Neuromorphic Architecture for Efficient Skin Disease Classification on Resource-Constrained Devices
Accurate and efficient skin lesion classification on edge devices is critical
for accessible dermatological care but remains challenging due to
computational, energy, and privacy constraints. We introduce QANA, a novel
quantization-aware neuromorphic architecture for incremental skin lesion
classification on resource-limited hardware. QANA effectively integrates ghost
modules, efficient channel attention, and squeeze-and-excitation blocks for
robust feature representation with low-latency and energy-efficient inference.
Its quantization-aware head and spike-compatible transformations enable
seamless conversion to spiking neural networks (SNNs) and deployment on
neuromorphic platforms. Evaluation on the large-scale HAM10000 benchmark and a
real-world clinical dataset shows that QANA achieves 91.6% Top-1 accuracy and
82.4% macro F1 on HAM10000, and 90.8%/81.7% on the clinical dataset,
significantly outperforming state-of-the-art CNN-to-SNN models under fair
comparison. Deployed on BrainChip Akida hardware, QANA achieves 1.5 ms
inference latency and 1.7 mJ energy per image, reducing inference latency and
energy use by over 94.6% and 98.6%, respectively, compared to GPU-based CNNs,
and surpassing state-of-the-art CNN-to-SNN conversion baselines. These results
demonstrate the
effectiveness of QANA for accurate, real-time, and privacy-sensitive medical
analysis in edge environments.
♻ ☆ A Style-Based Profiling Framework for Quantifying the Synthetic-to-Real Gap in Autonomous Driving Datasets
Ensuring the reliability of autonomous driving perception systems requires
extensive environment-based testing, yet real-world execution is often
impractical. Synthetic datasets have therefore emerged as a promising
alternative, offering advantages such as cost-effectiveness, bias-free
labeling, and controllable scenarios. However, the domain gap between synthetic
and real-world datasets remains a major obstacle to model generalization. To
address this challenge from a data-centric perspective, this paper introduces a
profile extraction and discovery framework for characterizing the style
profiles underlying both synthetic and real image datasets. We propose Style
Embedding Distribution Discrepancy (SEDD) as a novel evaluation metric. Our
framework combines Gram matrix-based style extraction with metric learning
optimized for intra-class compactness and inter-class separation to extract
style embeddings. Furthermore, we establish a benchmark using publicly
available datasets. Experiments are conducted on a variety of datasets and
sim-to-real methods, and the results show that our method is capable of
quantifying the synthetic-to-real gap. This work provides a standardized
profiling-based quality control paradigm that enables systematic diagnosis and
targeted enhancement of synthetic datasets, advancing future development of
data-driven autonomous driving systems.
comment: 7 pages, 4 figures
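Gram-matrix style extraction of the kind this framework builds on can be
sketched as follows: per-layer Gram matrices of CNN activations are flattened
into a style descriptor. The frozen VGG16 backbone and layer choice are
assumptions, and the paper additionally trains a metric-learning head on top of
such statistics, which this sketch omits.

```python
import torch
import torchvision.models as models

def gram_style_features(images, layer_idx=(3, 8, 15)):
    """Extract Gram-matrix style descriptors from a frozen VGG16.

    images : (B, 3, H, W) normalized image batch.
    Returns a (B, F) style embedding that concatenates the flattened upper
    triangles of per-layer Gram matrices.
    """
    vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
    feats, x = [], images
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in layer_idx:
                b, c, h, w = x.shape
                f = x.reshape(b, c, h * w)
                gram = torch.bmm(f, f.transpose(1, 2)) / (c * h * w)   # (B, C, C)
                iu = torch.triu_indices(c, c, device=x.device)
                feats.append(gram[:, iu[0], iu[1]])
    return torch.cat(feats, dim=1)
```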
♻ ☆ Learning Contrastive Feature Representations for Facial Action Unit Detection
For the Facial Action Unit (AU) detection task, accurately capturing the
subtle facial differences between distinct AUs is essential for reliable
detection. Additionally, AU detection faces challenges from class imbalance and
the presence of noisy or false labels, which undermine detection accuracy. In
this paper, we introduce a novel contrastive learning framework aimed at AU
detection that incorporates both self-supervised and supervised signals,
thereby enhancing the learning of discriminative features for accurate AU
detection. To tackle the class imbalance issue, we employ a negative sample
re-weighting strategy that adjusts the step size of updating parameters for
minority and majority class samples. Moreover, to address the challenges posed
by noisy and false AU labels, we employ a sampling technique that encompasses
three distinct types of positive sample pairs. This enables us to inject
self-supervised signals into the supervised signal, effectively mitigating the
adverse effects of noisy labels. Our experimental assessments, conducted on
five widely-utilized benchmark datasets (BP4D, DISFA, BP4D+, GFT and
Aff-Wild2), underscore the superior performance of our approach compared to
state-of-the-art methods of AU detection. Our code is available at
https://github.com/Ziqiao-Shang/AUNCE.
♻ ☆ EasyOcc: 3D Pseudo-Label Supervision for Fully Self-Supervised Semantic Occupancy Prediction Models
Self-supervised models have recently achieved notable advancements,
particularly in the domain of semantic occupancy prediction. These models
utilize sophisticated loss computation strategies to compensate for the absence
of ground-truth labels. For instance, techniques such as novel view synthesis,
cross-view rendering, and depth estimation have been explored to address the
issue of semantic and depth ambiguity. However, such techniques typically incur
high computational costs and memory usage during the training stage, especially
in the case of novel view synthesis. To mitigate these issues, we propose 3D
pseudo-ground-truth labels generated by the foundation models Grounded-SAM and
Metric3Dv2, and harness temporal information for label densification. Our 3D
pseudo-labels can be easily integrated into existing models, which yields
substantial performance improvements, with mIoU increasing by 45%, from 9.73
to 14.09, when implemented into the OccNeRF model. This stands in contrast to
earlier advancements in the field, which are often not readily transferable to
other architectures. Additionally, we propose a streamlined model, EasyOcc,
achieving 13.86 mIoU. This model conducts learning solely from our labels,
avoiding complex rendering strategies mentioned previously. Furthermore, our
method enables models to attain state-of-the-art performance when evaluated on
the full scene without applying the camera mask, with EasyOcc achieving 7.71
mIoU, outperforming the previous best model by 31%. These findings highlight
the critical importance of foundation models, temporal context, and the choice
of loss computation space in self-supervised learning for comprehensive scene
understanding.
♻ ☆ PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling NeurIPS 2025
Audio-visual event parsing plays a crucial role in understanding multimodal
video content, but existing methods typically rely on offline processing of
entire videos with huge model sizes, limiting their real-time applicability. We
introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for
parsing audio, visual, and audio-visual events by sequentially analyzing
incoming video streams. The On-AVEP task necessitates models with two key
capabilities: (1) Accurate online inference, to effectively distinguish events
with unclear and limited context in online settings, and (2) Real-time
efficiency, to balance high performance with computational constraints. To
cultivate these, we propose the Predictive Future Modeling (PreFM) framework
featured by (a) predictive multimodal future modeling to infer and integrate
beneficial future audio-visual cues, thereby enhancing contextual understanding
and (b) modality-agnostic robust representation along with focal temporal
prioritization to improve precision and generalization. Extensive experiments
on the UnAV-100 and LLP datasets show PreFM significantly outperforms
state-of-the-art methods by a large margin with significantly fewer parameters,
offering an insightful approach for real-time multimodal video understanding.
Code is available at https://github.com/XiaoYu-1123/PreFM.
comment: This paper is accepted by 39th Conference on Neural Information
Processing Systems (NeurIPS 2025)
♻ ☆ VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions NeurIPS 2025
Despite the success of Vision-Language Models (VLMs) like CLIP in aligning
vision and language, their proficiency in detailed, fine-grained visual
comprehension remains a key challenge. We present CLIP-IN, a novel framework
that bolsters CLIP's fine-grained perception through two core innovations.
Firstly, we leverage instruction-editing datasets, originally designed for
image manipulation, as a unique source of hard negative image-text pairs.
Coupled with a symmetric hard negative contrastive loss, this enables the model
to effectively distinguish subtle visual-semantic differences. Secondly,
CLIP-IN incorporates long descriptive captions, utilizing rotary positional
encodings to capture rich semantic context often missed by standard CLIP. Our
experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP
benchmark and various fine-grained visual recognition tasks, without
compromising robust zero-shot performance on broader classification and
retrieval tasks. Critically, integrating CLIP-IN's visual representations into
Multimodal Large Language Models significantly reduces visual hallucinations
and enhances reasoning abilities. This work underscores the considerable
potential of synergizing targeted, instruction-based contrastive learning with
comprehensive descriptive information to elevate the fine-grained understanding
of VLMs.
comment: Accepted to NeurIPS 2025
♻ ☆ Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning
Realistic 3D indoor scene synthesis is vital for embodied AI and digital
content creation. It can be naturally divided into two subtasks: object
generation and layout generation. While recent generative models have
significantly advanced object-level quality and controllability, layout
generation remains challenging due to limited datasets. Existing methods either
overfit to these datasets or rely on predefined constraints to optimize
numerical layouts, which sacrifices flexibility. As a result, they fail to generate
scenes that are both open-vocabulary and aligned with fine-grained user
instructions. We introduce DirectLayout, a framework that directly generates
numerical 3D layouts from text descriptions using generalizable spatial
reasoning of large language models (LLMs). DirectLayout decomposes the
generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting
it into 3D space, and refining object placements. To enable explicit spatial
reasoning and help the model grasp basic principles of object placement, we
employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset.
Additionally, we design CoT-Grounded Generative Layout Reward to enhance
generalization and spatial planning. During inference, DirectLayout addresses
asset-layout mismatches via Iterative Asset-Layout Alignment through in-context
learning. Extensive experiments demonstrate that DirectLayout achieves
impressive semantic consistency, generalization and physical plausibility.
comment: Project Page: https://directlayout.github.io/
♻ ☆ Revisiting End-to-End Learning with Slide-level Supervision in Computational Pathology NeurIPS 2025
Pre-trained encoders for offline feature extraction followed by multiple
instance learning (MIL) aggregators have become the dominant paradigm in
computational pathology (CPath), benefiting cancer diagnosis and prognosis.
However, performance limitations arise from the absence of encoder fine-tuning
for downstream tasks and disjoint optimization with MIL. While slide-level
supervised end-to-end (E2E) learning is an intuitive solution to this issue, it
faces challenges such as high computational demands and suboptimal results.
These limitations motivate us to revisit E2E learning. We argue that prior work
neglects inherent E2E optimization challenges, leading to performance
disparities compared to traditional two-stage methods. In this paper, we
pioneer the elucidation of the optimization challenge caused by sparse-attention
MIL and propose a novel MIL called ABMILX. It mitigates this problem through
global correlation-based attention refinement and multi-head mechanisms. With
the efficient multi-scale random patch sampling strategy, an E2E trained ResNet
with ABMILX surpasses SOTA foundation models under the two-stage paradigm
across multiple challenging benchmarks, while remaining computationally
efficient (<10 RTX3090 hours). We show the potential of E2E learning in CPath
and call for greater research focus in this area. The code is
https://github.com/DearCaat/E2E-WSI-ABMILX.
comment: published on NeurIPS 2025
♻ ☆ Vision-Centric Activation and Coordination for Multimodal Large Language Models
Multimodal large language models (MLLMs) integrate image features from visual
encoders with LLMs, demonstrating advanced comprehension capabilities. However,
mainstream MLLMs are solely supervised by the next-token prediction of textual
tokens, neglecting critical vision-centric information essential for analytical
abilities. To tackle this dilemma, we introduce VaCo, which optimizes MLLM
representations through Vision-Centric activation and Coordination from
multiple vision foundation models (VFMs). VaCo introduces visual discriminative
alignment to integrate task-aware perceptual features extracted from VFMs,
thereby unifying the optimization of both textual and visual outputs in MLLMs.
Specifically, we incorporate the learnable Modular Task Queries (MTQs) and
Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals
under the supervision of diverse VFMs. To coordinate representation conflicts
across VFMs, the crafted Token Gateway Mask (TGM) restricts the information
flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo
significantly improves the performance of different MLLMs on various
benchmarks, showcasing its superior capabilities in visual comprehension.
♻ ☆ MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment
Most existing underwater instance segmentation approaches are constrained by
closed-vocabulary prediction, limiting their ability to recognize novel marine
categories. To support evaluation, we introduce MARIS (Marine Open-Vocabulary
Instance Segmentation), the first large-scale fine-grained benchmark for
underwater Open-Vocabulary (OV) segmentation, featuring a limited set of seen
categories and diverse unseen categories. Although OV segmentation has shown
promise on natural images, our analysis reveals that transfer to underwater
scenes suffers from severe visual degradation (e.g., color attenuation) and
semantic misalignment caused by a lack of underwater class definitions. To address
these issues, we propose a unified framework with two complementary components.
The Geometric Prior Enhancement Module (GPEM) leverages stable
part-level and structural cues to maintain object consistency under degraded
visual conditions. The Semantic Alignment Injection Mechanism (SAIM)
enriches language embeddings with domain-specific priors, mitigating semantic
ambiguity and improving recognition of unseen categories. Experiments show that
our framework consistently outperforms existing OV baselines in both in-domain
and cross-domain settings on MARIS, establishing a strong foundation for future
underwater perception research.
♻ ☆ VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning NeurIPS 2025
Few-shot learning (FSL) aims to recognize novel concepts from only a few
labeled support samples. Recent studies enhance support features by
incorporating additional semantic information or designing complex semantic
fusion modules. However, they still suffer from hallucinating semantics that
contradict the visual evidence due to the lack of grounding in actual
instances, resulting in noisy guidance and costly corrections. To address these
issues, we propose a novel framework, bridging Vision and Text with LLMs for
Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts
conditioned on Large Language Models (LLMs) and support images, seamlessly
integrating them through a geometry-aware alignment. It mainly consists of
Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment
(CGA). Specifically, the CIP conditions an LLM on both class names and support
images to generate precise class descriptions iteratively in a single
structured reasoning pass. These descriptions not only enrich the semantic
understanding of novel classes but also enable the zero-shot synthesis of
semantically consistent images. The descriptions and synthetic images act
respectively as complementary textual and visual prompts, providing high-level
class semantics and low-level intra-class diversity to compensate for limited
support data. Furthermore, the CGA jointly aligns the fused textual, support,
and synthetic visual representations by minimizing the kernelized volume of the
3-dimensional parallelotope they span. It captures global and nonlinear
relationships among all representations, enabling structured and consistent
multimodal integration. The proposed VT-FSL method establishes new
state-of-the-art performance across ten diverse benchmarks, including standard,
cross-domain, and fine-grained few-shot learning scenarios. Code is available
at https://github.com/peacelwh/VT-FSL.
comment: Accepted by NeurIPS 2025
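The geometric alignment term can be illustrated with the plain (non-kernelized)
Gram-determinant volume of the parallelotope spanned by the three
representations; the batch handling and the assumption of L2-normalized inputs
below are illustrative choices, and the paper's kernelized variant is omitted.

```python
import torch

def parallelotope_volume(text_feat, support_feat, synth_feat, eps=1e-8):
    """Volume of the 3D parallelotope spanned by three representation vectors.

    vol = sqrt(det(G)), where G_ij = <v_i, v_j> is the Gram matrix. Minimizing
    this volume pulls the textual, support, and synthetic visual representations
    toward a common subspace.

    Each input is a (B, D) batch of L2-normalized feature vectors.
    """
    V = torch.stack([text_feat, support_feat, synth_feat], dim=1)  # (B, 3, D)
    G = torch.bmm(V, V.transpose(1, 2))                            # (B, 3, 3) Gram
    vol_sq = torch.det(G).clamp_min(eps)
    return vol_sq.sqrt().mean()
```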
♻ ☆ Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning
Reward-based fine-tuning of video diffusion models is an effective approach
to improve the quality of generated videos, as it can fine-tune models without
requiring real-world video datasets. However, its gains can be limited to
specific aspects of performance because conventional reward functions are mainly aimed at
enhancing the quality across the whole generated video sequence, such as
aesthetic appeal and overall consistency. Notably, the temporal consistency of
the generated video often suffers when applying previous approaches to
image-to-video (I2V) generation tasks. To address this limitation, we propose
Video Consistency Distance (VCD), a novel metric designed to enhance temporal
consistency, and fine-tune a model with the reward-based fine-tuning framework.
To achieve coherent temporal consistency relative to a conditioning image, VCD
is defined in the frequency space of video frame features to capture frame
information effectively through frequency-domain analysis. Experimental results
across multiple I2V datasets demonstrate that fine-tuning a video generation
model with VCD significantly enhances temporal consistency without degrading
other performance metrics compared to previous methods.
comment: 17 pages
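A toy version of a frequency-space consistency distance, comparing each
generated frame's feature spectrum to that of the conditioning image; the exact
features, transform, and aggregation used for VCD are not given here and are
assumptions for illustration.

```python
import torch

def video_consistency_distance(frame_feats, cond_feat):
    """Frequency-domain consistency distance between frames and the condition.

    frame_feats : (T, D) per-frame feature vectors of the generated video.
    cond_feat   : (D,)   feature vector of the conditioning image.
    """
    spec_frames = torch.fft.rfft(frame_feats, dim=-1)      # per-frame spectra
    spec_cond = torch.fft.rfft(cond_feat, dim=-1)          # conditioning spectrum
    # Compare magnitude spectra so the distance is insensitive to phase shifts.
    diff = spec_frames.abs() - spec_cond.abs().unsqueeze(0)
    return diff.pow(2).mean()
```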
♻ ☆ Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering
The Earth's surface is constantly changing, and detecting these changes
provides valuable insights that benefit various aspects of human society. While
traditional change detection methods have been employed to detect changes from
bi-temporal images, these approaches typically require expert knowledge for
accurate interpretation. To enable broader and more flexible access to change
information by non-expert users, the task of Change Detection Visual Question
Answering (CDVQA) has been introduced. However, existing CDVQA methods have
been developed under the assumption that training and testing datasets share
similar distributions. This assumption does not hold in real-world
applications, where domain shifts often occur. In this paper, the CDVQA task is
revisited with a focus on addressing domain shift. To this end, a new
multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate
domain generalization research in CDVQA. Furthermore, a novel state space
model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The
TCSSM framework is designed to leverage both bi-temporal imagery and
geo-disaster-related textual information in a unified manner to extract
domain-invariant features across domains. The input-dependent parameters of
TCSSM are dynamically predicted from both the bi-temporal images and the
geo-disaster-related description, thereby facilitating the alignment between
bi-temporal visual data and the associated textual descriptions. Extensive
experiments are conducted to evaluate the proposed method against
state-of-the-art models, and superior performance is consistently demonstrated.
The code and dataset will be made publicly available upon acceptance at
https://github.com/Elman295/TCSSM.
♻ ☆ Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models
Text-to-image diffusion models have gained widespread application across
various domains, demonstrating remarkable creative potential. However, the
strong generalization capabilities of diffusion models can inadvertently lead
to the generation of not-safe-for-work (NSFW) content, posing significant risks
to their safe deployment. While several concept erasure methods have been
proposed to mitigate the issue associated with NSFW content, a comprehensive
evaluation of their effectiveness across various scenarios remains absent. To
bridge this gap, we introduce a full-pipeline toolkit specifically designed for
concept erasure and conduct the first systematic study of NSFW concept erasure
methods. By examining the interplay between the underlying mechanisms and
empirical observations, we provide in-depth insights and practical guidance for
the effective application of concept erasure methods in various real-world
scenarios, with the aim of advancing the understanding of content safety in
diffusion models and establishing a solid foundation for future research and
development in this critical area.
♻ ☆ LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer
Universal image restoration (UIR) aims to recover images degraded by unknown
mixtures while preserving semantics -- conditions under which discriminative
restorers and UNet-based diffusion priors often oversmooth, hallucinate, or
drift. We present LucidFlux, a caption-free UIR framework that adapts a large
diffusion transformer (Flux.1) without image captions. LucidFlux introduces a
lightweight dual-branch conditioner that injects signals from the degraded
input and a lightly restored proxy to respectively anchor geometry and suppress
artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed
to route these cues across the backbone's hierarchy, in order to yield
coarse-to-fine and context-aware updates that protect the global structure
while recovering texture. After that, to avoid the latency and instability of
text prompts or MLLM captions, we enforce caption-free semantic alignment via
SigLIP features extracted from the proxy. A scalable curation pipeline further
filters large-scale data for structure-rich supervision. Across synthetic and
in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source
and commercial baselines, and ablation studies verify the necessity of each
component. LucidFlux shows that, for large DiTs, when, where, and what to
condition on -- rather than adding parameters or relying on text prompts -- is
the governing lever for robust and caption-free universal image restoration in
the wild.
comment: Project Page: https://w2genai-lab.github.io/LucidFlux
♻ ☆ CBDiff:Conditional Bernoulli Diffusion Models for Image Forgery Localization
Image Forgery Localization (IFL) is a crucial task in image forensics, aimed
at accurately identifying manipulated or tampered regions within an image at
the pixel level. Existing methods typically generate a single deterministic
localization map, which often lacks the precision and reliability required for
high-stakes applications such as forensic analysis and security surveillance.
To enhance the credibility of predictions and mitigate the risk of errors, we
introduce an advanced Conditional Bernoulli Diffusion Model (CBDiff). Given a
forged image, CBDiff generates multiple diverse and plausible localization
maps, thereby offering a richer and more comprehensive representation of the
forgery distribution. This approach addresses the uncertainty and variability
inherent in tampered regions. Furthermore, CBDiff innovatively incorporates
Bernoulli noise into the diffusion process to more faithfully reflect the
inherent binary and sparse properties of forgery masks. Additionally, CBDiff
introduces a Time-Step Cross-Attention (TSCAttention), which is specifically
designed to leverage semantic feature guidance with temporal steps to improve
manipulation detection. Extensive experiments on eight public benchmark
datasets demonstrate that CBDiff significantly outperforms existing
state-of-the-art methods, highlighting its strong potential for real-world
deployment.
♻ ☆ PlantSegNeRF: A few-shot, cross-species method for plant 3D instance point cloud reconstruction via joint-channel NeRF with multi-view image instance matching
Xin Yang, Ruiming Du, Hanyang Huang, Jiayang Xie, Pengyao Xie, Leisen Fang, Ziyue Guo, Nanjun Jiang, Yu Jiang, Haiyan Cen
Organ segmentation of plant point clouds is a prerequisite for the
high-resolution and accurate extraction of organ-level phenotypic traits.
Although the fast development of deep learning has boosted much research on
segmentation of plant point clouds, the existing techniques for organ
segmentation still face limitations in resolution, segmentation accuracy, and
generalizability across various plant species. In this study, we proposed a
novel approach called plant segmentation neural radiance fields (PlantSegNeRF),
aiming to directly generate high-precision instance point clouds from
multi-view RGB image sequences for a wide range of plant species. PlantSegNeRF
performed 2D instance segmentation on the multi-view images to generate
instance masks for each organ with a corresponding ID. The multi-view instance
IDs corresponding to the same plant organ were then matched and refined using a
specially designed instance matching module. The instance NeRF was developed to
render an implicit scene, containing color, density, semantic and instance
information. The implicit scene was ultimately converted into high-precision
plant instance point clouds based on the volume density. The results showed
that, in semantic segmentation of point clouds, PlantSegNeRF outperformed the
commonly used methods, demonstrating an average improvement of 16.1%, 18.3%,
17.8%, and 24.2% in precision, recall, F1-score, and IoU compared to the
second-best results on structurally complex species. More importantly,
PlantSegNeRF exhibited significant advantages in plant point cloud instance
segmentation tasks. Across all plant species, it achieved average improvements
of 11.7%, 38.2%, 32.2%, and 25.3% in mPrec, mRec, mCov, and mWCov, respectively.
This study extends organ-level plant phenotyping and provides a
high-throughput way to supply high-quality 3D data for the development of
large-scale models in plant science.
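One plausible way to convert such an implicit scene into an instance point cloud is to
query the field on a voxel grid, keep points whose volume density exceeds a threshold,
and label each kept point with the arg-max of its instance logits. The field interface,
threshold, and resolution below are assumptions for illustration, not PlantSegNeRF's
exact export procedure.

    import numpy as np

    def extract_instance_points(field, bbox_min, bbox_max, res=64, sigma_thr=5.0):
        """field(xyz) -> (density (N,), instance_logits (N, K)) for query points xyz (N, 3)."""
        axes = [np.linspace(lo, hi, res) for lo, hi in zip(bbox_min, bbox_max)]
        grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
        density, inst_logits = field(grid)
        keep = density > sigma_thr                      # keep occupied space only
        return grid[keep], inst_logits[keep].argmax(-1) # points and instance IDs

    # toy field: a sphere assigned to instance 0, empty space elsewhere
    toy = lambda p: ((np.linalg.norm(p, axis=-1) < 0.5) * 10.0, np.zeros((len(p), 2)))
    pts, ids = extract_instance_points(toy, (-1, -1, -1), (1, 1, 1), res=32)
    print(pts.shape, ids.shape)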
♻ ☆ FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies NeurIPS 2025
The increasing realism of synthetic images generated by advanced models such
as VAEs, GANs, and LDMs poses significant challenges for synthetic image
detection. To address this issue, we explore two artifact types introduced
during the generation process: (1) latent distribution deviations and (2)
decoding-induced smoothing effects, which manifest as inconsistencies in local
textures, edges, and color transitions. Leveraging local pixel dependency
(LPD) properties rooted in Markov Random Fields, we reconstruct synthetic
images using neighboring pixel information to expose disruptions in texture
continuity and edge coherence. Building upon LPD, we propose FerretNet, a
lightweight neural network with only 1.1M parameters that delivers efficient
and robust synthetic image detection. Extensive experiments demonstrate that
FerretNet, trained exclusively on the 4-class ProGAN dataset, achieves an
average accuracy of 97.1% on an open-world benchmark comprising 22 generative
models. Our code and datasets are publicly available at
https://github.com/xigua7105/FerretNet.
comment: 9 pages, 4 figures, 8 tables, accepted at NeurIPS 2025
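A minimal sketch of an LPD-style probe, assuming a simple 3x3 neighbour-mean predictor:
each pixel is predicted from its eight neighbours and the residual is inspected, since
decoder smoothing tends to leave a weaker, more uniform residual. The averaging kernel
is an illustrative stand-in, not FerretNet's actual reconstruction operator.

    import numpy as np
    from scipy.ndimage import convolve

    def neighbour_residual(gray):
        """gray: float image in [0, 1]; returns |pixel - neighbour prediction|."""
        k = np.ones((3, 3)) / 8.0
        k[1, 1] = 0.0                      # predict the centre from its 8 neighbours
        return np.abs(gray - convolve(gray, k, mode="reflect"))

    img = np.random.default_rng(0).random((128, 128))
    res = neighbour_residual(img)          # over-smoothed decoder outputs tend to show
    print(res.mean())                      # lower and more uniform residual energy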
♻ ☆ OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts
The ability to segment objects based on open-ended language prompts remains a
critical challenge, requiring models to ground textual semantics into precise
spatial masks while handling diverse and unseen categories. We present
OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model
v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings
extracted from a lightweight vision-language model (VLM). Our approach is
guided by four key principles: i) Unified prompting: OpenWorldSAM supports a
diverse range of prompts, including category-level and sentence-level language
descriptions, providing a flexible interface for various segmentation tasks.
ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we
train only 4.5 million parameters on the COCO-stuff dataset, achieving
remarkable resource efficiency. iii) Instance Awareness: We enhance the model's
spatial understanding through novel positional tie-breaker embeddings and
cross-attention layers, enabling effective segmentation of multiple instances.
iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities,
generalizing well on unseen categories and an open vocabulary of concepts
without additional training. Extensive experiments demonstrate that
OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic,
instance, and panoptic segmentation across multiple benchmarks. Code is
available at https://github.com/GinnyXiao/OpenWorldSAM.
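The idea behind positional tie-breaker embeddings can be sketched as follows: when a
single language prompt must produce several instance queries, each query receives the
same semantic embedding plus a distinct learned offset so the decoder can tell instances
apart. The shapes and names below are assumptions, not the OpenWorldSAM implementation.

    import torch
    import torch.nn as nn

    class TieBreakerQueries(nn.Module):
        def __init__(self, dim=256, max_instances=20):
            super().__init__()
            self.tie_breakers = nn.Embedding(max_instances, dim)   # learned per-slot offsets

        def forward(self, text_emb, num_instances):
            """text_emb: (dim,) pooled VLM embedding of one language prompt."""
            idx = torch.arange(num_instances)
            return text_emb.unsqueeze(0) + self.tie_breakers(idx)  # (num_instances, dim)

    queries = TieBreakerQueries()(torch.randn(256), num_instances=3)
    print(queries.shape)  # torch.Size([3, 256])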
♻ ☆ Generative diffusion model surrogates for mechanistic agent-based biological models
Tien Comlekoglu, J. Quetzalcoatl Toledo-Marín, Douglas W. DeSimone, Shayn M. Peirce, Geoffrey Fox, James A. Glazier
Mechanistic, multicellular, agent-based models are commonly used to
investigate tissue, organ, and organism-scale biology at single-cell
resolution. The Cellular-Potts Model (CPM) is a powerful and popular framework
for developing and interrogating these models. CPMs become computationally
expensive at large space and time scales, making application and investigation
of developed models difficult. Surrogate models may allow for the accelerated
evaluation of CPMs of complex biological systems. However, the stochastic
nature of these models means each set of parameters may give rise to different
model configurations, complicating surrogate model development. In this work,
we leverage denoising diffusion probabilistic models to train a generative AI
surrogate of a CPM used to investigate in vitro vasculogenesis. We describe the
use of an image classifier to learn the characteristics that define unique
areas of a 2-dimensional parameter space. We then apply this classifier to aid
in surrogate model selection and verification. Our CPM model surrogate
generates model configurations 20,000 timesteps ahead of a reference
configuration and demonstrates approximately a 22x reduction in computational
time compared to native code execution. Our work represents a step towards
the implementation of DDPMs to develop digital twins of stochastic biological
systems.
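The classifier-based verification step can be sketched as follows: a small classifier
maps simulation snapshots to their region of parameter space, and surrogate samples are
accepted when they are classified into the region they were conditioned on. The tiny
CNN, snapshot size, and four-region split are illustrative assumptions, not the paper's
setup.

    import torch
    import torch.nn as nn

    classifier = nn.Sequential(                        # tiny CNN over 1-channel
        nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),          # 64x64 CPM snapshots
        nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 4))  # 4 parameter regions

    def verify(surrogate_samples, target_region):
        """Fraction of surrogate samples classified into the intended region."""
        preds = classifier(surrogate_samples).argmax(-1)
        return (preds == target_region).float().mean().item()

    print(verify(torch.randn(8, 1, 64, 64), target_region=2))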
♻ ☆ Sherlock: Self-Correcting Reasoning in Vision-Language Models NeurIPS 2025
Reasoning Vision-Language Models (VLMs) have shown promising performance on
complex multimodal tasks. However, they still face significant challenges: they
are highly sensitive to reasoning errors, require large volumes of annotated
data or accurate verifiers, and struggle to generalize beyond specific domains.
To address these limitations, we explore self-correction as a strategy to
enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning
VLMs' self-correction abilities and identify key gaps. Based on our findings,
we introduce Sherlock, a self-correction and self-improvement training
framework. Sherlock introduces a trajectory-level self-correction objective, a
preference data construction method based on visual perturbation, and a dynamic
$\beta$ for preference tuning. Once the model acquires self-correction
capabilities using only 20k randomly sampled annotated examples, it continues to
self-improve without external supervision. Built on the Llama3.2-Vision-11B
model, Sherlock achieves remarkable results across eight benchmarks, reaching
an average accuracy of 64.1 with direct generation and 65.4 after
self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and
LlamaV-o1 (63.4) while using less than 20% of the annotated data.
comment: Published at NeurIPS 2025, 27 pages
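A per-sample dynamic beta can be sketched on top of a standard DPO-style preference
loss, as below; the rule used here (scaling beta with a perturbation-strength score) is
an assumed schedule for illustration, not Sherlock's exact formulation.

    import torch
    import torch.nn.functional as F

    def dynamic_beta_dpo(logp_chosen, logp_rejected,
                         ref_logp_chosen, ref_logp_rejected,
                         strength, base_beta=0.1):
        """All tensors are (batch,); strength in [0, 1] modulates beta per sample."""
        beta = base_beta * (1.0 + strength)            # assumed per-sample schedule
        margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
        return -F.logsigmoid(beta * margin).mean()

    loss = dynamic_beta_dpo(torch.randn(4), torch.randn(4),
                            torch.randn(4), torch.randn(4),
                            strength=torch.rand(4))
    print(float(loss))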
♻ ☆ REOrdering Patches Improves Vision Models NeurIPS 2025
Sequence models such as transformers require inputs to be represented as
one-dimensional sequences. In vision, this typically involves flattening images
using a fixed row-major (raster-scan) order. While full self-attention is
permutation-equivariant, modern long-sequence transformers increasingly rely on
architectural approximations that break this invariance and introduce
sensitivity to patch ordering. We show that patch order significantly affects
model performance in such settings, with simple alternatives like column-major
or Hilbert curves yielding notable accuracy shifts. Motivated by this, we
propose REOrder, a two-stage framework for discovering task-optimal patch
orderings. First, we derive an information-theoretic prior by evaluating the
compressibility of various patch sequences. Then, we learn a policy over
permutations by optimizing a Plackett-Luce model with REINFORCE. This
approach enables efficient learning in a combinatorial permutation space.
REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to
3.01% and Functional Map of the World by 13.35%.
comment: Accepted to the 39th Conference on Neural Information Processing
Systems (NeurIPS 2025)
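A minimal sketch of the second stage, under standard assumptions: a permutation is
sampled from a Plackett-Luce distribution over patch scores via Gumbel-perturbed
arg-sort, its log-probability is computed in closed form, and REINFORCE updates the
scores with a task reward. The reward here is a placeholder, and the single-sample
update omits the baselines and batching a real training loop would use.

    import torch

    scores = torch.zeros(16, requires_grad=True)        # one logit per patch position
    opt = torch.optim.Adam([scores], lr=1e-2)

    def sample_permutation(s):
        g = -torch.log(-torch.log(torch.rand_like(s)))  # Gumbel(0, 1) noise
        perm = torch.argsort(s + g, descending=True)    # sampled patch ordering
        ordered = s[perm]
        # Plackett-Luce log-probability of drawing the patches in exactly this order
        logp = (ordered - torch.logcumsumexp(ordered.flip(0), 0).flip(0)).sum()
        return perm, logp

    perm, logp = sample_permutation(scores)
    reward = torch.rand(())                             # stand-in for task accuracy
    opt.zero_grad()
    (-(reward * logp)).backward()                       # REINFORCE step on E[reward]
    opt.step()
    print(perm[:5], float(logp))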
♻ ☆ Spiking Neural Networks Need High Frequency Information
Spiking Neural Networks promise brain-inspired and energy-efficient
computation by transmitting information through binary (0/1) spikes. Yet, their
performance still lags behind that of artificial neural networks, a gap often assumed
to result from information loss caused by sparse and binary activations. In
this work, we challenge this long-standing assumption and reveal a previously
overlooked frequency bias: spiking neurons inherently suppress high-frequency
components and preferentially propagate low-frequency information. This
frequency-domain imbalance, we argue, is the root cause of degraded feature
representation in SNNs. Empirically, on Spiking Transformers, adopting
Avg-Pooling (low-pass) for token mixing lowers performance to 76.73% on
CIFAR-100, whereas replacing it with Max-Pooling (high-pass) pushes the top-1
accuracy to 79.12%. Accordingly, we introduce Max-Former, which restores
high-frequency signals through two frequency-enhancing operators: (1) extra
Max-Pool in patch embedding, and (2) Depth-Wise Convolution in place of
self-attention. Notably, Max-Former attains 82.39% top-1 accuracy on ImageNet
using only 63.99M parameters, surpassing Spikformer (74.81%, 66.34M) by +7.58%.
Extending our insight beyond transformers, our Max-ResNet-18 achieves
state-of-the-art performance on convolution-based benchmarks: 97.17% on
CIFAR-10 and 83.06% on CIFAR-100. We hope this simple yet effective solution
inspires future research to explore the distinctive nature of spiking neural
networks. Code is available: https://github.com/bic-L/MaxFormer.
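The contrast between the two token mixers can be sketched directly: an average-pooling
(low-pass) mixer versus a max-pooling mixer that preserves high-frequency content, both
applied over a 2D token grid. The kernel size and residual form are assumptions, not the
exact Max-Former block.

    import torch
    import torch.nn as nn

    class PoolTokenMixer(nn.Module):
        def __init__(self, kind="max", k=3):
            super().__init__()
            pool = nn.MaxPool2d if kind == "max" else nn.AvgPool2d
            self.pool = pool(kernel_size=k, stride=1, padding=k // 2)

        def forward(self, x):                      # x: (B, C, H, W) grid of tokens
            return x + self.pool(x)                # residual mixing (assumed form)

    x = torch.randn(2, 64, 14, 14)
    print(PoolTokenMixer("max")(x).shape, PoolTokenMixer("avg")(x).shape)

Averaging attenuates local extrema (edges, fine texture), whereas the max operator
keeps them, which is the frequency argument the abstract makes.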
♻ ☆ Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
Xiao He, Huangxuan Zhao, Guojia Wan, Wei Zhou, Yanxing Liu, Juhua Liu, Yongchao Xu, Yong Luo, Dacheng Tao, Bo Du
Recent medical vision-language models have shown promise on tasks such as
VQA, report generation, and anomaly detection. However, most are adapted to
structured adult imaging and underperform in fetal ultrasound, which poses
challenges in multi-view image reasoning, a broad disease spectrum, and high image
diversity. To bridge this gap, we introduce FetalMind, a medical AI system
tailored to fetal ultrasound for both report generation and diagnosis. Guided
by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which
injects an expert-curated bipartite graph into the model to decouple
view-disease associations and to steer preference selection along clinically
faithful steps via reinforcement learning. This design mitigates variability
across diseases and heterogeneity across views, reducing learning bottlenecks
while aligning the model's inference with obstetric practice. To train
FetalMind at scale, we curate the FetalSigma-1M dataset, the first large-scale
fetal ultrasound report corpus, comprising 20K reports from twelve medical
centers, addressing the scarcity of domain data. Extensive experiments show
that FetalMind outperforms open- and closed-source baselines across all
gestational stages, achieving +14% average gains and +61.2% higher accuracy on
critical conditions while remaining efficient, stable, and scalable. Project
Page: https://hexiao0275.github.io/FetalMind.
comment: This paper contains fundamental errors and will not be replaced
♻ ☆ Panoptic-CUDAL: Rural Australia Point Cloud Dataset in Rainy Conditions
Tzu-Yun Tseng, Alexey Nekrasov, Malcolm Burdorf, Bastian Leibe, Julie Stephany Berrio, Mao Shan, Zhenxing Ming, Stewart Worrall
Existing autonomous driving datasets are predominantly oriented towards
well-structured urban settings and favourable weather conditions, leaving the
complexities of rural environments and adverse weather conditions largely
unaddressed. Although some datasets encompass variations in weather and
lighting, adverse weather scenarios appear only rarely. Rainfall can significantly
impair sensor functionality, introducing noise and reflections in LiDAR and
camera data and reducing the system's capabilities for reliable environmental
perception and safe navigation. This paper introduces the Panoptic-CUDAL
dataset, a novel dataset purpose-built for panoptic segmentation in rural areas
subject to rain. By recording high-resolution LiDAR, camera, and pose data,
Panoptic-CUDAL offers a diverse, information-rich dataset in a challenging
scenario. We present the analysis of the recorded data and provide baseline
results for panoptic segmentation, semantic segmentation, and 3D occupancy prediction
methods on LiDAR point clouds. The dataset can be found here:
https://robotics.sydney.edu.au/our-research/intelligent-transportation-systems,
https://vision.rwth-aachen.de/panoptic-cudal
♻ ☆ SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model
High-resolution (HR) remote sensing imagery plays a vital role in a wide
range of applications, including urban planning and environmental monitoring.
However, due to limitations in sensors and data transmission links, the images
acquired in practice often suffer from resolution degradation. Remote Sensing
Image Super-Resolution (RSISR) aims to reconstruct HR images from
low-resolution (LR) inputs, providing a cost-effective and efficient
alternative to direct HR image acquisition. Existing RSISR methods primarily
focus on low-level characteristics in pixel space, while neglecting the
high-level understanding of remote sensing scenes. This may lead to
semantically inconsistent artifacts in the reconstructed results. Motivated by
this observation, our work aims to explore the role of high-level semantic
knowledge in improving RSISR performance. We propose a Semantic-Guided
Super-Resolution framework, SeG-SR, which leverages Vision-Language Models
(VLMs) to extract semantic knowledge from input images and uses it to guide the
super-resolution (SR) process. Specifically, we first design a Semantic Feature
Extraction Module (SFEM) that utilizes a pretrained VLM to extract semantic
knowledge from remote sensing images. Next, we propose a Semantic Localization
Module (SLM), which derives a series of semantic guidance from the extracted
semantic knowledge. Finally, we develop a Learnable Modulation Module (LMM)
that uses semantic guidance to modulate the features extracted by the SR
network, effectively incorporating high-level scene understanding into the SR
pipeline. We validate the effectiveness and generalizability of SeG-SR through
extensive experiments: SeG-SR achieves state-of-the-art performance on three
datasets, and consistently improves performance across various SR
architectures. Notably, for the x4 SR task on the UCMerced dataset, it attains a
PSNR of 29.3042 dB and an SSIM of 0.7961.
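The modulation idea can be sketched as a FiLM-style transform: the semantic guidance
vector is projected to per-channel scale and shift terms that modulate SR backbone
features. The dimensions and exact modulation form below are assumptions, not the
paper's LMM.

    import torch
    import torch.nn as nn

    class SemanticModulation(nn.Module):
        def __init__(self, feat_ch=64, sem_dim=512):
            super().__init__()
            self.to_scale_shift = nn.Linear(sem_dim, 2 * feat_ch)

        def forward(self, feats, sem):             # feats: (B, C, H, W), sem: (B, D)
            scale, shift = self.to_scale_shift(sem).chunk(2, dim=-1)
            return feats * (1 + scale[..., None, None]) + shift[..., None, None]

    out = SemanticModulation()(torch.randn(2, 64, 32, 32), torch.randn(2, 512))
    print(out.shape)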
♻ ☆ RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration
Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li
Existing safety evaluation methods for large language models (LLMs) suffer
from inherent limitations, including evaluator bias and detection failures
arising from model homogeneity, which collectively undermine the robustness of
risk evaluation processes. This paper seeks to re-examine the risk evaluation
paradigm by introducing a theoretical framework that reconstructs the
underlying risk concept space. Specifically, we decompose the latent risk
concept space into three mutually exclusive subspaces: the explicit risk
subspace (encompassing direct violations of safety guidelines), the implicit
risk subspace (capturing potential malicious content that requires contextual
reasoning for identification), and the non-risk subspace. Furthermore, we
propose RADAR, a multi-agent collaborative evaluation framework that leverages
multi-round debate mechanisms through four specialized complementary roles and
employs dynamic update mechanisms to achieve self-evolution of risk concept
distributions. This approach enables comprehensive coverage of both explicit
and implicit risks while mitigating evaluator bias. To validate the
effectiveness of our framework, we construct an evaluation dataset comprising
800 challenging cases. Extensive experiments on this test set and
public benchmarks demonstrate that RADAR significantly outperforms baseline
evaluation methods across multiple dimensions, including accuracy, stability,
and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87%
improvement in risk identification accuracy compared to the strongest baseline
evaluation method.
♻ ☆ FuseUNet: A Multi-Scale Feature Fusion Method for U-like Networks
Medical image segmentation is a critical task in computer vision, with UNet
serving as a milestone architecture. A typical component of the UNet family is
the skip connection; however, skip connections face two significant
limitations: (1) they lack effective interaction between features at different
scales, and (2) they rely on simple concatenation or addition operations, which
constrain efficient information integration. While recent improvements to UNet
have focused on enhancing encoder and decoder capabilities, these limitations
remain overlooked. To overcome these challenges, we propose a novel multi-scale
feature fusion method that reimagines the UNet decoding process as solving an
initial value problem (IVP), treating skip connections as discrete nodes. By
leveraging principles from the linear multistep method, we propose an adaptive
ordinary differential equation method to enable effective multi-scale feature
fusion. Our approach is independent of the encoder and decoder architectures,
making it adaptable to various U-Net-like networks. Experiments on ACDC,
KiTS2023, MSD brain tumor, and ISIC2017/2018 skin lesion segmentation datasets
demonstrate improved feature utilization, reduced network parameters, and
maintained high performance. The code is available at
https://github.com/nayutayuki/FuseUNet.
comment: Updated author information to clarify institutional affiliation. The
research was conducted prior to the author joining the University of Maryland
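The multistep view can be sketched with a two-step Adams-Bashforth update, where skip
features act as sampled derivatives: y_{n+1} = y_n + h * (1.5 * f_n - 0.5 * f_{n-1}).
The fixed step size, shared feature shape, and 1x1 projection below are assumptions,
not FuseUNet's adaptive scheme.

    import torch
    import torch.nn as nn

    class MultistepFuse(nn.Module):
        def __init__(self, ch=64, h=1.0):
            super().__init__()
            self.h = h
            self.proj = nn.Conv2d(ch, ch, kernel_size=1)

        def forward(self, state, skip_t, skip_prev):
            """All inputs (B, C, H, W); skips assumed already resampled to one scale."""
            deriv = 1.5 * skip_t - 0.5 * skip_prev      # Adams-Bashforth-2 combination
            return state + self.h * self.proj(deriv)    # y_{n+1} = y_n + h * f_combined

    y = MultistepFuse()(torch.randn(1, 64, 32, 32),
                        torch.randn(1, 64, 32, 32),
                        torch.randn(1, 64, 32, 32))
    print(y.shape)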
♻ ☆ A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition
Recently, action recognition has been dominated by transformer-based methods,
thanks to their spatiotemporal contextual aggregation capacities. However,
despite the significant progress achieved on scene-related datasets, they do
not perform well on motion-sensitive datasets due to the lack of elaborate
motion modeling designs. Meanwhile, we observe that the widely-used cost volume
in traditional action recognition is highly similar to the affinity matrix
defined in self-attention, but equipped with powerful motion modeling
capacities. In light of this, we propose to integrate those effective motion
modeling properties into the existing transformer in a unified and neat way
by proposing the Explicit Motion Information Mining (EMIM) module. In
EMIM, we propose to construct the desirable affinity matrix in a cost volume
style, where the set of key candidate tokens is sampled from the query-based
neighboring area in the next frame in a sliding-window manner. Then, the
constructed affinity matrix is used to aggregate contextual information for
appearance modeling and is converted into motion features for motion modeling
as well. We validate the motion modeling capacities of our method on four
widely-used datasets, and our method performs better than existing
state-of-the-art approaches, especially on motion-sensitive datasets, i.e.,
Something-Something V1 & V2. Our project is available at
https://github.com/PeiqinZhuang/EMIM.
comment: accepted by Pattern Recognition. We have always been curious to see
whether our designs could be beneficial in other scenarios, such as embedding
it into the DiT model or 3D-VAE for video generation. If you are interested
in it, why not give it a shot?
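A cost-volume style affinity restricted to a local window can be sketched with F.unfold:
each query location is correlated only with keys inside a small window at the same
position in the next frame. The window size, scaling, and softmax normalization are
assumptions, not the exact EMIM module.

    import torch
    import torch.nn.functional as F

    def local_affinity(q, k, win=5):
        """q, k: (B, C, H, W) features of frames t and t+1 -> (B, H*W, win*win)."""
        B, C, H, W = q.shape
        cand = F.unfold(k, win, padding=win // 2).view(B, C, win * win, H * W)
        aff = torch.einsum("bcp,bcnp->bpn", q.flatten(2), cand) / C ** 0.5
        return aff.softmax(-1)          # attention over the local window

    aff = local_affinity(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16))
    print(aff.shape)                    # torch.Size([1, 256, 25])

The same affinity can then be read in two ways: as attention weights for appearance
aggregation, or as a correlation volume from which motion features are derived.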
♻ ☆ Learning To Defer To A Population With Limited Demonstrations IEEE
This paper addresses the critical data scarcity that hinders the practical
deployment of learning-to-defer (L2D) systems to a population of experts. We introduce a
context-aware, semi-supervised framework that uses meta-learning to generate
expert-specific embeddings from only a few demonstrations. We demonstrate the
efficacy of a dual-purpose mechanism, where these embeddings are used first to
generate a large corpus of pseudo-labels for training, and subsequently to
enable on-the-fly adaptation to new experts at test time. The experimental
results on three different datasets confirm that a model trained on these
synthetic labels rapidly approaches oracle-level performance, validating the
data efficiency of our approach. By resolving a key training bottleneck, this
work makes adaptive L2D systems more practical and scalable, paving the way for
human-AI collaboration in real-world environments. To facilitate
reproducibility and address implementation details not covered in the main
text, we provide our source code and training configurations at
https://github.com/nil123532/learning-to-defer-to-a-population-with-limited-demonstrations.
comment: Accepted to IEEE DICTA 2025 (poster). 7 pages, 2 figures
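One way to picture the context-conditioning idea is below: an expert embedding is pooled
from a handful of encoded (input, decision) demonstrations and then conditions the
rejector that decides whether to defer. The mean-pooled linear encoder and linear
rejector are simplifying assumptions; the paper uses a meta-learned embedding, so this
is only a rough sketch.

    import torch
    import torch.nn as nn

    enc = nn.Linear(10 + 1, 32)                 # encode one (features, expert decision) pair
    rejector = nn.Linear(10 + 32, 1)            # defer score given the expert context

    def expert_embedding(demo_x, demo_y):       # demo_x: (K, 10), demo_y: (K,)
        pairs = torch.cat([demo_x, demo_y[:, None]], dim=-1)
        return enc(pairs).mean(0)               # simple mean pooling over K demos

    def defer_probability(x, emb):              # x: (B, 10)
        ctx = emb.expand(len(x), -1)
        return torch.sigmoid(rejector(torch.cat([x, ctx], dim=-1)))

    emb = expert_embedding(torch.randn(5, 10), torch.randint(0, 2, (5,)).float())
    print(defer_probability(torch.randn(4, 10), emb).shape)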
♻ ☆ Novel Class Discovery for Point Cloud Segmentation via Joint Learning of Causal Representation and Reasoning NeurIPS 2025
In this paper, we focus on Novel Class Discovery for Point Cloud Segmentation
(3D-NCD), aiming to learn a model that can segment unlabeled (novel) 3D classes
using only the supervision from labeled (base) 3D classes. The key to this task
is to set up the exact correlations between the point representations and their
base class labels, as well as the representation correlations between the
points from base and novel classes. Coarse or purely statistical correlation
learning may lead to confusion in novel class inference. If we impose a
causal relationship as a strong constraint upon the learning
process, the essential point cloud representations that accurately correspond
to the classes should be uncovered. To this end, we introduce a structural
causal model (SCM) to re-formalize the 3D-NCD problem and propose a new method,
i.e., Joint Learning of Causal Representation and Reasoning. Specifically, we
first analyze hidden confounders in the base class representations and the
causal relationships between the base and novel classes through SCM. We devise
a causal representation prototype that eliminates confounders to capture the
causal representations of base classes. A graph structure is then used to model
the causal relationships between the base classes' causal representation
prototypes and the novel class prototypes, enabling causal reasoning from base
to novel classes. Extensive experiments and visualization results on 3D and 2D
NCD semantic segmentation demonstrate the superiority of our method.
comment: Accepted by NeurIPS 2025
♻ ☆ Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach
Yunuo Chen, Junli Cao, Vidit Goel, Sergei Korolev, Chenfanfu Jiang, Jian Ren, Sergey Tulyakov, Anil Kag
We present a novel video generation framework that integrates 3-dimensional
geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D
point trajectories and align them in pixel space. The resulting 3D-aware video
dataset, PointVid, is then used to fine-tune a latent diffusion model, enabling
it to track 2D objects with 3D Cartesian coordinates. Building on this, we
regularize the shape and motion of objects in the video to eliminate undesired
artifacts, e.g., non-physical deformation. Consequently, we enhance the quality
of generated RGB videos and alleviate common issues like object morphing, which
are prevalent in current video models due to a lack of shape awareness. With
our 3D augmentation and regularization, our model is capable of handling
contact-rich scenarios such as task-oriented videos, where 3D information is
essential for perceiving shape and motion of interacting solids. Our method can
be seamlessly integrated into existing video diffusion models to improve their
visual plausibility.
comment: Project Page: https://snap-research.github.io/PointVidGen/
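One simple form of such a regularizer penalizes changes in pairwise distances between
tracked 3D points over time, which discourages non-physical deformation of a rigid
object. The loss below is an illustrative assumption, not the paper's exact
regularization term.

    import torch

    def rigidity_loss(tracks):
        """tracks: (T, N, 3) 3D trajectories of N points on one object."""
        d = torch.cdist(tracks, tracks)          # (T, N, N) pairwise distances
        return (d - d[:1]).abs().mean()          # deviation from the first frame

    print(float(rigidity_loss(torch.randn(8, 16, 3))))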