Computer Vision and Pattern Recognition 108
☆ LiDPM: Rethinking Point Diffusion for Lidar Scene Completion IEEE
Training diffusion models that work directly on lidar points at the scale of
outdoor scenes is challenging due to the difficulty of generating fine-grained
details from white noise over a broad field of view. The latest works
addressing scene completion with diffusion models tackle this problem by
reformulating the original DDPM as a local diffusion process. This contrasts with
the common practice of operating at the level of objects, where vanilla DDPMs
are currently used. In this work, we close the gap between these two lines of
work. We identify approximations in the local diffusion formulation, show that
they are not required to operate at the scene level, and that a vanilla DDPM
with a well-chosen starting point is enough for completion. Finally, we
demonstrate that our method, LiDPM, leads to better results in scene completion
on SemanticKITTI. The project page is https://astra-vision.github.io/LiDPM.
comment: Accepted to IEEE IV 2025
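As an illustration of the "well-chosen starting point" idea, the sketch below (our own, not the authors' code) runs vanilla DDPM ancestral sampling but begins the reverse process at an intermediate timestep from a noised version of an initial point cloud; `denoiser`, `alphas_cumprod`, and `t_start` are placeholders.

```python
# Minimal sketch (not the authors' code) of completing with a vanilla DDPM by
# starting the reverse process at an intermediate timestep t_start from a noised
# version of an initial point cloud, instead of from pure white noise at t = T.
# `denoiser` and `alphas_cumprod` (with alphas_cumprod[0] == 1) are placeholders.
import torch

def ddpm_complete(denoiser, x_init, alphas_cumprod, t_start):
    """x_init: (N, 3) initial point cloud; the paper's exact choice may differ."""
    a_bar = alphas_cumprod[t_start]
    # Diffuse the starting point forward to t_start, i.e. sample q(x_t | x_0).
    x_t = a_bar.sqrt() * x_init + (1 - a_bar).sqrt() * torch.randn_like(x_init)
    # Standard DDPM ancestral sampling from t_start down to 0.
    for t in range(t_start, 0, -1):
        a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        alpha_t = a_bar_t / a_bar_prev
        eps = denoiser(x_t, t)  # predicted noise
        mean = (x_t - (1 - alpha_t) / (1 - a_bar_t).sqrt() * eps) / alpha_t.sqrt()
        sigma = ((1 - a_bar_prev) / (1 - a_bar_t) * (1 - alpha_t)).sqrt()
        noise = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
        x_t = mean + sigma * noise
    return x_t
```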
☆ Dynamic Camera Poses and Where to Find Them CVPR 2025
Annotating camera poses on dynamic Internet videos at scale is critical for
advancing fields like realistic video generation and simulation. However,
collecting such a dataset is difficult, as most Internet videos are unsuitable
for pose estimation. Furthermore, annotating dynamic Internet videos presents
significant challenges even for state-of-the-art methods. In this paper, we
introduce DynPose-100K, a large-scale dataset of dynamic Internet videos
annotated with camera poses. Our collection pipeline addresses filtering using
a carefully combined set of task-specific and generalist models. For pose
estimation, we combine the latest techniques of point tracking, dynamic
masking, and structure-from-motion to achieve improvements over the
state-of-the-art approaches. Our analysis and experiments demonstrate that
DynPose-100K is both large-scale and diverse across several key attributes,
opening up avenues for advancements in various downstream applications.
comment: Accepted to CVPR 2025. Project Page:
https://research.nvidia.com/labs/dir/dynpose-100k
☆ Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu
Autoregressive (AR) models, long dominant in language generation, are
increasingly applied to image synthesis but are often considered less
competitive than Diffusion-based models. A primary limitation is the
substantial number of image tokens required for AR models, which constrains
both training and inference efficiency, as well as image resolution. To address
this, we present Token-Shuffle, a novel yet simple method that reduces the
number of image tokens in the Transformer. Our key insight is the dimensional
redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs),
where low-dimensional visual codes from the visual encoder are directly mapped to
high-dimensional language vocabularies. Leveraging this, we consider two key
operations: token-shuffle, which merges spatially local tokens along the channel
dimension to decrease the input token number, and token-unshuffle, which
untangles the inferred tokens after Transformer blocks to restore the spatial
arrangement for output. Jointly training with textual prompts, our strategy
requires no additional pretrained text-encoder and enables MLLMs to support
extremely high-resolution image synthesis in a unified next-token prediction
way while maintaining efficient training and inference. For the first time, we
push the boundary of AR text-to-image generation to a resolution of 2048x2048
with gratifying generation performance. On the GenAI benchmark, our 2.7B model
achieves an overall score of 0.77 on hard prompts, outperforming the AR model LlamaGen
by 0.18 and the diffusion model LDM by 0.15. Exhaustive large-scale human
evaluations also demonstrate our strong image generation ability in terms of
text alignment, visual flaws, and visual appearance. We hope that Token-Shuffle
can serve as a foundational design for efficient high-resolution image
generation within MLLMs.
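The operation pair can be pictured as a pixel-shuffle analogue on token sequences. The sketch below is our interpretation under that assumption, not the released implementation; shapes and the shuffle factor `s` are illustrative.

```python
# Minimal sketch (an interpretation, not the released implementation) of the
# token-shuffle / token-unshuffle pair: spatially local tokens are merged along
# the channel dimension before the Transformer blocks and restored afterwards.
import torch

def token_shuffle(tokens, h, w, s=2):
    """tokens: (B, h*w, C) -> (B, h*w/s^2, C*s^2)."""
    B, _, C = tokens.shape
    x = tokens.view(B, h // s, s, w // s, s, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, (h // s) * (w // s), C * s * s)

def token_unshuffle(tokens, h, w, s=2):
    """Inverse: (B, h*w/s^2, C*s^2) -> (B, h*w, C)."""
    B, _, Cs = tokens.shape
    C = Cs // (s * s)
    x = tokens.view(B, h // s, w // s, s, s, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, h * w, C)

x = torch.randn(1, 16 * 16, 64)       # 256 tokens, 64 channels
y = token_shuffle(x, 16, 16)          # 64 merged tokens, 256 channels
assert torch.allclose(token_unshuffle(y, 16, 16), x)  # lossless round trip
```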
☆ The Fourth Monocular Depth Estimation Challenge CVPR
Anton Obukhov, Matteo Poggi, Fabio Tosi, Ripudaman Singh Arora, Jaime Spencer, Chris Russell, Simon Hadfield, Richard Bowden, Shuaihang Wang, Zhenxin Ma, Weijie Chen, Baobei Xu, Fengyu Sun, Di Xie, Jiang Zhu, Mykola Lavreniuk, Haining Guan, Qun Wu, Yupei Zeng, Chao Lu, Huanran Wang, Guangyuan Zhou, Haotian Zhang, Jianxiong Wang, Qiang Rao, Chunjie Wang, Xiao Liu, Zhiqiang Lou, Hualie Jiang, Yihao Chen, Rui Xu, Minglang Tan, Zihan Qin, Yifan Mao, Jiayang Liu, Jialei Xu, Yifan Yang, Wenbo Zhao, Junjun Jiang, Xianming Liu, Mingshuai Zhao, Anlong Ming, Wu Chen, Feng Xue, Mengying Yu, Shida Gao, Xiangfeng Wang, Gbenga Omotara, Ramy Farag, Jacket Demby, Seyed Mohamad Ali Tousi, Guilherme N DeSouza, Tuan-Anh Yang, Minh-Quang Nguyen, Thien-Phuc Tran, Albert Luginov, Muhammad Shahzad
This paper presents the results of the fourth edition of the Monocular Depth
Estimation Challenge (MDEC), which focuses on zero-shot generalization to the
SYNS-Patches benchmark, a dataset featuring challenging environments in both
natural and indoor settings. In this edition, we revised the evaluation
protocol to use least-squares alignment with two degrees of freedom to support
disparity and affine-invariant predictions. We also revised the baselines and
included popular off-the-shelf methods: Depth Anything v2 and Marigold. The
challenge received a total of 24 submissions that outperformed the baselines on
the test set; 10 of these included a report describing their approach, with
most leading methods relying on affine-invariant predictions. The challenge
winners improved the 3D F-Score over the previous edition's best result,
raising it from 22.58% to 23.05%.
comment: To appear in CVPRW2025
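For readers unfamiliar with the protocol, the snippet below sketches what a two-degree-of-freedom least-squares alignment (scale and shift against ground-truth disparity) can look like; it is an illustration of the general idea, not the official MDEC evaluation code.

```python
# Illustrative sketch (not the official evaluation code): align a prediction to
# ground-truth disparity with two degrees of freedom (scale s and shift t) by
# ordinary least squares over valid pixels, which accommodates both disparity
# and affine-invariant predictions.
import numpy as np

def align_two_dof(pred, gt_disparity, valid):
    p = pred[valid].ravel()
    g = gt_disparity[valid].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)        # columns: [pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)    # minimize ||s*p + t - g||^2
    return s * pred + t

# toy usage
pred = np.random.rand(192, 640)
gt = 2.5 * pred + 0.1
aligned = align_two_dof(pred, gt, gt > 0)
```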
☆ Step1X-Edit: A Practical Framework for General Image Editing
Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang
In recent years, image editing models have witnessed remarkable and rapid
development. The recent unveiling of cutting-edge multimodal models such as
GPT-4o and Gemini2 Flash has introduced highly promising image editing
capabilities. These models demonstrate an impressive aptitude for fulfilling a
vast majority of user-driven editing requirements, marking a significant
advancement in the field of image manipulation. However, there is still a large
gap between open-source algorithms and these closed-source models. Thus, in
this paper, we aim to release a state-of-the-art image editing model, called
Step1X-Edit, which provides performance comparable to closed-source
models like GPT-4o and Gemini2 Flash. More specifically, we adopt a
multimodal LLM to process the reference image and the user's editing
instruction. A latent embedding is extracted and integrated with a
diffusion image decoder to obtain the target image. To train the model, we
build a data generation pipeline to produce a high-quality dataset. For
evaluation, we develop the GEdit-Bench, a novel benchmark rooted in real-world
user instructions. Experimental results on GEdit-Bench demonstrate that
Step1X-Edit outperforms existing open-source baselines by a substantial margin
and approaches the performance of leading proprietary models, thereby making
significant contributions to the field of image editing.
comment: code: https://github.com/stepfun-ai/Step1X-Edit
☆ EgoCHARM: Resource-Efficient Hierarchical Activity Recognition using an Egocentric IMU Sensor
Akhil Padmanabha, Saravanan Govindarajan, Hwanmun Kim, Sergio Ortiz, Rahul Rajan, Doruk Senkal, Sneha Kadetotad
Human activity recognition (HAR) on smartglasses has various use cases,
including health/fitness tracking and input for context-aware AI assistants.
However, current approaches for egocentric activity recognition suffer from low
performance or are resource-intensive. In this work, we introduce a resource
(memory, compute, power, sample) efficient machine learning algorithm,
EgoCHARM, for recognizing both high level and low level activities using a
single egocentric (head-mounted) Inertial Measurement Unit (IMU). Our
hierarchical algorithm employs a semi-supervised learning strategy, requiring
primarily high level activity labels for training, to learn generalizable low
level motion embeddings that can be effectively utilized for low level activity
recognition. We evaluate our method on 9 high level and 3 low level activities,
achieving F1 scores of 0.826 and 0.855 on high level and low level activity
recognition respectively, with just 63k high level and 22k low level model
parameters, allowing the low level encoder to be deployed directly on current
IMU chips with on-chip compute. Lastly, we present results and insights from a
sensitivity analysis and highlight the opportunities and limitations of
activity recognition using egocentric IMUs.
☆ DPMambaIR: All-in-One Image Restoration via Degradation-Aware Prompt State Space Model
All-in-One image restoration aims to address multiple image degradation
problems using a single model, significantly reducing training costs and
deployment complexity compared to traditional methods that design dedicated
models for each degradation type. Existing approaches typically rely on
degradation-specific models or coarse-grained degradation prompts to guide
image restoration. However, they lack fine-grained modeling of degradation
information and face limitations in balancing multi-task conflicts. To overcome
these limitations, we propose DPMambaIR, a novel All-in-One image restoration
framework. By integrating a Degradation-Aware Prompt State Space Model (DP-SSM)
and a High-Frequency Enhancement Block (HEB), DPMambaIR enables fine-grained
modeling of complex degradation information and efficient global integration,
while mitigating the loss of high-frequency details caused by task competition.
Specifically, the DP-SSM utilizes a pre-trained degradation extractor to
capture fine-grained degradation features and dynamically incorporates them
into the state space modeling process, enhancing the model's adaptability to
diverse degradation types. Concurrently, the HEB supplements high-frequency
information, effectively addressing the loss of critical details, such as edges
and textures, in multi-task image restoration scenarios. Extensive experiments
on a mixed dataset containing seven degradation types show that DPMambaIR
achieves the best performance, with a PSNR of 27.69 dB and an SSIM of 0.893.
These results highlight the potential and superiority of
DPMambaIR as a unified solution for All-in-One image restoration.
☆ CasualHDRSplat: Robust High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos
Recently, photo-realistic novel view synthesis methods for multi-view images, such
as neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS), have garnered
widespread attention due to their superior performance. However, most works
rely on low dynamic range (LDR) images, which limits the capture of richer
scene details. Some prior works have focused on high dynamic range (HDR) scene
reconstruction, but they typically require capturing multi-view sharp images with
different exposure times at fixed camera positions, which
is time-consuming and challenging in practice. For more flexible data
acquisition, we propose a one-stage method, \textbf{CasualHDRSplat}, to easily
and robustly reconstruct a 3D HDR scene from casually captured videos with
auto-exposure enabled, even in the presence of severe motion blur and varying
unknown exposure times. \textbf{CasualHDRSplat} contains a unified
differentiable physical imaging model which applies a continuous-time
trajectory constraint to the imaging process so that we can jointly optimize the
exposure time, camera response function (CRF), camera poses, and sharp 3D HDR
scene. Extensive experiments demonstrate that our approach outperforms existing
methods in terms of robustness and rendering quality. Our source code will be
available at https://github.com/WU-CVGL/CasualHDRSplat
comment: Source Code: https://github.com/WU-CVGL/CasualHDRSplat
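A toy rendering of the described imaging model, under our own assumptions (simple gamma-style CRF, uniform time samples within the exposure window); `render_hdr` and `trajectory` are placeholders, and this is not the released code.

```python
# Toy sketch (assumptions only) of a differentiable physical imaging model in
# the spirit of the abstract: average HDR radiance over poses sampled along a
# continuous-time trajectory within the exposure window (motion blur), scale by
# the exposure time, and apply a learnable camera response function (CRF).
import torch

def simulate_ldr(render_hdr, trajectory, t_open, exposure, crf_gamma, n_samples=8):
    """render_hdr(pose) -> HDR image tensor; trajectory(t) -> camera pose at time t."""
    ts = t_open + exposure * torch.linspace(0.0, 1.0, n_samples)
    hdr = torch.stack([render_hdr(trajectory(t)) for t in ts]).mean(dim=0)
    irradiance = hdr * exposure                         # accumulated energy
    ldr = irradiance.clamp(min=1e-6) ** crf_gamma       # simple gamma-style CRF
    return ldr.clamp(0.0, 1.0)
```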
☆ Generative Fields: Uncovering Hierarchical Feature Control for StyleGAN via Inverted Receptive Fields
StyleGAN has demonstrated the ability of GANs to synthesize highly realistic
faces of imaginary people from random noise. One limitation of GAN-based image
generation is the difficulty of controlling the features of the generated
image, due to the strong entanglement of the low-dimensional latent space.
Previous work that aimed to control StyleGAN with image or text prompts
modulated sampling in W latent space, which is more expressive than Z latent
space. However, W space still has restricted expressivity since it does not
control the feature synthesis directly; also the feature embedding in W space
requires a pre-training process to reconstruct the style signal, limiting its
application. This paper introduces the concept of "generative fields" to
explain the hierarchical feature synthesis in StyleGAN, inspired by the
receptive fields of convolutional neural networks (CNNs). Additionally, we
propose a new image editing pipeline for StyleGAN using generative field theory
and the channel-wise style latent space S, utilizing the intrinsic structural
features of CNNs to achieve disentangled control of feature synthesis at
synthesis time.
☆ Plasma State Monitoring and Disruption Characterization using Multimodal VAEs
Yoeri Poels, Alessandro Pau, Christian Donner, Giulio Romanelli, Olivier Sauter, Cristina Venturini, Vlado Menkovski, the TCV team, the WPTE team
When a plasma disrupts in a tokamak, significant heat and electromagnetic
loads are deposited onto the surrounding device components. These forces scale
with plasma current and magnetic field strength, making disruptions one of the
key challenges for future devices. Unfortunately, disruptions are not fully
understood, with many different underlying causes that are difficult to
anticipate. Data-driven models have shown success in predicting them, but they
only provide limited interpretability. On the other hand, large-scale
statistical analyses have been a great asset to understanding disruptive
patterns. In this paper, we leverage data-driven methods to find an
interpretable representation of the plasma state for disruption
characterization. Specifically, we use a latent variable model to represent
diagnostic measurements as a low-dimensional, latent representation. We build
upon the Variational Autoencoder (VAE) framework, and extend it for (1)
continuous projections of plasma trajectories; (2) a multimodal structure to
separate operating regimes; and (3) separation with respect to disruptive
regimes. Subsequently, we can identify continuous indicators for the disruption
rate and the disruptivity based on statistical properties of measurement data.
The proposed method is demonstrated using a dataset of approximately 1600 TCV
discharges, selecting for flat-top disruptions or regular terminations. We
evaluate the method with respect to (1) the identified disruption risk and its
correlation with other plasma properties; (2) the ability to distinguish
different types of disruptions; and (3) downstream analyses. For the latter, we
conduct a demonstrative study on identifying parameters connected to
disruptions using counterfactual-like analysis. Overall, the method can
adequately identify distinct operating regimes characterized by varying
proximity to disruptions in an interpretable manner.
☆ Hierarchical and Multimodal Data for Daily Activity Understanding
Ghazal Kaviani, Yavuz Yarici, Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib, Mashhour Solh, Ameya Patil
Daily Activity Recordings for Artificial Intelligence (DARai, pronounced
"Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to
understand human activities in real-world settings. DARai consists of
continuous scripted and unscripted recordings of 50 participants in 10
different environments, totaling over 200 hours of data from 20 sensors
including multiple camera views, depth and radar sensors, wearable inertial
measurement units (IMUs), electromyography (EMG), insole pressure sensors,
biomonitor sensors, and a gaze tracker.
To capture the complexity in human activities, DARai is annotated at three
levels of hierarchy: (i) high-level activities (L1) that are independent tasks,
(ii) lower-level actions (L2) that are patterns shared between activities, and
(iii) fine-grained procedures (L3) that detail the exact execution steps for
actions. The dataset annotations and recordings are designed so that 22.7% of
L2 actions are shared between L1 activities and 14.2% of L3 procedures are
shared between L2 actions. The overlap and unscripted nature of DARai allow for
counterfactual activities in the dataset.
Experiments with various machine learning models showcase the value of DARai
in uncovering important challenges in human-centered applications.
Specifically, we conduct unimodal and multimodal sensor fusion experiments for
recognition, temporal localization, and future action anticipation across all
hierarchical annotation levels. To highlight the limitations of individual
sensors, we also conduct domain-variant experiments that are enabled by DARai's
multi-sensor and counterfactual activity design setup.
The code, documentation, and dataset are available at the dedicated DARai
website:
https://alregib.ece.gatech.edu/software-and-datasets/darai-daily-activity-recordings-for-artificial-intelligence-and-machine-learning/
☆ PICO: Reconstructing 3D People In Contact with Objects CVPR'25
Alpár Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Arjun Lakshmipathy, Agniv Chatterjee, Michael J. Black, Dimitrios Tzionas
Recovering 3D Human-Object Interaction (HOI) from single color images is
challenging due to depth ambiguities, occlusions, and the huge variation in
object shape and appearance. Thus, past work requires controlled settings such
as known object shapes and contacts, and tackles only limited object classes.
Instead, we need methods that generalize to natural images and novel object
classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset
of natural images uniquely paired with dense 3D contact on both body and object
meshes. To this end, we use images from the recent DAMON dataset that are
paired with contacts, but these contacts are only annotated on a canonical 3D
body. In contrast, we seek contact labels on both the body and the object. To
infer these given an image, we retrieve an appropriate 3D object mesh from a
database by leveraging vision foundation models. Then, we project DAMON's body
contact patches onto the object via a novel method needing only 2 clicks per
patch. This minimal human input establishes rich contact correspondences
between bodies and objects. (2) We exploit our new dataset of contact
correspondences in a novel render-and-compare fitting method, called PICO-fit,
to recover 3D body and object meshes in interaction. PICO-fit infers contact
for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db
for that object, and uses the contact to iteratively fit the 3D body and object
meshes to image evidence via optimization. Uniquely, PICO-fit works well for
many object categories that no existing method can tackle. This is crucial to
enable HOI understanding to scale in the wild. Our data and code are available
at https://pico.is.tue.mpg.de.
comment: Accepted in CVPR'25. Project Page: https://pico.is.tue.mpg.de
☆ BIM-Constrained Optimization for Accurate Localization and Deviation Correction in Construction Monitoring
Asier Bikandi, Muhammad Shaheer, Hriday Bavle, Jayan Jevanesan, Holger Voos, Jose Luis Sanchez-Lopez
Augmented reality (AR) applications for construction monitoring rely on
real-time environmental tracking to visualize architectural elements. However,
construction sites present significant challenges for traditional tracking
methods due to featureless surfaces, dynamic changes, and drift accumulation,
leading to misalignment between digital models and the physical world. This
paper proposes a BIM-aware drift correction method to address these challenges.
Instead of relying solely on SLAM-based localization, we align "as-built"
detected planes from the real-world environment with "as-planned"
architectural planes in BIM. Our method performs robust plane matching and
computes a transformation (TF) between SLAM (S) and BIM (B) origin frames using
optimization techniques, minimizing drift over time. By incorporating BIM as
prior structural knowledge, we can achieve improved long-term localization and
enhanced AR visualization accuracy in noisy construction environments. The
method is evaluated through real-world experiments, showing significant
reductions in drift-induced errors and optimized alignment consistency. On
average, our system achieves a reduction of 52.24% in angular deviations and a
reduction of 60.8% in the distance error of the matched walls compared to the
initial manual alignment by the user.
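As a rough sketch of plane-based frame alignment (not the paper's implementation), the snippet below recovers a SLAM-to-BIM rotation by aligning matched plane normals with an SVD and the translation from the plane offsets; the plane parameterization n·x = d is assumed.

```python
# Hedged sketch (not the paper's implementation) of estimating a SLAM-to-BIM
# transform from matched planes. Each plane is (n, d) with n . x = d. Rotation
# is recovered by aligning normals (Kabsch/SVD); translation from the offsets.
import numpy as np

def slam_to_bim_transform(planes_slam, planes_bim):
    Ns = np.array([n for n, _ in planes_slam])        # (K, 3) as-built normals
    Nb = np.array([n for n, _ in planes_bim])         # (K, 3) as-planned normals
    ds = np.array([d for _, d in planes_slam])
    db = np.array([d for _, d in planes_bim])
    U, _, Vt = np.linalg.svd(Nb.T @ Ns)               # rotation aligning Ns -> Nb
    R = U @ np.diag([1, 1, np.linalg.det(U @ Vt)]) @ Vt
    # With x_bim = R x_slam + t, matched planes give n_b . t = d_b - d_s.
    t, *_ = np.linalg.lstsq(Nb, db - ds, rcond=None)
    return R, t
```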
☆ DiMeR: Disentangled Mesh Reconstruction Model
Lutao Jiang, Jiantao Lin, Kanghao Chen, Wenhang Ge, Xin Yang, Yifan Jiang, Yuanhuiyi Lyu, Xu Zheng, Yingcong Chen
With the advent of large-scale 3D datasets, feed-forward 3D generative
models, such as the Large Reconstruction Model (LRM), have gained significant
attention and achieved remarkable success. However, we observe that RGB images
often lead to conflicting training objectives and lack the necessary clarity
for geometry reconstruction. In this paper, we revisit the inductive biases
associated with mesh reconstruction and introduce DiMeR, a novel disentangled
dual-stream feed-forward model for sparse-view mesh reconstruction. The key
idea is to disentangle both the input and framework into geometry and texture
parts, thereby reducing the training difficulty for each part according to the
Principle of Occam's Razor. Given that normal maps are strictly consistent with
geometry and accurately capture surface variations, we utilize normal maps as
exclusive input for the geometry branch to reduce the complexity between the
network's input and output. Moreover, we improve the mesh extraction algorithm
to introduce 3D ground truth supervision. As for the texture branch, we use RGB
images as input to obtain the textured mesh. Overall, DiMeR demonstrates robust
capabilities across various tasks, including sparse-view reconstruction,
single-image-to-3D, and text-to-3D. Numerous experiments show that DiMeR
significantly outperforms previous methods, achieving over 30% improvement in
Chamfer Distance on the GSO and OmniObject3D datasets.
comment: Project Page: https://lutao2021.github.io/DiMeR_page/
☆ Aerial Image Classification in Scarce and Unconstrained Environments via Conformal Prediction
This paper presents a comprehensive empirical analysis of conformal
prediction methods on a challenging aerial image dataset featuring diverse
events in unconstrained environments. Conformal prediction is a powerful
post-hoc technique that takes the output of any classifier and transforms it
into a set of likely labels, providing a statistical guarantee on the coverage
of the true label. Unlike evaluations on standard benchmarks, our study
addresses the complexities of data-scarce and highly variable real-world
settings. We investigate the effectiveness of leveraging pretrained models
(MobileNet, DenseNet, and ResNet), fine-tuned with limited labeled data, to
generate informative prediction sets. To further evaluate the impact of
calibration, we consider two parallel pipelines (with and without temperature
scaling) and assess performance using two key metrics: empirical coverage and
average prediction set size. This setup allows us to systematically examine how
calibration choices influence the trade-off between reliability and efficiency.
Our findings demonstrate that even with relatively small labeled samples and
simple nonconformity scores, conformal prediction can yield valuable
uncertainty estimates for complex tasks. Moreover, our analysis reveals that
while temperature scaling is often employed for calibration, it does not
consistently lead to smaller prediction sets, underscoring the importance of
careful consideration in its application. Furthermore, our results highlight
the significant potential of model compression techniques within the conformal
prediction pipeline for deployment in resource-constrained environments. Based
on our observations, we advocate for future research to delve into the impact
of noisy or ambiguous labels on conformal prediction performance and to explore
effective model reduction strategies.
comment: 17 pages, 5 figures, and 2 tables
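A minimal split-conformal sketch of the kind of pipeline evaluated here, under our own assumptions (softmax-based nonconformity score; NumPy >= 1.22 for the quantile `method` argument); it is not the paper's code.

```python
# Minimal split-conformal sketch: nonconformity = 1 - softmax probability of the
# true class on a calibration split; prediction sets on the test split; metrics
# are empirical coverage and average set size. Requires NumPy >= 1.22.
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]          # nonconformity
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level, method="higher")
    return test_probs >= 1.0 - q                                # boolean sets (N, K)

def coverage_and_size(sets, test_labels):
    coverage = sets[np.arange(len(test_labels)), test_labels].mean()
    return coverage, sets.sum(axis=1).mean()
```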
☆ CLIPSE -- a minimalistic CLIP-based image search engine for research
A brief overview of CLIPSE, a self-hosted image search engine intended primarily
for research use, is provided. In general, CLIPSE uses CLIP embeddings
to process the images and also the text queries. The overall framework is
designed with simplicity to enable easy extension and usage. Two benchmark
scenarios are described and evaluated, covering indexing and querying time. It
is shown that CLIPSE is capable of handling smaller datasets; for larger
datasets, a distributed approach with several instances should be considered.
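A rough sketch of the core idea (not the CLIPSE codebase), assuming the Hugging Face CLIP API and the public openai/clip-vit-base-patch32 checkpoint: embed images once, embed the text query, and rank by cosine similarity.

```python
# Sketch of a CLIP-based image search core (our illustration, not CLIPSE itself),
# using the Hugging Face transformers CLIP API.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def index_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def search(query, index, k=5):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)
    scores = (index @ q.T).squeeze(1)          # cosine similarity per image
    return scores.topk(min(k, len(scores)))
```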
☆ A Guide to Structureless Visual Localization
Vojtech Panek, Qunjie Zhou, Yaqing Ding, Sérgio Agostinho, Zuzana Kukelova, Torsten Sattler, Laura Leal-Taixé
Visual localization algorithms, i.e., methods that estimate the camera pose
of a query image in a known scene, are core components of many applications,
including self-driving cars and augmented / mixed reality systems.
State-of-the-art visual localization algorithms are structure-based, i.e., they
store a 3D model of the scene and use 2D-3D correspondences between the query
image and 3D points in the model for camera pose estimation. While such
approaches are highly accurate, they are also rather inflexible when it comes
to adjusting the underlying 3D model after changes in the scene. Structureless
localization approaches represent the scene as a database of images with known
poses and thus offer a much more flexible representation that can be easily
updated by adding or removing images. Although there is a large amount of
literature on structure-based approaches, there is significantly less work on
structureless methods. Hence, this paper is dedicated to providing the, to the
best of our knowledge, first comprehensive discussion and comparison of
structureless methods. Extensive experiments show that approaches that use a
higher degree of classical geometric reasoning generally achieve higher pose
accuracy. In particular, approaches based on classical absolute or
semi-generalized relative pose estimation outperform very recent methods based
on pose regression by a wide margin. Compared with state-of-the-art
structure-based approaches, the flexibility of structureless methods comes at
the cost of (slightly) lower pose accuracy, indicating an interesting direction
for future work.
☆ Beyond Labels: Zero-Shot Diabetic Foot Ulcer Wound Segmentation with Self-attention Diffusion Models and the Potential for Text-Guided Customization
Abderrachid Hamrani, Daniela Leizaola, Renato Sousa, Jose P. Ponce, Stanley Mathis, David G. Armstrong, Anuradha Godavarty
Diabetic foot ulcers (DFUs) pose a significant challenge in healthcare,
requiring precise and efficient wound assessment to enhance patient outcomes.
This study introduces the Attention Diffusion Zero-shot Unsupervised System
(ADZUS), a novel text-guided diffusion model that performs wound segmentation
without relying on labeled training data. Unlike conventional deep learning
models, which require extensive annotation, ADZUS leverages zero-shot learning
to dynamically adapt segmentation based on descriptive prompts, offering
enhanced flexibility and adaptability in clinical applications. Experimental
evaluations demonstrate that ADZUS surpasses traditional and state-of-the-art
segmentation models, achieving an IoU of 86.68\% and the highest precision of
94.69\% on the chronic wound dataset, outperforming supervised approaches such
as FUSegNet. Further validation on a custom-curated DFU dataset reinforces its
robustness, with ADZUS achieving a median DSC of 75\%, significantly surpassing
FUSegNet's 45\%. The model's text-guided segmentation capability enables
real-time customization of segmentation outputs, allowing targeted analysis of
wound characteristics based on clinical descriptions. Despite its competitive
performance, the computational cost of diffusion-based inference and the need
for potential fine-tuning remain areas for future improvement. ADZUS represents
a transformative step in wound segmentation, providing a scalable, efficient,
and adaptable AI-driven solution for medical imaging.
comment: 12 pages, 8 figures, journal article
☆ Improving Open-World Object Localization by Discovering Background
Ashish Singh, Michael J. Jones, Kuan-Chuan Peng, Anoop Cherian, Moitreya Chatterjee, Erik Learned-Miller
Our work addresses the problem of learning to localize objects in an
open-world setting, i.e., given the bounding box information of a limited
number of object classes during training, the goal is to localize all objects,
belonging to both the training and unseen classes in an image, during
inference. Towards this end, recent work in this area has focused on improving
the characterization of objects either explicitly by proposing new objective
functions (localization quality) or implicitly using object-centric
auxiliary information, such as depth information, pixel/region affinity maps,
etc. In this work, we address this problem by incorporating background
information to guide the learning of the notion of objectness. Specifically, we
propose a novel framework to discover background regions in an image and train
an object proposal network to not detect any objects in these regions. We
formulate the background discovery task as that of identifying image regions
that are not discriminative, i.e., those that are redundant and constitute low
information content. We conduct experiments on standard benchmarks to showcase
the effectiveness of our proposed approach and observe significant improvements
over the previous state-of-the-art approaches for this task.
☆ Enhancing CNNs robustness to occlusions with bioinspired filters for border completion
We exploit the mathematical modeling of the visual cortex mechanism for
border completion to define custom filters for CNNs. We see a consistent
improvement in performance, particularly in accuracy, when our modified LeNet 5
is tested with occluded MNIST images.
comment: Submitted to the 7th International Conference on Geometric Science of
Information
☆ The effects of Hessian eigenvalue spectral density type on the applicability of Hessian analysis to generalization capability assessment of neural networks
Hessians of neural networks (NNs) contain essential information about the
curvature of NN loss landscapes, which can be used to estimate NN generalization
capabilities. We have previously proposed generalization criteria that rely on
the observation that Hessian eigenvalue spectral density (HESD) behaves
similarly for a wide class of NNs. This paper further studies their
applicability by investigating factors that can result in different types of
HESD. We conduct a wide range of experiments showing that HESD mainly has
positive eigenvalues (MP-HESD) for NN training and fine-tuning with various
optimizers on different datasets with different preprocessing and augmentation
procedures. We also show that mainly negative HESD (MN-HESD) is a consequence
of external gradient manipulation, indicating that the previously proposed
Hessian analysis methodology cannot be applied in such cases. We also propose
criteria and corresponding conditions to determine HESD type and estimate NN
generalization potential. These HESD types and previously proposed
generalization criteria are combined into a unified HESD analysis methodology.
Finally, we discuss how HESD changes during training, and show the occurrence
of quasi-singular (QS) HESD and its influence on the proposed methodology and
on the conventional assumptions about the relation between Hessian eigenvalues
and NN loss landscape curvature.
comment: 11 pages, 10 figures, 4 tables, 4 equations
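As background on the primitive such analyses rest on, the sketch below (ours, not the paper's code) computes Hessian-vector products by double backpropagation and estimates the largest-magnitude Hessian eigenvalue by power iteration; parameters are assumed to share one device and dtype.

```python
# Illustrative sketch (not the paper's code): Hessian-vector products via double
# backprop and power iteration for the largest-magnitude Hessian eigenvalue --
# the kind of primitive that Hessian spectral-density estimators build on.
import torch

def hvp(loss, params, vec):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat @ vec, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv]).detach()

def top_eigenvalue(loss, params, iters=50):
    n = sum(p.numel() for p in params)
    v = torch.randn(n, device=params[0].device)
    v /= v.norm()
    for _ in range(iters):
        hv = hvp(loss, params, v)
        v = hv / (hv.norm() + 1e-12)
    return torch.dot(v, hvp(loss, params, v)).item()   # Rayleigh quotient
```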
☆ STCL: Curriculum Learning Strategies for Deep Learning Image Steganography Models
To address the poor quality of steganographic images and the slow
convergence of deep learning-based image steganography models, this
paper proposes a Steganography Curriculum Learning training strategy (STCL) for
deep learning image steganography models. The strategy selects only easy images
for training when the model has poor fitting ability in the initial stage and
gradually expands to more difficult images; it comprises a difficulty
evaluation strategy based on teacher models and a knee point-based training
scheduling strategy. First, multiple teacher models are trained, and the
consistency of steganographic image quality across the teacher
models is used as a difficulty score to construct training subsets ordered from
easy to difficult. Second, a training control strategy based on knee points
is proposed to reduce the possibility of overfitting on small training sets and
to accelerate the training process. Experimental results on three large public
datasets, ALASKA2, VOC2012 and ImageNet, show that the proposed
scheme improves model performance under multiple
algorithmic frameworks: it not only achieves high PSNR, SSIM, and
decoding accuracy, but the steganographic images generated by models trained
with the STCL strategy also obtain low steganalysis
scores. You can find our code at
\href{https://github.com/chaos-boops/STCL}{https://github.com/chaos-boops/STCL}.
☆ Tamper-evident Image using JPEG Fixed Points
An intriguing phenomenon about JPEG compression has been observed for two
decades: repeated JPEG compression and decompression eventually leads to a
stable image that no longer changes, i.e., a fixed point. In this
work, we prove the existence of fixed points in the essential JPEG procedures.
We analyze JPEG compression and decompression processes, revealing the
existence of fixed points that can be reached within a few iterations. These
fixed points are diverse and preserve the image's visual quality, ensuring
minimal distortion. This result is used to develop a method to create a
tamper-evident image from the original authentic image, which can expose
tampering operations by showing deviations from the fixed point image.
comment: 6 pages, 6 figures
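The phenomenon is easy to reproduce; the sketch below (our illustration, using Pillow) repeatedly compresses and decompresses an image at a fixed quality until the decoded pixels stop changing.

```python
# Quick sketch to observe the fixed-point behavior described above: repeatedly
# JPEG-compress and decompress an image until the decoded pixels stop changing.
import io
import numpy as np
from PIL import Image

def jpeg_fixed_point(img, quality=75, max_iters=100):
    prev = np.asarray(img.convert("RGB"))
    for i in range(max_iters):
        buf = io.BytesIO()
        Image.fromarray(prev).save(buf, format="JPEG", quality=quality)
        buf.seek(0)
        cur = np.asarray(Image.open(buf).convert("RGB"))
        if np.array_equal(cur, prev):
            return Image.fromarray(cur), i        # fixed point reached
        prev = cur
    return Image.fromarray(prev), max_iters
```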
☆ RGB-D Tracking via Hierarchical Modality Aggregation and Distribution Network
The integration of dual-modal features has been pivotal in advancing
RGB-Depth (RGB-D) tracking. However, current trackers are less efficient and
focus solely on single-level features, resulting in weaker robustness in fusion
and slower speeds that fail to meet the demands of real-world applications. In
this paper, we introduce a novel network, denoted as HMAD (Hierarchical
Modality Aggregation and Distribution), which addresses these challenges. HMAD
leverages the distinct feature representation strengths of RGB and depth
modalities, giving prominence to a hierarchical approach for feature
distribution and fusion, thereby enhancing the robustness of RGB-D tracking.
Experimental results on various RGB-D datasets demonstrate that HMAD achieves
state-of-the-art performance. Moreover, real-world experiments further validate
HMAD's capacity to effectively handle a spectrum of tracking challenges in
real-time scenarios.
☆ Occlusion-Aware Self-Supervised Monocular Depth Estimation for Weak-Texture Endoscopic Images
We propose a self-supervised monocular depth estimation network tailored for
endoscopic scenes, aiming to infer depth within the gastrointestinal tract from
monocular images. Existing methods, though accurate, typically assume
consistent illumination, which is often violated due to dynamic lighting and
occlusions caused by GI motility. These variations lead to incorrect geometric
interpretations and unreliable self-supervised signals, degrading depth
reconstruction quality. To address this, we introduce an occlusion-aware
self-supervised framework. First, we incorporate an occlusion mask for data
augmentation, generating pseudo-labels by simulating viewpoint-dependent
occlusion scenarios. This enhances the model's ability to learn robust depth
features under partial visibility. Second, we leverage semantic segmentation
guided by non-negative matrix factorization, clustering convolutional
activations to generate pseudo-labels in texture-deprived regions, thereby
improving segmentation accuracy and mitigating information loss from lighting
changes. Experimental results on the SCARED dataset show that our method
achieves state-of-the-art performance in self-supervised depth estimation.
Additionally, evaluations on the Endo-SLAM and SERV-CT datasets demonstrate
strong generalization across diverse endoscopic environments.
☆ Unsupervised Urban Land Use Mapping with Street View Contrastive Clustering and a Geographical Prior
Urban land use classification and mapping are critical for urban planning,
resource management, and environmental monitoring. Existing remote sensing
techniques often lack precision in complex urban environments due to the
absence of ground-level details. Unlike aerial perspectives, street view images
provide a ground-level view that captures more human and social activities
relevant to land use in complex urban scenes. Existing street view-based
methods primarily rely on supervised classification, which is challenged by the
scarcity of high-quality labeled data and the difficulty of generalizing across
diverse urban landscapes. This study introduces an unsupervised contrastive
clustering model for street view images with a built-in geographical prior, to
enhance clustering performance. When combined with a simple visual assignment
of the clusters, our approach offers a flexible and customizable solution to
land use mapping, tailored to the specific needs of urban planners. We
experimentally show that our method can generate land use maps from geotagged
street view image datasets of two cities. As our methodology relies on the
universal spatial coherence of geospatial data ("Tobler's law"), it can be
adapted to various settings where street view images are available, to enable
scalable, unsupervised land use mapping and updating. The code will be
available at https://github.com/lin102/CCGP.
comment: 11 pages, 7 figures, preprint version
☆ A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task
Knowledge-based Vision Question Answering (KB-VQA) extends general Vision
Question Answering (VQA) by requiring not only the understanding of visual and
textual inputs but also an extensive range of knowledge, enabling significant
advancements across various real-world applications. KB-VQA introduces unique
challenges, including the alignment of heterogeneous information from diverse
modalities and sources, the retrieval of relevant knowledge from noisy or
large-scale repositories, and the execution of complex reasoning to infer
answers from the combined context. With the advancement of Large Language
Models (LLMs), KB-VQA systems have also undergone a notable transformation,
where LLMs serve as powerful knowledge repositories, retrieval-augmented
generators and strong reasoners. Despite substantial progress, no comprehensive
survey currently exists that systematically organizes and reviews the existing
KB-VQA methods. This survey aims to fill this gap by establishing a structured
taxonomy of KB-VQA approaches, and categorizing the systems into main stages:
knowledge representation, knowledge retrieval, and knowledge reasoning. By
exploring various knowledge integration techniques and identifying persistent
challenges, this work also outlines promising future research directions,
providing a foundation for advancing KB-VQA models and their applications.
comment: 20 pages, 5 figures, 4 tables
☆ When Gaussian Meets Surfel: Ultra-fast High-fidelity Radiance Field Rendering
We introduce Gaussian-enhanced Surfels (GESs), a bi-scale representation for
radiance field rendering, wherein a set of 2D opaque surfels with
view-dependent colors represent the coarse-scale geometry and appearance of
scenes, and a few 3D Gaussians surrounding the surfels supplement fine-scale
appearance details. The rendering with GESs consists of two passes -- surfels
are first rasterized through a standard graphics pipeline to produce depth and
color maps, and then Gaussians are splatted with depth testing and color
accumulation performed on each pixel in an order-independent manner. The optimization of GESs from
multi-view images is performed through an elaborate coarse-to-fine procedure,
faithfully capturing rich scene appearance. The entirely sorting-free rendering
of GESs not only achieves very fast rates, but also produces view-consistent
images, successfully avoiding popping artifacts under view changes. The basic
GES representation can be easily extended to achieve anti-aliasing in rendering
(Mip-GES), boosted rendering speeds (Speedy-GES) and compact storage
(Compact-GES), and reconstruct better scene geometries by replacing 3D
Gaussians with 2D Gaussians (2D-GES). Experimental results show that GESs
advance the state of the art as a compelling representation for ultra-fast
high-fidelity radiance field rendering.
☆ An Explainable Nature-Inspired Framework for Monkeypox Diagnosis: Xception Features Combined with NGBoost and African Vultures Optimization Algorithm
The recent global spread of monkeypox, particularly in regions where it has
not historically been prevalent, has raised significant public health concerns.
Early and accurate diagnosis is critical for effective disease management and
control. In response, this study proposes a novel deep learning-based framework
for the automated detection of monkeypox from skin lesion images, leveraging
the power of transfer learning, dimensionality reduction, and advanced machine
learning techniques. We utilize the newly developed Monkeypox Skin Lesion
Dataset (MSLD), which includes images of monkeypox, chickenpox, and measles, to
train and evaluate our models. The proposed framework employs the Xception
architecture for deep feature extraction, followed by Principal Component
Analysis (PCA) for dimensionality reduction, and the Natural Gradient Boosting
(NGBoost) algorithm for classification. To optimize the model's performance and
generalization, we introduce the African Vultures Optimization Algorithm (AVOA)
for hyperparameter tuning, ensuring efficient exploration of the parameter
space. Our results demonstrate that the proposed AVOA-NGBoost model achieves
state-of-the-art performance, with an accuracy of 97.53%, F1-score of 97.72%
and an AUC of 97.47%. Additionally, we enhance model interpretability using
Grad-CAM and LIME techniques, providing insights into the decision-making
process and highlighting key features influencing classification. This
framework offers a highly precise and efficient diagnostic tool, potentially
aiding healthcare providers in early detection and diagnosis, particularly in
resource-constrained environments.
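A compressed sketch of the described pipeline under stated assumptions (the AVOA hyperparameter search is omitted; the Keras Xception backbone and the ngboost package's `NGBClassifier`/`k_categorical` API are assumed): deep features, then PCA, then NGBoost.

```python
# Sketch of the described pipeline under stated assumptions (AVOA hyperparameter
# tuning omitted): Xception deep features -> PCA -> NGBoost classifier.
from tensorflow.keras.applications import Xception
from tensorflow.keras.applications.xception import preprocess_input
from sklearn.decomposition import PCA
from ngboost import NGBClassifier
from ngboost.distns import k_categorical

backbone = Xception(weights="imagenet", include_top=False, pooling="avg")

def extract_features(images):                 # images: (N, 299, 299, 3) uint8
    return backbone.predict(preprocess_input(images.astype("float32")))

def fit_pipeline(train_images, train_labels, n_components=64, n_classes=3):
    feats = extract_features(train_images)
    pca = PCA(n_components=n_components).fit(feats)
    clf = NGBClassifier(Dist=k_categorical(n_classes)).fit(
        pca.transform(feats), train_labels)
    return pca, clf
```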
☆ Text-to-Image Alignment in Denoising-Based Models through Step Selection
Visual generative AI models often encounter challenges related to text-image
alignment and reasoning limitations. This paper presents a novel method for
selectively enhancing the signal at critical denoising steps, optimizing image
generation based on input semantics. Our approach addresses the shortcomings of
early-stage signal modifications, demonstrating that adjustments made at later
stages yield superior results. We conduct extensive experiments to validate the
effectiveness of our method in producing semantically aligned images on
Diffusion and Flow Matching models, achieving state-of-the-art performance. Our
results highlight the importance of a judicious choice of sampling stage to
improve performance and overall image alignment.
☆ ESDiff: Encoding Strategy-inspired Diffusion Model with Few-shot Learning for Color Image Inpainting
Image inpainting is a technique used to restore missing or damaged regions of
an image. Traditional methods primarily utilize information from adjacent
pixels for reconstructing missing areas, while they struggle to preserve
complex details and structures. Simultaneously, models based on deep learning
necessitate substantial amounts of training data. To address this challenge, an
encoding strategy-inspired diffusion model with few-shot learning for color
image inpainting is proposed in this paper. The main idea of this novel
encoding strategy is the deployment of a "virtual mask" to construct
high-dimensional objects through mutual perturbations between channels. This
approach enables the diffusion model to capture diverse image representations
and detailed features from limited training samples. Moreover, the encoding
strategy leverages redundancy between channels, integrates with low-rank
methods during iterative inpainting, and incorporates the diffusion model to
achieve accurate information output. Experimental results indicate that our
method exceeds current techniques on quantitative metrics, and the quality of
the reconstructed images is improved in terms of texture and
structural integrity, leading to more precise and coherent results.
comment: 11 pages, 10 figures, submitted to TCSVT
☆ Towards One-Stage End-to-End Table Structure Recognition with Parallel Regression for Diverse Scenarios
Table structure recognition aims to parse tables in unstructured data into
machine-understandable formats. Recent methods address this problem through a
two-stage process or optimized one-stage approaches. However, these methods
either require multiple networks to be serially trained and perform more
time-consuming sequential decoding, or rely on complex post-processing
algorithms to parse the logical structure of tables. They struggle to balance
cross-scenario adaptability, robustness, and computational efficiency. In this
paper, we propose a one-stage end-to-end table structure parsing network called
TableCenterNet. This network unifies the prediction of table spatial and
logical structure into a parallel regression task for the first time, and
implicitly learns the spatial-logical location mapping laws of cells through a
synergistic architecture of shared feature extraction layers and task-specific
decoding. Compared with two-stage methods, our method is easier to train and
faster to infer. Experiments on benchmark datasets show that TableCenterNet can
effectively parse table structures in diverse scenarios and achieve
state-of-the-art performance on the TableGraph-24k dataset. Code is available
at https://github.com/dreamy-xay/TableCenterNet.
☆ Mamba-Sea: A Mamba-based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation IEEE
To segment medical images with distribution shifts, domain generalization
(DG) has emerged as a promising setting to train models on source domains that
can generalize to unseen target domains. Existing DG methods are mainly based
on CNN or ViT architectures. Recently, advanced state space models, represented
by Mamba, have shown promising results in various supervised medical image
segmentation. The success of Mamba is primarily owing to its ability to capture
long-range dependencies while keeping linear complexity with input sequence
length, making it a promising alternative to CNNs and ViTs. Inspired by this
success, in this paper we explore the potential of the Mamba architecture to
address distribution shifts in DG for medical image segmentation. Specifically,
we propose a novel Mamba-based framework, Mamba-Sea, incorporating
global-to-local sequence augmentation to improve the model's generalizability
under domain shift issues. Our Mamba-Sea introduces a global augmentation
mechanism designed to simulate potential variations in appearance across
different sites, aiming to suppress the model's learning of domain-specific
information. At the local level, we propose a sequence-wise augmentation along
input sequences, which perturbs the style of tokens within random continuous
sub-sequences by modeling and resampling style statistics associated with
domain shifts. To the best of our knowledge, Mamba-Sea is the first work to explore
the generalization of Mamba for medical image segmentation, providing an
advanced and promising Mamba-based architecture with strong robustness to
domain shifts. Remarkably, our proposed method is the first to surpass a Dice
coefficient of 90% on the Prostate dataset, which exceeds previous SOTA of
88.61%. The code is available at https://github.com/orange-czh/Mamba-Sea.
comment: Accepted by IEEE TMI 2025. The code is available at
https://github.com/orange-czh/Mamba-Sea
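One plausible reading of the sequence-wise style augmentation, offered as our interpretation rather than the released code: perturb the style statistics of a random contiguous token sub-sequence by mixing its mean and standard deviation with those of another sample in the batch, in the spirit of MixStyle.

```python
# Our interpretation (not the released Mamba-Sea code) of a sequence-wise style
# augmentation: re-normalize a random contiguous token sub-sequence with style
# statistics mixed from another sample in the batch (MixStyle-like).
import torch

def sequence_style_augment(x, alpha=0.3):
    """x: (B, L, C) token sequence."""
    B, L, C = x.shape
    start = torch.randint(0, L // 2, (1,)).item()
    end = start + torch.randint(1, L - start + 1, (1,)).item()
    seg = x[:, start:end]
    mu = seg.mean(dim=1, keepdim=True)
    sigma = seg.std(dim=1, keepdim=True, unbiased=False) + 1e-6
    perm = torch.randperm(B)
    lam = torch.distributions.Beta(alpha, alpha).sample((B, 1, 1)).to(x)
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sigma_mix = lam * sigma + (1 - lam) * sigma[perm]
    out = x.clone()
    out[:, start:end] = (seg - mu) / sigma * sigma_mix + mu_mix
    return out
```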
☆ RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor
Subject-driven text-to-image (T2I) generation aims to produce images that
align with a given textual description, while preserving the visual identity
from a referenced subject image. Despite its broad downstream applicability --
ranging from enhanced personalization in image generation to consistent
character representation in video rendering -- progress in this field is
limited by the lack of reliable automatic evaluation. Existing methods either
assess only one aspect of the task (i.e., textual alignment or subject
preservation), misalign with human judgments, or rely on costly API-based
evaluation. To address this, we introduce RefVNLI, a cost-effective metric that
evaluates both textual alignment and subject preservation in a single
prediction. Trained on a large-scale dataset derived from video-reasoning
benchmarks and image perturbations, RefVNLI outperforms or matches existing
baselines across multiple benchmarks and subject categories (e.g.,
\emph{Animal}, \emph{Object}), achieving up to 6.4-point gains in textual
alignment and 8.5-point gains in subject consistency. It also excels with
lesser-known concepts, aligning with human preferences at over 87\% accuracy.
☆ Enhanced Sample Selection with Confidence Tracking: Identifying Correctly Labeled yet Hard-to-Learn Samples in Noisy Data
We propose a novel sample selection method for image classification in the
presence of noisy labels. Existing methods typically consider small-loss
samples as correctly labeled. However, some correctly labeled samples are
inherently difficult for the model to learn and can exhibit high loss similar
to mislabeled samples in the early stages of training. Consequently, setting a
threshold on per-sample loss to select correct labels results in a trade-off
between precision and recall in sample selection: a lower threshold may miss
many correctly labeled hard-to-learn samples (low recall), while a higher
threshold may include many mislabeled samples (low precision). To address this
issue, our goal is to accurately distinguish correctly labeled yet
hard-to-learn samples from mislabeled ones, thus alleviating the trade-off
dilemma. We achieve this by considering the trends in model prediction
confidence rather than relying solely on loss values. Empirical observations
show that only for correctly labeled samples, the model's prediction confidence
for the annotated labels typically increases faster than for any other classes.
Based on this insight, we propose tracking the confidence gaps between the
annotated labels and other classes during training and evaluating their trends
using the Mann-Kendall Test. A sample is considered potentially correctly
labeled if all its confidence gaps tend to increase. Our method functions as a
plug-and-play component that can be seamlessly integrated into existing sample
selection techniques. Experiments on several standard benchmarks and real-world
datasets demonstrate that our method enhances the performance of existing
methods for learning with noisy labels.
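A sketch of the selection rule under our assumptions (a basic Mann-Kendall trend test implemented directly; the significance level is illustrative): a sample is kept only if its confidence gap to every other class shows a significant upward trend across epochs.

```python
# Sketch of the selection rule (illustrative thresholds, not the paper's code):
# keep a sample as "likely correctly labeled" if its confidence gap to every
# other class shows a significant increasing Mann-Kendall trend over epochs.
import numpy as np
from scipy.stats import norm

def mann_kendall_increasing(series, alpha=0.05):
    x = np.asarray(series)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0            # no-ties variance
    z = (s - np.sign(s)) / np.sqrt(var_s) if s != 0 else 0.0
    return z > norm.ppf(1 - alpha)                       # one-sided upward-trend test

def select_sample(gap_history):
    """gap_history: (epochs, num_other_classes) confidence gaps over training."""
    return all(mann_kendall_increasing(gap_history[:, k])
               for k in range(gap_history.shape[1]))
```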
☆ Unveiling Hidden Vulnerabilities in Digital Human Generation via Adversarial Attacks
Zhiying Li, Yeying Jin, Fan Shen, Zhi Liu, Weibin Chen, Pengju Zhang, Xiaomei Zhang, Boyu Chen, Michael Shen, Kejian Wu, Zhaoxin Fan, Jin Dong
Expressive human pose and shape estimation (EHPS) is crucial for digital
human generation, especially in applications like live streaming. While
existing research primarily focuses on reducing estimation errors, it largely
neglects robustness and security aspects, leaving these systems vulnerable to
adversarial attacks. To address this significant challenge, we propose the
\textbf{Tangible Attack (TBA)}, a novel framework designed to generate
adversarial examples capable of effectively compromising any digital human
generation model. Our approach introduces a \textbf{Dual Heterogeneous Noise
Generator (DHNG)}, which leverages Variational Autoencoders (VAE) and
ControlNet to produce diverse, targeted noise tailored to the original image
features. Additionally, we design a custom \textbf{adversarial loss function}
to optimize the noise, ensuring both high controllability and potent
disruption. By iteratively refining the adversarial sample through
multi-gradient signals from both the noise and the state-of-the-art EHPS model,
TBA substantially improves the effectiveness of adversarial attacks. Extensive
experiments demonstrate TBA's superiority, achieving a remarkable 41.0\%
increase in estimation error, with an average improvement of approximately
17.0\%. These findings expose significant security vulnerabilities in current
EHPS models and highlight the need for stronger defenses in digital human
generation systems.
comment: 14 pages, 7 figures
☆ FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
There has been impressive progress in Large Multimodal Models (LMMs). Recent
works extend these models to long inputs, including multi-page documents and
long videos. However, the model size and performance of these long context
models are still limited due to the computational cost in both training and
inference. In this work, we explore an orthogonal direction and process long
inputs without long context LMMs. We propose Frame Selection Augmented
Generation (FRAG), where the model first selects relevant frames within the
input, and then only generates the final outputs based on the selected frames.
The core of the selection process is done by scoring each frame independently,
which does not require long context processing. The frames with the highest
scores are then selected by a simple Top-K selection. We show that this
frustratingly simple framework is applicable to both long videos and multi-page
documents using existing LMMs without any fine-tuning. We consider two models,
LLaVA-OneVision and InternVL2, in our experiments and show that FRAG
consistently improves the performance and achieves state-of-the-art
performances for both long video and long document understanding. For videos,
FRAG substantially improves InternVL2-76B by 5.8% on MLVU and 3.7% on
Video-MME. For documents, FRAG achieves over 20% improvements on MP-DocVQA
compared with recent LMMs specialized in long document understanding. Code is
available at: https://github.com/NVlabs/FRAG
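The selection-then-generation loop is simple enough to sketch in a few lines; `score_frame` and `answer_with_frames` below are placeholders standing in for prompting an existing LMM, not FRAG's actual interface.

```python
# Frustratingly simple frame selection in the spirit of FRAG (a sketch with
# placeholder callables; `score_frame` and `answer_with_frames` stand in for
# prompting an existing LMM such as LLaVA-OneVision or InternVL2).
def frag_answer(frames, question, score_frame, answer_with_frames, k=8):
    # 1) Score each frame independently -- no long-context processing needed.
    scores = [score_frame(frame, question) for frame in frames]
    # 2) Keep the Top-K highest-scoring frames, preserving temporal order.
    top = sorted(sorted(range(len(frames)), key=lambda i: scores[i])[-k:])
    # 3) Generate the final answer from the selected frames only.
    return answer_with_frames([frames[i] for i in top], question)
```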
☆ Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding
Humans can resort to long-form inspection to build intuition on predicting
the 3D configurations of unseen objects. The more we observe the object motion,
the better we get at predicting its 3D state immediately. Existing systems
either optimize underlying representations from multi-view observations or
train a feed-forward predictor from supervised datasets. We introduce
Predict-Optimize-Distill (POD), a self-improving framework that interleaves
prediction and optimization in a mutually reinforcing cycle to achieve better
4D object understanding with increasing observation time. Given a multi-view
object scan and a long-form monocular video of human-object interaction, POD
iteratively trains a neural network to predict local part poses from RGB
frames, uses this predictor to initialize a global optimization which refines
output poses through inverse rendering, then finally distills the results of
optimization back into the model by generating synthetic self-labeled training
data from novel viewpoints. Each iteration improves both the predictive model
and the optimized motion trajectory, creating a virtuous cycle that bootstraps
its own training data to learn about the pose configurations of an object. We
also introduce a quasi-multiview mining strategy for reducing depth ambiguity
by leveraging long video. We evaluate POD on 14 real-world and 5 synthetic
objects with various joint types, including revolute and prismatic joints as
well as multi-body configurations where parts detach or reattach independently.
POD demonstrates significant improvement over a pure optimization baseline
which gets stuck in local minima, particularly for longer videos. We also find
that POD's performance improves with both video length and successive
iterations of the self-improving cycle, highlighting its ability to scale
performance with additional observations and looped refinement.
comment: See our website at: https://predict-optimize-distill.github.io/pod.github.io .
  First two authors contributed equally.
☆ Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng
The Contrastive Language-Image Pre-training (CLIP) framework has become a
widely used approach for multimodal representation learning, particularly in
image-text retrieval and clustering. However, its efficacy is constrained by
three key limitations: (1) text token truncation, (2) isolated image-text
encoding, and (3) deficient compositionality due to bag-of-words behavior.
While recent Multimodal Large Language Models (MLLMs) have demonstrated
significant advances in generalized vision-language understanding, their
potential for learning transferable multimodal representations remains
underexplored. In this work, we present UniME (Universal Multimodal Embedding),
a novel two-stage framework that leverages MLLMs to learn discriminative
representations for diverse downstream tasks. In the first stage, we perform
textual discriminative knowledge distillation from a powerful LLM-based teacher
model to enhance the embedding capability of the MLLM's language component. In
the second stage, we introduce hard negative enhanced instruction tuning to
further advance discriminative representation learning. Specifically, we
initially mitigate false negative contamination and then sample multiple hard
negatives per instance within each batch, forcing the model to focus on
challenging samples. This approach not only improves discriminative power but
also enhances instruction-following ability in downstream tasks. We conduct
extensive experiments on the MMEB benchmark and multiple retrieval tasks,
including short and long caption retrieval and compositional retrieval. Results
demonstrate that UniME achieves consistent performance improvement across all
tasks, exhibiting superior discriminative and compositional capabilities.
comment: 13 pages, 8 figures, Project page: https://garygutc.github.io/UniME
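The hard-negative stage above lends itself to a compact sketch. The snippet below shows one plausible in-batch formulation in PyTorch: likely false negatives (candidates scoring higher than the true positive) are masked out, and the hardest remaining candidates are kept for an InfoNCE-style loss. The filtering rule, `num_hard`, and temperature are illustrative assumptions, not the exact UniME objective.

```python
import torch
import torch.nn.functional as F


def hard_negative_infonce(q, c, pos_idx, num_hard=4, temperature=0.05):
    """InfoNCE over a positive plus in-batch hard negatives (assumes batch size > num_hard).

    q:       (B, D) query embeddings (e.g., from the MLLM)
    c:       (B, D) candidate embeddings, where c[pos_idx[i]] is the positive for q[i]
    pos_idx: (B,)   index of each query's positive within the batch
    """
    q = F.normalize(q, dim=-1)
    c = F.normalize(c, dim=-1)
    sim = q @ c.t()                                           # (B, B) cosine similarities
    B = sim.size(0)
    pos_sim = sim[torch.arange(B), pos_idx]                   # (B,)

    neg_sim = sim.clone()
    neg_sim[torch.arange(B), pos_idx] = float("-inf")         # remove the positive itself
    neg_sim[neg_sim >= pos_sim.unsqueeze(1)] = float("-inf")  # drop likely false negatives

    hard_negs, _ = neg_sim.topk(num_hard, dim=1)              # hardest surviving negatives
    logits = torch.cat([pos_sim.unsqueeze(1), hard_negs], dim=1) / temperature
    labels = torch.zeros(B, dtype=torch.long, device=q.device)  # positive sits in column 0
    return F.cross_entropy(logits, labels)
```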
☆ 3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models 3DV
Video try-on replaces clothing in videos with target garments. Existing
methods struggle to generate high-quality and temporally consistent results
when handling complex clothing patterns and diverse body poses. We present
3DV-TON, a novel diffusion-based framework for generating high-fidelity and
temporally consistent video try-on results. Our approach employs generated
animatable textured 3D meshes as explicit frame-level guidance, alleviating the
issue of models over-focusing on appearance fidelity at the expense of motion
coherence. This is achieved by enabling direct reference to consistent garment
texture movements throughout video sequences. The proposed method features an
adaptive pipeline for generating dynamic 3D guidance: (1) selecting a keyframe
for initial 2D image try-on, followed by (2) reconstructing and animating a
textured 3D mesh synchronized with original video poses. We further introduce a
robust rectangular masking strategy that successfully mitigates artifact
propagation caused by leaking clothing information during dynamic human and
garment movements. To advance video try-on research, we introduce HR-VVT, a
high-resolution benchmark dataset containing 130 videos with diverse clothing
types and scenarios. Quantitative and qualitative results demonstrate our
superior performance over existing methods. The project page is at
https://2y7c3.github.io/3DV-TON/
comment: Project page: https://2y7c3.github.io/3DV-TON/
☆ StereoMamba: Real-time and Robust Intraoperative Stereo Disparity Estimation via Long-range Spatial Dependencies
Stereo disparity estimation is crucial for obtaining depth information in
robot-assisted minimally invasive surgery (RAMIS). While current deep learning
methods have made significant advancements, challenges remain in achieving an
optimal balance between accuracy, robustness, and inference speed. To address
these challenges, we propose the StereoMamba architecture, which is
specifically designed for stereo disparity estimation in RAMIS. Our approach is
based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances
long-range spatial dependencies both within and across stereo images. To
effectively integrate multi-scale features from FE-Mamba, we then introduce a
novel Multidimensional Feature Fusion (MFF) module. Experiments against the
state-of-the-art on the ex-vivo SCARED benchmark demonstrate that StereoMamba
achieves superior performance with an EPE of 2.64 px and a depth MAE of 2.55 mm,
and the second-best Bad2 of 41.49% and Bad3 of 26.99%, while maintaining
an inference speed of 21.28 FPS for a pair of high-resolution images
(1280*1024), striking the optimum balance between accuracy, robustness, and
efficiency. Furthermore, by comparing synthesized right images, generated from
warping left images using the generated disparity maps, with the actual right
image, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761),
exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS
datasets.
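For reference, the disparity metrics quoted above (EPE and Bad-X) are computed in the standard way; a small NumPy sketch follows. Benchmark-specific masking (e.g., how SCARED marks invalid pixels) is left to the caller, and the thresholds are just the usual 2 px / 3 px choices.

```python
import numpy as np


def disparity_metrics(pred, gt, valid=None, bad_thresholds=(2.0, 3.0)):
    """Standard stereo metrics: mean end-point error (EPE) and Bad-X percentages.

    pred, gt: (H, W) disparity maps in pixels
    valid:    optional boolean mask of pixels with ground-truth disparity
    Returns the mean absolute disparity error and, for each threshold, the
    percentage of valid pixels whose error exceeds it (e.g., Bad2, Bad3).
    """
    if valid is None:
        valid = gt > 0
    err = np.abs(pred[valid] - gt[valid])
    out = {"EPE": float(err.mean())}
    for t in bad_thresholds:
        out[f"Bad{int(t)}"] = float((err > t).mean() * 100.0)
    return out


# Example with synthetic data:
if __name__ == "__main__":
    gt = np.random.uniform(1, 60, size=(256, 320)).astype(np.float32)
    pred = gt + np.random.normal(0, 1.5, size=gt.shape).astype(np.float32)
    print(disparity_metrics(pred, gt))
```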
☆ S2S-Net: Addressing the Domain Gap of Heterogeneous Sensor Systems in LiDAR-Based Collective Perception
Collective Perception (CP) has emerged as a promising approach to overcome
the limitations of individual perception in the context of autonomous driving.
Various approaches have been proposed to realize collective perception;
however, the Sensor2Sensor domain gap that arises from the utilization of
different sensor systems in Connected and Automated Vehicles (CAVs) remains
mostly unaddressed. This is primarily due to the paucity of datasets containing
heterogeneous sensor setups among the CAVs. The recently released SCOPE
datasets address this issue by providing data from three different LiDAR
sensors for each CAV. This study is the first to tackle the Sensor2Sensor
domain gap in vehicle-to-vehicle (V2V) collective perception. First, we present
our sensor-domain robust architecture S2S-Net. Then an in-depth analysis of the
Sensor2Sensor domain adaptation capabilities of S2S-Net on the SCOPE dataset is
conducted. S2S-Net demonstrates the capability to maintain very high
performance in unseen sensor domains and achieves state-of-the-art results on
the SCOPE dataset.
☆ Fine-tune Smarter, Not Harder: Parameter-Efficient Fine-Tuning for Geospatial Foundation Models
Earth observation (EO) is crucial for monitoring environmental changes,
responding to disasters, and managing natural resources. In this context,
foundation models facilitate remote sensing image analysis to retrieve relevant
geoinformation accurately and efficiently. However, as these models grow in
size, fine-tuning becomes increasingly challenging due to the associated
computational resources and costs, limiting their accessibility and
scalability. Furthermore, full fine-tuning can lead to forgetting pre-trained
features and even degrade model generalization. To address this,
Parameter-Efficient Fine-Tuning (PEFT) techniques offer a promising solution.
In this paper, we conduct extensive experiments with various foundation model
architectures and PEFT techniques to evaluate their effectiveness on five
different EO datasets. Our results provide a comprehensive comparison, offering
insights into when and how PEFT methods support the adaptation of pre-trained
geospatial models. We demonstrate that PEFT techniques match or even exceed
full fine-tuning performance and enhance model generalization to unseen
geographic regions, while reducing training time and memory requirements.
Additional experiments investigate the effect of architecture choices such as
the decoder type or the use of metadata, suggesting UNet decoders and
fine-tuning without metadata as the recommended configuration. We have
integrated all evaluated foundation models and techniques into the open-source
package TerraTorch to support quick, scalable, and cost-effective model
adaptation.
comment: Code available at https://github.com/IBM/peft-geofm
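As a concrete reference for what a PEFT technique such as LoRA does, here is a minimal, self-contained PyTorch sketch of a LoRA-wrapped linear layer and a helper that injects it into attention projections. This is written from the standard LoRA formulation, not from TerraTorch; the name-matching heuristic in `add_lora_to_attention` is an assumption, and the integrated package should be used for the configurations actually evaluated in the paper.

```python
import math
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # freeze pre-trained weights
        self.lora_a = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.t() @ self.lora_b.t())


def add_lora_to_attention(model: nn.Module, rank: int = 8):
    """Replace q/k/v/projection linears of a ViT-style backbone with LoRA wrappers (heuristic)."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear) and any(k in name for k in ("qkv", "q", "k", "v", "proj")):
            setattr(model, name, LoRALinear(module, rank=rank))
        else:
            add_lora_to_attention(module, rank)
    return model
```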
☆ SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting
Yiming Zhao, Guorong Li, Laiyun Qing, Amin Beheshti, Jian Yang, Michael Sheng, Yuankai Qi, Qingming Huang
Open-world object counting leverages the robust text-image alignment of
pre-trained vision-language models (VLMs) to enable counting of arbitrary
categories in images specified by textual queries. However, widely adopted
naive fine-tuning strategies concentrate exclusively on text-image consistency
for categories contained in training, which leads to limited generalizability
for unseen categories. In this work, we propose a plug-and-play Semantic-Driven
Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the
training set to unseen categories with minimal overhead in parameters and
inference time. First, we introduce a two-stage visual prompt learning strategy
composed of Category-Specific Prompt Initialization (CSPI) and Topology-Guided
Prompt Refinement (TGPR). The CSPI generates category-specific visual prompts,
and then TGPR distills latent structural patterns from the VLM's text encoder
to refine these prompts. During inference, we dynamically synthesize the visual
prompts for unseen categories based on the semantic correlation between unseen
and training categories, facilitating robust text-image alignment for unseen
categories. Extensive experiments integrating SDVPT with all available
open-world object counting models demonstrate its effectiveness and
adaptability across three widely used datasets: FSC-147, CARPK, and PUCPR+.
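One way to read the inference-time prompt synthesis described above is as a similarity-weighted combination of the learned training-category prompts. The sketch below uses a plain softmax over text-embedding cosine similarities; the weighting scheme, temperature, and tensor shapes are assumptions rather than the exact SDVPT formulation.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def synthesize_unseen_prompt(unseen_text_emb, train_text_embs, train_prompts, temperature=0.1):
    """Build a visual prompt for an unseen category at inference time.

    unseen_text_emb: (D,)      text embedding of the unseen category name
    train_text_embs: (C, D)    text embeddings of the training categories
    train_prompts:   (C, L, P) learned visual prompts, one per training category

    The unseen prompt is a convex combination of training prompts, weighted by
    the semantic similarity between the unseen and training category names.
    """
    sims = F.cosine_similarity(unseen_text_emb.unsqueeze(0), train_text_embs, dim=-1)  # (C,)
    weights = F.softmax(sims / temperature, dim=0)                                     # (C,)
    return torch.einsum("c,clp->lp", weights, train_prompts)                           # (L, P)
```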
☆ A Spatially-Aware Multiple Instance Learning Framework for Digital Pathology
Multiple instance learning (MIL) is a promising approach for weakly
supervised classification in pathology using whole slide images (WSIs).
However, conventional MIL methods such as Attention-Based Deep Multiple
Instance Learning (ABMIL) typically disregard spatial interactions among
patches that are crucial to pathological diagnosis. Recent advancements, such
as Transformer-based MIL (TransMIL), have incorporated spatial context and
inter-patch relationships. However, it remains unclear whether explicitly
modeling patch relationships yields similar performance gains in ABMIL, which
relies solely on Multi-Layer Perceptrons (MLPs). In contrast, TransMIL employs
Transformer-based layers, introducing a fundamental architectural shift at the
cost of substantially increased computational complexity. In this work, we
enhance the ABMIL framework by integrating interaction-aware representations to
address this question. Our proposed model, Global ABMIL (GABMIL), explicitly
captures inter-instance dependencies while preserving computational efficiency.
Experimental results on two publicly available datasets for tumor subtyping in
breast and lung cancers demonstrate that GABMIL achieves up to a 7 percentage
point improvement in AUPRC and a 5 percentage point increase in the Kappa score
over ABMIL, with minimal or no additional computational overhead. These
findings underscore the importance of incorporating patch interactions within
MIL frameworks.
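Since GABMIL builds on ABMIL's MLP-only attention pooling, a compact reference implementation of that baseline's gated attention over patch embeddings may help ground the comparison; the interaction-aware extension proposed in the paper is not reproduced here, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn


class GatedAttentionMIL(nn.Module):
    """Gated attention pooling over a bag of patch embeddings (ABMIL-style baseline)."""

    def __init__(self, in_dim=1024, hidden_dim=256, num_classes=2):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(in_dim, num_classes)

    def forward(self, bag):                                     # bag: (N, in_dim) patches of one slide
        a = self.attn_w(self.attn_v(bag) * self.attn_u(bag))    # (N, 1) attention logits
        a = torch.softmax(a, dim=0)                             # normalize over patches
        slide_embedding = (a * bag).sum(dim=0)                  # (in_dim,) attention-weighted pooling
        return self.classifier(slide_embedding), a.squeeze(-1)


# Usage:
if __name__ == "__main__":
    model = GatedAttentionMIL()
    logits, attention = model(torch.randn(500, 1024))           # 500 patches from one WSI
    print(logits.shape, attention.shape)
```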
☆ Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset
Oussema Dhaouadi, Johannes Meier, Luca Wahl, Jacques Kaiser, Luca Scalerandi, Nick Wandelburg, Zhuolun Zhou, Nijanthan Berinpanathan, Holger Banzhaf, Daniel Cremers
Accurate 3D trajectory data is crucial for advancing autonomous driving. Yet,
traditional datasets are usually captured by fixed sensors mounted on a car and
are susceptible to occlusion. Additionally, such an approach can precisely
reconstruct the dynamic environment in the close vicinity of the measurement
vehicle only, while neglecting objects that are further away. In this paper, we
introduce the DeepScenario Open 3D Dataset (DSC3D), a high-quality,
occlusion-free dataset of 6 degrees of freedom bounding box trajectories
acquired through a novel monocular camera drone tracking pipeline. Our dataset
includes more than 175,000 trajectories of 14 types of traffic participants and
significantly exceeds existing datasets in terms of diversity and scale,
containing many unprecedented scenarios such as complex vehicle-pedestrian
interaction on highly populated urban streets and comprehensive parking
maneuvers from entry to exit. The DSC3D dataset was captured at five locations
in Europe and the United States: a parking lot, a crowded inner city, a steep
urban intersection, a federal highway, and a suburban intersection. Our 3D
trajectory dataset aims to enhance autonomous driving
systems by providing detailed environmental 3D representations, which could
lead to improved obstacle interactions and safety. We demonstrate its utility
across multiple applications including motion prediction, motion planning,
scenario mining, and generative reactive traffic agents. Our interactive online
visualization platform and the complete dataset are publicly available at
app.deepscenario.com, facilitating research in motion prediction, behavior
modeling, and safety validation.
☆ TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
Soccer is a globally popular sporting event, typically characterized by long
matches and distinctive highlight moments. While recent advances in Multimodal
Large Language Models (MLLMs) offer promising capabilities in temporal grounding
and video understanding, soccer commentary generation requires precise temporal
localization and semantically rich descriptions over long-form video. Existing
soccer MLLMs often rely on temporal priors for caption generation and therefore
cannot process soccer video end-to-end, while traditional two-step approaches
are complex, fail to capture global context, and achieve suboptimal performance.
To solve the
above issues, we present TimeSoccer, the first end-to-end soccer MLLM for
Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos.
TimeSoccer jointly predicts timestamps and generates captions in a single pass,
enabling global context modeling across 45-minute matches. To support long
video understanding of soccer matches, we introduce MoFA-Select, a
training-free, motion-aware frame compression module that adaptively selects
representative frames via a coarse-to-fine strategy, and incorporates
complementary training paradigms to strengthen the model's ability to handle
long temporal sequences. Extensive experiments demonstrate that our TimeSoccer
achieves State-of-The-Art (SoTA) performance on the SDVC task in an end-to-end
form, generating high-quality commentary with accurate temporal alignment and
strong semantic relevance.
☆ I-INR: Iterative Implicit Neural Representations
Ali Haider, Muhammad Salman Ali, Maryam Qamar, Tahir Khalil, Soo Ye Kim, Jihyong Oh, Enzo Tartaglione, Sung-Ho Bae
Implicit Neural Representations (INRs) have revolutionized signal processing
and computer vision by modeling signals as continuous, differentiable functions
parameterized by neural networks. However, their inherent formulation as a
regression problem makes them prone to regression to the mean, limiting their
ability to capture fine details, retain high-frequency information, and handle
noise effectively. To address these challenges, we propose Iterative Implicit
Neural Representations (I-INRs), a novel plug-and-play framework that enhances
signal reconstruction through an iterative refinement process. I-INRs
effectively recover high-frequency details, improve robustness to noise, and
achieve superior reconstruction quality. Our framework seamlessly integrates
with existing INR architectures, delivering substantial performance gains
across various tasks. Extensive experiments show that I-INRs outperform
baseline methods, including WIRE, SIREN, and Gauss, in diverse computer vision
applications such as image restoration, image denoising, and object occupancy
prediction.
☆ M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction
Chengguang Gan, Sunbowen Lee, Zhixi Cai, Yanbin Wei, Lei Zheng, Yunhao Liang, Shiwen Ni, Tatsunori Mori
Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection
of information extraction and model interpretability. MRE aims to leverage the
mutual understanding between tasks of different granularities, enhancing the
performance of both coarse-grained and fine-grained tasks through joint
modeling. While MRE has been explored and validated in the textual domain, its
applicability to visual and multimodal domains remains unexplored. In this
work, we extend MRE to the multimodal information extraction domain for the
first time. Specifically, we introduce a new task: Multimodal Mutual
Reinforcement Effect (M-MRE), and construct a corresponding dataset to support
this task. To address the challenges posed by M-MRE, we further propose a
Prompt Format Adapter (PFA) that is fully compatible with various Large
Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can
also be observed in the M-MRE task, a multimodal text-image understanding
scenario. This provides strong evidence that MRE facilitates mutual gains
across three interrelated tasks, confirming its generalizability beyond the
textual domain.
☆ DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition
Personalized image generation has emerged as a promising direction in
multimodal content creation. It aims to synthesize images tailored to
individual style preferences (e.g., color schemes, character appearances,
layout) and semantic intentions (e.g., emotion, action, scene contexts) by
leveraging user-interacted history images and multimodal instructions. Despite
notable progress, existing methods -- whether based on diffusion models, large
language models, or Large Multimodal Models (LMMs) -- struggle to accurately
capture and fuse user style preferences and semantic intentions. In particular,
the state-of-the-art LMM-based method suffers from the entanglement of visual
features, leading to Guidance Collapse, where the generated images fail to
preserve user-preferred styles or reflect the specified semantics.
To address these limitations, we introduce DRC, a novel personalized image
generation framework that enhances LMMs through Disentangled Representation
Composition. DRC explicitly extracts user style preferences and semantic
intentions from history images and the reference image, respectively, to form
user-specific latent instructions that guide image generation within LMMs.
Specifically, it involves two critical learning stages: 1) Disentanglement
learning, which employs a dual-tower disentangler to explicitly separate style
and semantic features, optimized via a reconstruction-driven paradigm with
difficulty-aware importance sampling; and 2) Personalized modeling, which
applies semantic-preserving augmentations to effectively adapt the disentangled
representations for robust personalized generation. Extensive experiments on
two benchmarks demonstrate that DRC shows competitive performance while
effectively mitigating the guidance collapse issue, underscoring the importance
of disentangled representation learning for controllable and effective
personalized image generation.
☆ TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun
The rapid growth of online video platforms, particularly live streaming
services, has created an urgent need for real-time video understanding systems.
These systems must process continuous video streams and respond to user queries
instantaneously, presenting unique challenges for current Video Large Language
Models (VideoLLMs). While existing VideoLLMs excel at processing complete
videos, they face significant limitations in streaming scenarios due to their
inability to handle dense, redundant frames efficiently. We introduce
TimeChat-Online, a novel online VideoLLM that revolutionizes real-time video
interaction. At its core lies our innovative Differential Token Drop (DTD)
module, which addresses the fundamental challenge of visual redundancy in
streaming videos. Drawing inspiration from human visual perception's Change
Blindness phenomenon, DTD preserves meaningful temporal changes while filtering
out static, redundant content between frames. Remarkably, our experiments
demonstrate that DTD achieves an 82.8% reduction in video tokens while
maintaining 98% performance on StreamingBench, revealing that over 80% of
visual content in streaming videos is naturally redundant without requiring
language guidance. To enable seamless real-time interaction, we present
TimeChat-Online-139K, a comprehensive streaming video dataset featuring diverse
interaction patterns including backward-tracing, current-perception, and
future-responding scenarios. TimeChat-Online's unique Proactive Response
capability, naturally achieved through continuous monitoring of video scene
transitions via DTD, sets it apart from conventional approaches. Our extensive
evaluation demonstrates TimeChat-Online's superior performance on streaming
benchmarks (StreamingBench and OvOBench) while maintaining competitive results on
long-form video tasks such as Video-MME and MLVU.
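The Differential Token Drop idea above can be illustrated in a few lines: compare each patch token to its counterpart in the previous frame and keep only those that changed beyond a threshold. The cosine-distance score and the threshold `tau` below are assumptions; the released TimeChat-Online code should be consulted for the exact rule and positional handling.

```python
import torch


def differential_token_drop(curr_tokens, prev_tokens, tau=0.1):
    """Keep only tokens that changed meaningfully since the previous frame.

    curr_tokens, prev_tokens: (N, D) patch tokens at the same spatial positions
    tau: drop threshold on the per-token change score (an assumed hyperparameter)

    Returns the kept tokens and their positional indices, so the language model
    can still recover where in the frame they came from.
    """
    diff = 1.0 - torch.nn.functional.cosine_similarity(curr_tokens, prev_tokens, dim=-1)  # (N,)
    keep = diff > tau
    return curr_tokens[keep], keep.nonzero(as_tuple=True)[0]


# Toy usage: nearly static frames keep almost nothing, changed frames keep more.
if __name__ == "__main__":
    prev = torch.randn(196, 768)
    curr = prev.clone()
    curr[:20] += torch.randn(20, 768)                    # simulate change in 20 patches
    kept, idx = differential_token_drop(curr, prev)
    print(f"kept {kept.shape[0]} / 196 tokens at positions {idx[:5].tolist()} ...")
```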
☆ DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model
Zhanglin Wu, Tengfei Song, Ning Xie, Weidong Zhang, Pengfei Li, Shuang Wu, Chong Li, Junhao Zhu, Hao Yang
This paper presents the technical solution proposed by Huawei Translation
Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation
for Complex Layouts" competition at the 19th International Conference on
Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging
a state-of-the-art open-source large vision-language model (LVLM), we introduce a
training framework that combines multi-task learning with perceptual
chain-of-thought to develop a comprehensive end-to-end document translation
system. During the inference phase, we apply minimum Bayesian decoding and
post-processing strategies to further enhance the system's translation
capabilities. Our solution uniquely addresses both OCR-based and OCR-free
document image translation tasks within a unified framework. This paper
systematically details the training methods, inference strategies, LVLM base
models, training data, experimental setups, and results, demonstrating an
effective approach to document image machine translation.
comment: 7 pages, 1 figure, 2 tables
☆ Class-Conditional Distribution Balancing for Group Robust Classification
Spurious correlations that lead models to correct predictions for the wrong
reasons pose a critical challenge for robust real-world generalization.
Existing research attributes this issue to group imbalance and addresses it by
maximizing group-balanced or worst-group accuracy, which heavily relies on
expensive bias annotations. A compromise approach involves predicting bias
information using extensively pretrained foundation models, which requires
large-scale data and becomes impractical for resource-limited rare domains. To
address these challenges, we offer a novel perspective by reframing the
spurious correlations as imbalances or mismatches in class-conditional
distributions, and propose a simple yet effective robust learning method that
eliminates the need for both bias annotations and predictions. With the goal of
reducing the mutual information between spurious factors and label information,
our method leverages a sample reweighting strategy to achieve class-conditional
distribution balancing, which automatically highlights minority groups and
classes, effectively dismantling spurious correlations and producing a debiased
data distribution for classification. Extensive experiments and analysis
demonstrate that our approach consistently delivers state-of-the-art
performance, rivaling methods that rely on bias supervision.
☆ Advanced Segmentation of Diabetic Retinopathy Lesions Using DeepLabv3+ IEEE
To improve the segmentation of diabetic retinopathy lesions (microaneurysms,
hemorrhages, exudates, and soft exudates), we implemented a binary segmentation
method specific to each type of lesion. As post-segmentation, we combined the
individual model outputs into a single image to better analyze the lesion
types. This approach facilitated parameter optimization and improved accuracy,
effectively overcoming challenges related to dataset limitations and annotation
complexity. Specific preprocessing steps included cropping and applying
contrast-limited adaptive histogram equalization to the L channel of the LAB
image. Additionally, we employed targeted data augmentation techniques to
further refine the model's efficacy. Our methodology utilized the DeepLabv3+
model, achieving a segmentation accuracy of 99%. These findings highlight the
efficacy of innovative strategies in advancing medical image analysis,
particularly in the precise segmentation of diabetic retinopathy lesions. The
IDRID dataset was utilized to validate and demonstrate the robustness of our
approach.
comment: This work was accepted at the ACS/IEEE International Conference on
Computer Systems and Applications (AICCSA) 2024
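The preprocessing described above (CLAHE on the L channel of the LAB image) is a standard OpenCV operation; a minimal sketch follows. The clip limit and tile size are typical defaults, not values reported by the authors.

```python
import cv2
import numpy as np


def clahe_on_l_channel(bgr_image: np.ndarray, clip_limit: float = 2.0,
                       tile_grid_size=(8, 8)) -> np.ndarray:
    """Apply contrast-limited adaptive histogram equalization to the L channel only.

    The image is converted BGR -> LAB, CLAHE is run on L (so colour is preserved),
    and the result is converted back to BGR. Clip limit and tile size are common
    defaults, not values taken from the paper.
    """
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
```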
☆ EdgePoint2: Compact Descriptors for Superior Efficiency and Accuracy
The field of keypoint extraction, which is essential for vision applications
like Structure from Motion (SfM) and Simultaneous Localization and Mapping
(SLAM), has evolved from relying on handcrafted methods to leveraging deep
learning techniques. While deep learning approaches have significantly improved
performance, they often incur substantial computational costs, limiting their
deployment in real-time edge applications. Efforts to create lightweight neural
networks have seen some success, yet they often result in trade-offs between
efficiency and accuracy. Additionally, the high-dimensional descriptors
generated by these networks pose challenges for distributed applications
requiring efficient communication and coordination, highlighting the need for
compact yet competitively accurate descriptors. In this paper, we present
EdgePoint2, a series of lightweight keypoint detection and description neural
networks specifically tailored for edge computing applications on embedded
systems. The network architecture is optimized for efficiency without
sacrificing accuracy. To train compact descriptors, we introduce a combination
of Orthogonal Procrustes loss and similarity loss, which can serve as a general
approach for hypersphere embedding distillation tasks. Additionally, we offer
14 sub-models to satisfy diverse application requirements. Our experiments
demonstrate that EdgePoint2 consistently achieves state-of-the-art (SOTA)
accuracy and efficiency across various challenging scenarios while employing
lower-dimensional descriptors (32/48/64). Beyond its accuracy, EdgePoint2
offers significant advantages in flexibility, robustness, and versatility.
Consequently, EdgePoint2 emerges as a highly competitive option for visual
tasks, especially in contexts demanding adaptability to diverse computational
and communication constraints.
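The orthogonal Procrustes problem named above has a closed-form SVD solution, so a distillation loss that first aligns student descriptors to the teacher by the optimal orthogonal map can be sketched briefly. This is a generic hypersphere-distillation illustration; the combination with the similarity loss and the other EdgePoint2 training details follow the paper.

```python
import torch


def orthogonal_procrustes_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Align student descriptors to the teacher with the best orthogonal map, then
    penalize the residual.

    student: (N, d) L2-normalized student descriptors (pad to the teacher dim first)
    teacher: (N, d) L2-normalized teacher descriptors

    The orthogonal R minimizing ||student @ R - teacher||_F is U @ Vh, where
    U S Vh = SVD(student^T @ teacher).
    """
    m = student.transpose(0, 1) @ teacher        # (d, d) cross-covariance
    u, _, vh = torch.linalg.svd(m)
    r = (u @ vh).detach()                        # closed-form alignment, no grad through the SVD
    return ((student @ r - teacher) ** 2).sum(dim=1).mean()


# Usage with random unit vectors (student and teacher dims equal here for simplicity):
if __name__ == "__main__":
    raw = torch.randn(128, 64, requires_grad=True)
    s = torch.nn.functional.normalize(raw, dim=1)
    t = torch.nn.functional.normalize(torch.randn(128, 64), dim=1)
    orthogonal_procrustes_loss(s, t).backward()
    print(raw.grad.shape)
```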
☆ Towards Generalized and Training-Free Text-Guided Semantic Manipulation
Text-guided semantic manipulation refers to semantically editing an image
generated from a source prompt to match a target prompt, enabling the desired
semantic changes (e.g., addition, removal, and style transfer) while preserving
irrelevant contents. With the powerful generative capabilities of the diffusion
model, the task has shown the potential to generate high-fidelity visual
content. Nevertheless, existing methods either typically require time-consuming
fine-tuning (inefficient), fail to accomplish multiple semantic manipulations
(poorly extensible), and/or lack support for different modality tasks (limited
generalizability). Upon further investigation, we find that the geometric
properties of noises in the diffusion model are strongly correlated with the
semantic changes. Motivated by this, we propose $\textit{GTF}$, a novel framework
for text-guided semantic manipulation with the following attractive
capabilities: 1) $\textbf{Generalized}$: our $\textit{GTF}$ supports multiple
semantic manipulations (e.g., addition, removal, and style transfer) and can be
seamlessly integrated into all diffusion-based methods (i.e., Plug-and-play)
across different modalities (i.e., modality-agnostic); and 2)
$\textbf{Training-free}$: $\textit{GTF}$ produces high-fidelity results via
simply controlling the geometric relationship between noises without tuning or
optimization. Our extensive experiments demonstrate the efficacy of our
approach, highlighting its potential to advance the state-of-the-art in
semantic manipulation.
☆ Precision Neural Network Quantization via Learnable Adaptive Modules
Wenqiang Zhou, Zhendong Yu, Xinyu Liu, Jiaming Yang, Rong Xiao, Tao Wang, Chenwei Tang, Jiancheng Lv
Quantization Aware Training (QAT) is a neural network quantization technique
that compresses model size and improves operational efficiency while
effectively maintaining model performance. The paradigm of QAT is to introduce
fake quantization operators during the training process, allowing the model to
autonomously compensate for information loss caused by quantization. Making
quantization parameters trainable can significantly improve the performance of
QAT, but at the cost of compromising the flexibility during inference,
especially when dealing with activation values with substantially different
distributions. In this paper, we propose an effective learnable adaptive neural
network quantization method, called Adaptive Step Size Quantization (ASQ), to
resolve this conflict. Specifically, the proposed ASQ method first dynamically
adjusts quantization scaling factors through a trained module capable of
accommodating different activations. Then, to address the rigid resolution
issue inherent in Power of Two (POT) quantization, we propose an efficient
non-uniform quantization scheme. We utilize the Power Of Square root of Two
(POST) as the basis for exponential quantization, effectively handling the
bell-shaped distribution of neural network weights across various bit-widths
while maintaining computational efficiency through a Look-Up Table method
(LUT). Extensive experimental results demonstrate that the proposed ASQ method
is superior to state-of-the-art QAT approaches. Notably, ASQ is even
competitive with full-precision baselines, with its 4-bit quantized ResNet34
model improving accuracy by 1.2\% on ImageNet.
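Power-of-square-root-of-two (POST) levels can be illustrated with a short look-up-table quantizer: weights are snapped to the nearest signed (sqrt 2)^k level. The exponent range, zero code, and rounding rule below are illustrative assumptions, not the exact ASQ configuration.

```python
import torch


def post_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Quantize weights to signed power-of-sqrt(2) levels via a small look-up table.

    Levels are +/- (sqrt(2))^k for k in a fixed range, plus zero, which packs the
    grid more densely than power-of-two levels and better matches bell-shaped
    weight distributions. The exponent range and rounding rule are assumptions.
    """
    num_levels = 2 ** (bits - 1) - 1                        # reserve one code for zero
    sqrt2 = 2.0 ** 0.5
    max_abs = w.abs().max().clamp(min=1e-8)
    # Build positive levels so the largest level matches the weight range.
    exponents = torch.arange(num_levels, dtype=w.dtype, device=w.device)
    levels = max_abs * sqrt2 ** (exponents - (num_levels - 1))          # (num_levels,)
    levels = torch.cat([torch.zeros(1, dtype=w.dtype, device=w.device), levels])

    # LUT lookup: nearest level to |w|, then restore the sign.
    dist = (w.abs().unsqueeze(-1) - levels).abs()           # (..., num_levels + 1)
    idx = dist.argmin(dim=-1)
    return torch.sign(w) * levels[idx]


# Usage:
if __name__ == "__main__":
    weights = torch.randn(64, 64) * 0.05
    print(post_quantize(weights, bits=4).unique().numel(), "distinct quantized values")
```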
☆ Group Downsampling with Equivariant Anti-aliasing
Downsampling layers are crucial building blocks in CNN architectures, which
help to increase the receptive field for learning high-level features and
reduce the amount of memory/computation in the model. In this work, we study
the generalization of the uniform downsampling layer for group equivariant
architectures, e.g., G-CNNs. That is, we aim to downsample signals (feature
maps) on general finite groups with anti-aliasing. This involves the following:
(a) Given a finite group and a downsampling rate, we present an algorithm to
form a suitable choice of subgroup. (b) Given a group and a subgroup, we study
the notion of bandlimited-ness and propose how to perform anti-aliasing.
Notably, our method generalizes the notion of downsampling based on classical
sampling theory. When the signal is on a cyclic group, i.e., periodic, our
method recovers the standard downsampling of an ideal low-pass filter followed
by a subsampling operation. Finally, we conducted experiments on image
classification tasks demonstrating that the proposed downsampling operation
improves accuracy, better preserves equivariance, and reduces model size when
incorporated into G-equivariant networks.
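For the cyclic-group special case mentioned above, anti-aliased downsampling is exactly an ideal low-pass filter followed by subsampling, which is easy to write down in NumPy; the general finite-group construction of the paper is not reproduced here.

```python
import numpy as np


def antialiased_cyclic_downsample(x: np.ndarray, rate: int) -> np.ndarray:
    """Downsample a real periodic 1D signal by `rate` with an ideal low-pass filter.

    Frequencies at or above the new Nyquist band are zeroed in the DFT domain,
    then every `rate`-th sample is kept. This is the classical-sampling special
    case recovered when the group is cyclic.
    """
    n = len(x)
    assert n % rate == 0, "signal length must be divisible by the downsampling rate"
    freqs = np.fft.fftfreq(n)                        # normalized frequencies in cycles/sample
    spectrum = np.fft.fft(x)
    spectrum[np.abs(freqs) >= 0.5 / rate] = 0.0      # ideal low-pass below the new Nyquist
    filtered = np.fft.ifft(spectrum).real            # x is assumed real-valued
    return filtered[::rate]


# Usage: the high-frequency component is removed before subsampling, avoiding aliasing.
if __name__ == "__main__":
    t = np.arange(64)
    x = np.sin(2 * np.pi * 3 * t / 64) + 0.5 * np.sin(2 * np.pi * 30 * t / 64)
    print(antialiased_cyclic_downsample(x, rate=4).shape)    # (16,)
```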
☆ DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks IEEE
Diffusion models have shown remarkable progress in various generative tasks
such as image and video generation. This paper studies the problem of
leveraging pretrained diffusion models for performing discriminative tasks.
Specifically, we extend the discriminative capability of pretrained frozen
generative diffusion models from the classification task to the more complex
object detection task, by "inverting" a pretrained layout-to-image diffusion
model. To this end, we propose a gradient-based discrete optimization approach
that replaces the heavy prediction enumeration process, and a prior
distribution model that enables more accurate use of Bayes' rule. Empirical
results show that this method is on par with basic discriminative object
detection baselines on the COCO dataset. In addition, our
method can greatly speed up the previous diffusion-based method for
classification without sacrificing accuracy. Code and models are available at
https://github.com/LiYinqi/DIVE .
comment: Accepted by IEEE Transactions on Multimedia
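The Bayes'-rule use of a conditional diffusion model for discrimination can be sketched generically: compare the expected conditional denoising error across candidate labels and add a log-prior. The snippet below follows the standard diffusion-classifier argument; `eps_model`, its signature, and the Monte Carlo sample count are assumptions, and DIVE's detection-specific discrete optimization is not shown.

```python
import torch


@torch.no_grad()
def classify_with_diffusion(x0, labels, eps_model, alphas_cumprod, log_prior=None, n_samples=16):
    """Pick argmax_y p(y|x) ~ p(y) p(x|y) with a pretrained conditional diffusion model.

    log p(x|y) is approximated, up to a constant shared across labels, by the
    negative expected denoising error under conditioning y. eps_model(x_t, t, y)
    is a hypothetical noise predictor; `labels` is the candidate label set.

    x0:             (1, C, H, W) input in the model's pixel/latent space
    alphas_cumprod: (T,) cumulative noise schedule of the pretrained model
    log_prior:      optional (num_labels,) log-prior over labels
    """
    T = alphas_cumprod.shape[0]
    scores = []
    for y in labels:
        err = 0.0
        for _ in range(n_samples):
            t = torch.randint(0, T, (1,), device=x0.device)
            noise = torch.randn_like(x0)
            a_bar = alphas_cumprod[t].view(1, 1, 1, 1)
            x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
            err += ((eps_model(x_t, t, y) - noise) ** 2).mean()
        scores.append(-err / n_samples)                       # ~ log p(x|y) + const
    scores = torch.stack(scores)
    if log_prior is not None:
        scores = scores + log_prior                           # Bayes' rule: add log p(y)
    return labels[int(scores.argmax())]
```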
☆ Scene Perceived Image Perceptual Score (SPIPS): combining global and local perception for image quality assessment
The rapid advancement of artificial intelligence and widespread use of
smartphones have resulted in an exponential growth of image data, both real
(camera-captured) and virtual (AI-generated). This surge underscores the
critical need for robust image quality assessment (IQA) methods that accurately
reflect human visual perception. Traditional IQA techniques primarily rely on
spatial features - such as signal-to-noise ratio, local structural distortions,
and texture inconsistencies - to identify artifacts. While effective for
unprocessed or conventionally altered images, these methods fall short in the
context of modern image post-processing powered by deep neural networks (DNNs).
The rise of DNN-based models for image generation, enhancement, and restoration
has significantly improved visual quality, yet made accurate assessment
increasingly complex. To address this, we propose a novel IQA approach that
bridges the gap between deep learning methods and human perception. Our model
disentangles deep features into high-level semantic information and low-level
perceptual details, treating each stream separately. These features are then
combined with conventional IQA metrics to provide a more comprehensive
evaluation framework. This hybrid design enables the model to assess both
global context and intricate image details, better reflecting the human visual
process, which first interprets overall structure before attending to
fine-grained elements. The final stage employs a multilayer perceptron (MLP) to
map the integrated features into a concise quality score. Experimental results
demonstrate that our method achieves improved consistency with human perceptual
judgments compared to existing IQA models.
☆ Range Image-Based Implicit Neural Compression for LiDAR Point Clouds
This paper presents a novel scheme to efficiently compress Light Detection
and Ranging (LiDAR) point clouds, enabling high-precision 3D scene archives,
and such archives pave the way for a detailed understanding of the
corresponding 3D scenes. We focus on 2D range images (RIs) as a lightweight
format for representing 3D LiDAR observations. Although conventional image
compression techniques can be adapted to improve compression efficiency for
RIs, their practical performance is expected to be limited due to differences
in bit precision and the distinct pixel value distribution characteristics
between natural images and RIs. We propose a novel implicit neural
representation (INR)-based RI compression method that effectively handles
floating-point valued pixels. The proposed method divides RIs into depth and
mask images and compresses them using patch-wise and pixel-wise INR
architectures with model pruning and quantization, respectively. Experiments on
the KITTI dataset show that the proposed method outperforms existing image,
point cloud, RI, and INR-based compression methods in terms of 3D
reconstruction and detection quality at low bitrates and decoding latency.
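A pixel-wise INR of the kind mentioned above can be sketched as a small coordinate MLP with random Fourier features overfitted to one floating-point range image, whose weights then act as the compressed code. The architecture, feature count, and training loop below are assumptions; the paper's patch-wise variant, pruning, and quantization are omitted.

```python
import torch
import torch.nn as nn


class FourierINR(nn.Module):
    """A small coordinate MLP with random Fourier features for one range image."""

    def __init__(self, num_features=128, hidden=128, sigma=10.0):
        super().__init__()
        self.register_buffer("B", torch.randn(2, num_features) * sigma)   # random frequencies
        self.net = nn.Sequential(
            nn.Linear(2 * num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):                         # coords: (N, 2) in [0, 1]^2
        proj = 2 * torch.pi * coords @ self.B
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.net(feats).squeeze(-1)             # predicted depth per pixel


def fit_range_image(depth, steps=2000, lr=1e-3):
    """Overfit the INR to one (H, W) float range image; the weights are the code."""
    h, w = depth.shape
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)
    target = depth.reshape(-1)
    model = FourierINR()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return model
```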
☆ Visual and textual prompts for enhancing emotion recognition in video
Zhifeng Wang, Qixuan Zhang, Peter Zhang, Wenjia Niu, Kaihao Zhang, Ramesh Sankaranarayana, Sabrina Caldwell, Tom Gedeon
Vision Large Language Models (VLLMs) exhibit promising potential for
multi-modal understanding, yet their application to video-based emotion
recognition remains limited by insufficient spatial and contextual awareness.
Traditional approaches, which prioritize isolated facial features, often
neglect critical non-verbal cues such as body language, environmental context,
and social interactions, leading to reduced robustness in real-world scenarios.
To address this gap, we propose Set-of-Vision-Text Prompting (SoVTP), a novel
framework that enhances zero-shot emotion recognition by integrating spatial
annotations (e.g., bounding boxes, facial landmarks), physiological signals
(facial action units), and contextual cues (body posture, scene dynamics,
others' emotions) into a unified prompting strategy. SoVTP preserves holistic
scene information while enabling fine-grained analysis of facial muscle
movements and interpersonal dynamics. Extensive experiments show that SoVTP
achieves substantial improvements over existing visual prompting methods,
demonstrating its effectiveness in enhancing VLLMs' video emotion recognition
capabilities.
comment: 12 pages, 10 figures
☆ Towards Generalizable Deepfake Detection with Spatial-Frequency Collaborative Learning and Hierarchical Cross-Modal Fusion
The rapid evolution of deep generative models poses a critical challenge to
deepfake detection, as detectors trained on forgery-specific artifacts often
suffer significant performance degradation when encountering unseen forgeries.
While existing methods predominantly rely on spatial domain analysis, frequency
domain operations are primarily limited to feature-level augmentation, leaving
frequency-native artifacts and spatial-frequency interactions insufficiently
exploited. To address this limitation, we propose a novel detection framework
that integrates multi-scale spatial-frequency analysis for universal deepfake
detection. Our framework comprises three key components: (1) a local spectral
feature extraction pipeline that combines block-wise discrete cosine transform
with cascaded multi-scale convolutions to capture subtle spectral artifacts;
(2) a global spectral feature extraction pipeline utilizing scale-invariant
differential accumulation to identify holistic forgery distribution patterns;
and (3) a multi-stage cross-modal fusion mechanism that incorporates
shallow-layer attention enhancement and deep-layer dynamic modulation to model
spatial-frequency interactions. Extensive evaluations on widely adopted
benchmarks demonstrate that our method outperforms state-of-the-art deepfake
detection methods in both accuracy and generalizability.
☆ MCAF: Efficient Agent-based Video Understanding Framework through Multimodal Coarse-to-Fine Attention Focusing
Even in the era of rapid advances in large models, video understanding,
particularly long videos, remains highly challenging. Compared with textual or
image-based information, videos commonly contain more information with
redundancy, requiring large models to strategically allocate attention at a
global level for accurate comprehension. To address this, we propose MCAF, an
agent-based, training-free framework that performs video understanding through
Multimodal Coarse-to-fine Attention Focusing. The key innovation lies in its
ability to sense and prioritize segments of the video that are highly relevant
to the understanding task. First, MCAF hierarchically concentrates on highly
relevant frames through multimodal information, enhancing the correlation
between the acquired contextual information and the query. Second, it employs a
dilated temporal expansion mechanism to mitigate the risk of missing crucial
details when extracting information from these concentrated frames. In
addition, our framework incorporates a self-reflection mechanism utilizing the
confidence level of the model's responses as feedback. By iteratively applying
these two creative focusing strategies, it adaptively adjusts attention to
capture highly query-connected context and thus improves response accuracy.
MCAF outperforms comparable state-of-the-art methods on average. On the
EgoSchema dataset, it achieves a remarkable 5% performance gain over the
leading approach. Meanwhile, on Next-QA and IntentQA datasets, it outperforms
the current state-of-the-art standard by 0.2% and 0.3% respectively. On the
Video-MME dataset, which features videos averaging nearly an hour in length,
MCAF also outperforms other agent-based methods.
☆ Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
We present a framework for perspective-aware reasoning in vision-language
models (VLMs) through mental imagery simulation. Perspective-taking, the
ability to perceive an environment or situation from an alternative viewpoint,
is a key benchmark for human-level visual understanding, essential for
environmental interaction and collaboration with autonomous agents. Despite
advancements in spatial reasoning within VLMs, recent research has shown that
modern VLMs significantly lack perspective-aware reasoning capabilities and
exhibit a strong bias toward egocentric interpretations. To bridge the gap
between VLMs and human perception, we focus on the role of mental imagery,
where humans perceive the world through abstracted representations that
facilitate perspective shifts. Motivated by this, we propose a framework for
perspective-aware reasoning, named Abstract Perspective Change (APC), that
effectively leverages vision foundation models, such as object detection,
segmentation, and orientation estimation, to construct scene abstractions and
enable perspective transformations. Our experiments on synthetic and real-image
benchmarks, compared with various VLMs, demonstrate significant improvements in
perspective-aware reasoning with our framework, further outperforming
fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.
comment: Project Page: https://apc-vlm.github.io/
☆ We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback
Current text-to-video (T2V) generation models are increasingly popular due to
their ability to produce coherent videos from textual prompts. However, these
models often struggle to generate semantically and temporally consistent videos
when dealing with longer, more complex prompts involving multiple objects or
sequential events. Additionally, the high computational costs associated with
training or fine-tuning make direct improvements impractical. To overcome these
limitations, we introduce \(\projectname\), a novel zero-training video
refinement pipeline that leverages neuro-symbolic feedback to automatically
enhance video generation, achieving superior alignment with the prompts. Our
approach first derives the neuro-symbolic feedback by analyzing a formal video
representation and pinpoints semantically inconsistent events, objects, and
their corresponding frames. This feedback then guides targeted edits to the
original video. Extensive empirical evaluations on both open-source and
proprietary T2V models demonstrate that \(\projectname\) significantly enhances
temporal and logical alignment across diverse prompts by almost $40\%$.
☆ AUTHENTICATION: Identifying Rare Failure Modes in Autonomous Vehicle Perception Systems using Adversarially Guided Diffusion Models IEEE
Autonomous Vehicles (AVs) rely on artificial intelligence (AI) to accurately
detect objects and interpret their surroundings. However, even when trained
using millions of miles of real-world data, AVs are often unable to detect rare
failure modes (RFMs). The problem of RFMs is commonly referred to as the
"long-tail challenge", due to the distribution of data including many instances
that are very rarely seen. In this paper, we present a novel approach that
utilizes advanced generative and explainable AI techniques to aid in
understanding RFMs. Our methods can be used to enhance the robustness and
reliability of AVs when combined with both downstream model training and
testing. We extract segmentation masks for objects of interest (e.g., cars) and
invert them to create environmental masks. These masks, combined with carefully
crafted text prompts, are fed into a custom diffusion model. We leverage the
Stable Diffusion inpainting model guided by adversarial noise optimization to
generate images containing diverse environments designed to evade object
detection models and expose vulnerabilities in AI systems. Finally, we produce
natural language descriptions of the generated RFMs that can guide developers
and policymakers to improve the safety and reliability of AV systems.
comment: 8 pages, 10 figures. Accepted to IEEE Conference on Artificial
Intelligence (CAI), 2025
☆ A Genealogy of Multi-Sensor Foundation Models in Remote Sensing
Foundation models have garnered increasing attention for representation
learning in remote sensing, primarily adopting approaches that have
demonstrated success in computer vision with minimal domain-specific
modification. However, the development and application of foundation models in
this field are still burgeoning, as there are a variety of competing approaches
that each come with significant benefits and drawbacks. This paper examines
these approaches along with their roots in the computer vision field in order
to characterize potential advantages and pitfalls while outlining future
directions to further improve remote sensing-specific foundation models. We
discuss the quality of the learned representations and methods to alleviate the
need for massive compute resources. We place emphasis on the multi-sensor
aspect of Earth observations, and the extent to which existing approaches
leverage multiple sensors in training foundation models in relation to
multi-modal foundation models. Finally, we identify opportunities for further
harnessing the vast amounts of unlabeled, seasonal, and multi-sensor remote
sensing observations.
comment: 20 pages, submitted to ACM SigSpatial, currently under peer review
☆ PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition
Electroencephalography (EEG) signals provide a promising and involuntary
reflection of brain activity related to emotional states, offering significant
advantages over behavioral cues like facial expressions. However, EEG signals
are often noisy, affected by artifacts, and vary across individuals,
complicating emotion recognition. While multimodal approaches have used
Peripheral Physiological Signals (PPS) like GSR to complement EEG, they often
overlook the dynamic synchronization and consistent semantics between the
modalities. Additionally, the temporal dynamics of emotional fluctuations
across different time resolutions in PPS remain underexplored. To address these
challenges, we propose PhysioSync, a novel pre-training framework leveraging
temporal and cross-modal contrastive learning, inspired by physiological
synchronization phenomena. PhysioSync incorporates Cross-Modal Consistency
Alignment (CM-CA) to model dynamic relationships between EEG and complementary
PPS, enabling emotion-related synchronizations across modalities. Besides, it
introduces Long- and Short-Term Temporal Contrastive Learning (LS-TCL) to
capture emotional synchronization at different temporal resolutions within
modalities. After pre-training, cross-resolution and cross-modal features are
hierarchically fused and fine-tuned to enhance emotion recognition. Experiments
on DEAP and DREAMER datasets demonstrate PhysioSync's advanced performance
under uni-modal and cross-modal conditions, highlighting its effectiveness for
EEG-centered emotion recognition.
comment: The source code will be publicly available at
https://github.com/MSA-LMC/PhysioSync
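Cross-modal consistency alignment between EEG and peripheral signals can be grounded with a generic symmetric InfoNCE loss over time-aligned windows, sketched below. This is only an illustration of the alignment idea; the exact CM-CA and LS-TCL objectives are defined in the paper and its code.

```python
import torch
import torch.nn.functional as F


def symmetric_infonce(eeg_emb, pps_emb, temperature=0.07):
    """Contrast time-aligned EEG and peripheral-signal (PPS) embeddings.

    eeg_emb, pps_emb: (B, D) embeddings of the same B time windows from the two
    modalities; matched windows are positives, all other pairs in the batch are
    negatives. A generic cross-modal InfoNCE, used here only for illustration.
    """
    eeg = F.normalize(eeg_emb, dim=-1)
    pps = F.normalize(pps_emb, dim=-1)
    logits = eeg @ pps.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(eeg.size(0), device=eeg.device)
    loss_e2p = F.cross_entropy(logits, targets)        # EEG -> PPS direction
    loss_p2e = F.cross_entropy(logits.t(), targets)    # PPS -> EEG direction
    return 0.5 * (loss_e2p + loss_p2e)
```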
☆ A Comprehensive Review on RNA Subcellular Localization Prediction
The subcellular localization of RNAs, including long non-coding RNAs
(lncRNAs), messenger RNAs (mRNAs), microRNAs (miRNAs) and other smaller RNAs,
plays a critical role in determining their biological functions. For instance,
lncRNAs are predominantly associated with chromatin and act as regulators of
gene transcription and chromatin structure, while mRNAs are distributed across
the nucleus and cytoplasm, facilitating the transport of genetic information
for protein synthesis. Understanding RNA localization sheds light on processes
like gene expression regulation with spatial and temporal precision. However,
traditional wet lab methods for determining RNA localization, such as in situ
hybridization, are often time-consuming, resource-demanding, and costly. To
overcome these challenges, computational methods leveraging artificial
intelligence (AI) and machine learning (ML) have emerged as powerful
alternatives, enabling large-scale prediction of RNA subcellular localization.
This paper provides a comprehensive review of the latest advancements in
AI-based approaches for RNA subcellular localization prediction, covering
various RNA types and focusing on sequence-based, image-based, and hybrid
methodologies that combine both data types. We highlight the potential of these
methods to accelerate RNA research, uncover molecular pathways, and guide
targeted disease treatments. Furthermore, we critically discuss the challenges
in AI/ML approaches for RNA subcellular localization, such as data scarcity and
lack of benchmarks, and opportunities to address them. This review aims to
serve as a valuable resource for researchers seeking to develop innovative
solutions in the field of RNA subcellular localization and beyond.
☆ OUI Need to Talk About Weight Decay: A New Perspective on Overfitting Detection
We introduce the Overfitting-Underfitting Indicator (OUI), a novel tool for
monitoring the training dynamics of Deep Neural Networks (DNNs) and identifying
optimal regularization hyperparameters. Specifically, we validate that OUI can
effectively guide the selection of the Weight Decay (WD) hyperparameter by
indicating whether a model is overfitting or underfitting during training
without requiring validation data. Through experiments on DenseNet-BC-100 with
CIFAR-100, EfficientNet-B0 with TinyImageNet, and ResNet-34 with ImageNet-1K,
we show that maintaining OUI within a prescribed interval correlates strongly
with improved generalization and validation scores. Notably, OUI converges
significantly faster than traditional metrics such as loss or accuracy,
enabling practitioners to identify optimal WD (hyperparameter) values within
the early stages of training. By leveraging OUI as a reliable indicator, we can
determine early in training whether the chosen WD value leads the model to
underfit the training data, overfit, or strike a well-balanced trade-off that
maximizes validation scores. This enables more precise WD tuning for optimal
performance on the tested datasets and DNNs. All code for reproducing these
experiments is available at https://github.com/AlbertoFdezHdez/OUI.
comment: 10 pages, 3 figures
♻ ☆ HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding CVPR 2025
Despite advancements in multimodal large language models (MLLMs), current
approaches struggle in medium-to-long video understanding due to frame and
context length limitations. As a result, these models often depend on frame
sampling, which risks missing key information over time and lacks task-specific
relevance. To address these challenges, we introduce HierarQ, a task-aware
hierarchical Q-Former based framework that sequentially processes frames to
bypass the need for frame sampling, while avoiding LLM's context length
limitations. We introduce a lightweight two-stream language-guided feature
modulator to incorporate task awareness in video understanding, with the entity
stream capturing frame-level object information within a short context and the
scene stream identifying their broader interactions over a longer period of
time. Each stream is supported by dedicated memory banks, which enable our
proposed Hierarchical Querying transformer (HierarQ) to effectively capture
short and
long-term context. Extensive evaluations on 10 video benchmarks across video
understanding, question answering, and captioning tasks demonstrate HierarQ's
state-of-the-art performance across most datasets, proving its robustness and
efficiency for comprehensive video analysis.
comment: Accepted in CVPR 2025
♻ ☆ DiffKillR: Killing and Recreating Diffeomorphisms for Cell Annotation in Dense Microscopy Images ICASSP 2025
Chen Liu, Danqi Liao, Alejandro Parada-Mayorga, Alejandro Ribeiro, Marcello DiStasio, Smita Krishnaswamy
The proliferation of digital microscopy images, driven by advances in
automated whole slide scanning, presents significant opportunities for
biomedical research and clinical diagnostics. However, accurately annotating
densely packed information in these images remains a major challenge. To
address this, we introduce DiffKillR, a novel framework that reframes cell
annotation as the combination of archetype matching and image registration
tasks. DiffKillR employs two complementary neural networks: one that learns a
diffeomorphism-invariant feature space for robust cell matching and another
that computes the precise warping field between cells for annotation mapping.
Using a small set of annotated archetypes, DiffKillR efficiently propagates
annotations across large microscopy images, reducing the need for extensive
manual labeling. More importantly, it is suitable for any type of pixel-level
annotation. We will discuss the theoretical properties of DiffKillR and
validate it on three microscopy tasks, demonstrating its advantages over
existing supervised, semi-supervised, and unsupervised methods. The code is
available at https://github.com/KrishnaswamyLab/DiffKillR.
comment: ICASSP 2025, Oral Presentation
♻ ☆ ImageFlowNet: Forecasting Multiscale Image-Level Trajectories of Disease Progression with Irregularly-Sampled Longitudinal Medical Images ICASSP 2025
Chen Liu, Ke Xu, Liangbo L. Shen, Guillaume Huguet, Zilong Wang, Alexander Tong, Danilo Bzdok, Jay Stewart, Jay C. Wang, Lucian V. Del Priore, Smita Krishnaswamy
Advances in medical imaging technologies have enabled the collection of
longitudinal images, which involve repeated scanning of the same patients over
time, to monitor disease progression. However, predictive modeling of such data
remains challenging due to high dimensionality, irregular sampling, and data
sparsity. To address these issues, we propose ImageFlowNet, a novel model
designed to forecast disease trajectories from initial images while preserving
spatial details. ImageFlowNet first learns multiscale joint representation
spaces across patients and time points, then optimizes deterministic or
stochastic flow fields within these spaces using a position-parameterized
neural ODE/SDE framework. The model leverages a UNet architecture to create
robust multiscale representations and mitigates data scarcity by combining
knowledge from all patients. We provide theoretical insights that support our
formulation of ODEs, and motivate our regularizations involving high-level
visual features, latent space organization, and trajectory smoothness. We
validate ImageFlowNet on three longitudinal medical image datasets depicting
progression in geographic atrophy, multiple sclerosis, and glioblastoma,
demonstrating its ability to effectively forecast disease progression and
outperform existing methods. Our contributions include the development of
ImageFlowNet, its theoretical underpinnings, and empirical validation on
real-world datasets. The official implementation is available at
https://github.com/KrishnaswamyLab/ImageFlowNet.
comment: ICASSP 2025, Oral Presentation
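The forecasting step described above (optimizing flow fields in a latent space with
a neural ODE/SDE) can be sketched with a learned vector field integrated by simple
Euler steps; the network, step count, and latent size below are illustrative
assumptions, not the ImageFlowNet implementation.

```python
# Minimal sketch of forecasting a latent trajectory with a learned ODE vector field
# (a stand-in for ImageFlowNet's neural ODE/SDE; modules and step size are illustrative).
import torch
import torch.nn as nn

class FlowField(nn.Module):
    """dz/dt = f(z, t): a small MLP over the latent concatenated with a time scalar."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, z, t):
        t_col = torch.full((z.shape[0], 1), float(t))
        return self.net(torch.cat([z, t_col], dim=1))

@torch.no_grad()
def forecast(field, z0, t0, t1, n_steps=20):
    """Euler integration of the latent from visit time t0 to target time t1."""
    z, dt = z0, (t1 - t0) / n_steps
    for i in range(n_steps):
        z = z + dt * field(z, t0 + i * dt)
    return z  # a decoder (e.g. the UNet decoder) would map this back to an image

field = FlowField(dim=128)
z_future = forecast(field, torch.randn(4, 128), t0=0.0, t1=1.0)
```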
♻ ☆ jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
Andreas Koukounas, Georgios Mastrapas, Sedigheh Eslami, Bo Wang, Mohammad Kalim Akram, Michael Günther, Isabelle Mohr, Saba Sturua, Nan Wang, Han Xiao
Contrastive Language-Image Pretraining (CLIP) has been widely used for
crossmodal information retrieval and multimodal understanding tasks. However,
CLIP models are mainly optimized for crossmodal vision-language tasks and
underperform in single-mode text tasks. Moreover, these models are often
trained on English datasets and therefore lack multilingual understanding.
Additionally, from a visual understanding perspective, previous CLIP-based
models exhibit insufficient understanding of visually rich documents. In this
work, we propose jina-clip-v2, a contrastive vision-language model trained on
text pairs, triplets and image-text pairs via a multi-task and multi-stage
contrastive learning paradigm in order to support both text-only and crossmodal
tasks. We employ a multilingual text encoder and expand the training dataset to
include multilingual texts from 29 non-English languages, including Hindi,
Chinese, German, French, and others, as well as images of visually rich
documents. We evaluate the model's performance and show that jina-clip-v2
achieves notable improvements over state-of-the-art CLIP-based models in
zero-shot text-only retrieval, semantic textual similarity, and crossmodal
retrieval tasks in both English and multilingual settings. jina-clip-v2 also
provides flexibility in embedding dimensionality, enabling users to select
the granularity of the representations. jina-clip-v2 is publicly available at
https://huggingface.co/jinaai/jina-clip-v2.
comment: 30 pages, 1-10 main paper, 10-12 refs, 12-30 benchmarks
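The "flexibility in embedding dimensionality" mentioned above is commonly exposed
by truncating embeddings to a shorter prefix and re-normalizing (Matryoshka-style
representations). The snippet below sketches that post-processing step under this
assumption; it is not taken from the jina-clip-v2 API.

```python
# Sketch: choosing a coarser embedding granularity by truncation + re-normalization.
# Assumes Matryoshka-style embeddings; not the documented jina-clip-v2 interface.
import numpy as np

def truncate_embeddings(emb, dim):
    """emb: (N, D) full embeddings; keep the first `dim` coordinates and re-normalize."""
    sub = emb[:, :dim]
    return sub / np.linalg.norm(sub, axis=1, keepdims=True)

full = np.random.randn(8, 1024).astype(np.float32)    # stand-in for model outputs
small = truncate_embeddings(full, dim=256)            # cheaper to index, slightly less precise
scores = small @ small.T                              # cosine similarities after normalization
```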
♻ ☆ DDU-Net: A Domain Decomposition-Based CNN for High-Resolution Image Segmentation on Multiple GPUs
The segmentation of ultra-high resolution images poses challenges such as
loss of spatial information or computational inefficiency. In this work, a
novel approach that combines encoder-decoder architectures with domain
decomposition strategies to address these challenges is proposed. Specifically,
a domain decomposition-based U-Net (DDU-Net) architecture is introduced, which
partitions input images into non-overlapping patches that can be processed
independently on separate devices. A communication network is added to
facilitate inter-patch information exchange to enhance the understanding of
spatial context. Experimental validation is performed on a synthetic dataset
that is designed to measure the effectiveness of the communication network.
Then, the performance is tested on the DeepGlobe land cover classification
dataset as a real-world benchmark data set. The results demonstrate that the
approach, which includes inter-patch communication for images divided into
$16\times16$ non-overlapping subimages, achieves a $2-3\,\%$ higher
intersection over union (IoU) score compared to the same network without
inter-patch communication. The performance of the network which includes
communication is equivalent to that of a baseline U-Net trained on the full
image, showing that our model provides an effective solution for segmenting
ultra-high-resolution images while preserving spatial context. The code is
available at https://github.com/corne00/DDU-Net.
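The partition-then-communicate pattern described above can be sketched as follows:
split the image into non-overlapping tiles, encode each tile independently (in
principle on separate devices), exchange pooled per-tile features through a small
communication module, then decode each tile with both local and exchanged context.
Modules and shapes are placeholders, not the released DDU-Net code.

```python
# Illustrative sketch of domain decomposition with inter-patch communication
# (layers and shapes are placeholders, not the DDU-Net release).
import torch
import torch.nn as nn

def split_into_patches(img, n=4):
    """img: (B, C, H, W) -> (B, n*n, C, H//n, W//n), non-overlapping tiles."""
    B, C, H, W = img.shape
    tiles = img.unfold(2, H // n, H // n).unfold(3, W // n, W // n)
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(B, n * n, C, H // n, W // n)

class PatchSegmenter(nn.Module):
    def __init__(self, in_ch=3, feat=16, n_classes=2):
        super().__init__()
        self.encoder = nn.Conv2d(in_ch, feat, 3, padding=1)            # per-patch encoder
        self.communicate = nn.Conv1d(feat, feat, 1)                    # mixes pooled patch features
        self.decoder = nn.Conv2d(2 * feat, n_classes, 3, padding=1)    # decodes patch + exchanged context

    def forward(self, img, n=4):
        tiles = split_into_patches(img, n)                             # (B, P, C, h, w)
        B, P = tiles.shape[:2]
        feats = self.encoder(tiles.flatten(0, 1))                      # (B*P, F, h, w)
        pooled = feats.mean(dim=(2, 3)).reshape(B, P, -1)              # (B, P, F) per-patch summaries
        ctx = self.communicate(pooled.transpose(1, 2)).transpose(1, 2) # inter-patch exchange
        ctx_maps = ctx.reshape(B * P, -1, 1, 1).expand_as(feats)
        return self.decoder(torch.cat([feats, ctx_maps], dim=1)).reshape(B, P, -1, *feats.shape[-2:])

logits = PatchSegmenter()(torch.randn(1, 3, 64, 64), n=4)              # (1, 16, 2, 16, 16)
```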
♻ ☆ Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yi Xin, Xinyue Li, Qi Qin, Yu Qiao, Hongsheng Li, Peng Gao
We present Lumina-mGPT, a family of multimodal autoregressive models capable
of various vision and language tasks, particularly excelling in generating
flexible photorealistic images from text descriptions. By initializing from
multimodal Generative PreTraining (mGPT), we demonstrate that a decoder-only
Autoregressive (AR) model can achieve image generation performance comparable
to modern diffusion models with high efficiency through Flexible Progressive
Supervised Fine-tuning (FP-SFT). Equipped with our proposed Unambiguous image
Representation (UniRep), Lumina-mGPT can flexibly generate high-quality images
of varying aspect ratios. Building on the strong image generation capabilities,
we further explore Ominiponent Supervised Fine-tuning (Omni-SFT), an initial
attempt to elevate Lumina-mGPT into a unified multi-modal generalist. The
resulting model demonstrates versatile multimodal capabilities, including
visual generation tasks like text-to-image/multiview generation and
controllable generation, visual recognition tasks like segmentation and depth
estimation, and vision-language tasks like multi-turn visual question
answering, showing the promising potential of this technical direction. Codes and
checkpoints are available at https://github.com/Alpha-VLLM/Lumina-mGPT.
comment: Code available at: https://github.com/Alpha-VLLM/Lumina-mGPT
♻ ☆ Weak-to-Strong Diffusion with Reflection
The goal of diffusion generative models is to align the learned distribution
with the real data distribution through gradient score matching. However,
inherent limitations in training data quality, modeling strategies, and
architectural design lead to an inevitable gap between generated outputs and real
data. To reduce this gap, we propose Weak-to-Strong Diffusion (W2SD), a novel
framework that utilizes the estimated difference between existing weak and
strong models (i.e., weak-to-strong difference) to bridge the gap between an
ideal model and a strong model. By employing a reflective operation that
alternates between denoising and inversion with the weak-to-strong difference, we
show theoretically that W2SD steers latent variables along sampling
trajectories toward regions of the real data distribution. W2SD is highly
flexible and broadly applicable, enabling diverse improvements through the
strategic selection of weak-to-strong model pairs (e.g., DreamShaper vs. SD1.5,
good experts vs. bad experts in MoE). Extensive experiments demonstrate that
W2SD significantly improves human preference, aesthetic quality, and prompt
adherence, achieving SOTA performance across various modalities (e.g., image,
video), architectures (e.g., UNet-based, DiT-based, MoE), and benchmarks. For
example, Juggernaut-XL with W2SD improves the HPSv2 winning rate up to
90% over the original results. Moreover, the performance gains achieved by W2SD
markedly outweigh its additional computational overhead, while the cumulative
improvements from different weak-to-strong differences further solidify its
practical utility and deployability.
comment: 23 pages, 23 figures, 15 tables
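One way to read the reflective operation described above is a sampling loop that
takes a denoising step with the strong model and then re-inverts with the weak
model, so the accumulated difference nudges latents toward the data manifold. The
sketch below uses placeholder step functions and is only a schematic reading of
the abstract, not the W2SD reference implementation.

```python
# Schematic of a weak-to-strong reflective sampling loop (placeholder step functions;
# one reading of the abstract, not the W2SD reference implementation).
def w2sd_sample(x_T, timesteps, denoise_strong, invert_weak):
    """denoise_strong(x, t): one reverse (denoising) step with the strong model.
    invert_weak(x, t): one forward (inversion) step with the weak model.
    Alternating the two accumulates the weak-to-strong difference in the latent."""
    x = x_T
    for t in timesteps[:-1]:                   # high noise -> low noise
        x = denoise_strong(x, t)               # move toward the strong model's manifold
        x = invert_weak(x, t)                  # reflect back with the weak model
    return denoise_strong(x, timesteps[-1])    # final denoising step

# Toy usage with dummy step functions standing in for real diffusion updates:
out = w2sd_sample(1.0, [0.9, 0.6, 0.3],
                  denoise_strong=lambda x, t: x * (1.0 - 0.1 * t),
                  invert_weak=lambda x, t: x * (1.0 + 0.05 * t))
```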
♻ ☆ Contrastive Learning with Synthetic Positives
Contrastive learning with the nearest neighbor has proved to be one of the
most efficient self-supervised learning (SSL) techniques by utilizing the
similarity of multiple instances within the same class. However, its efficacy
is constrained as the nearest neighbor algorithm primarily identifies "easy"
positive pairs, where the representations are already closely located in the
embedding space. In this paper, we introduce a novel approach called
Contrastive Learning with Synthetic Positives (CLSP) that utilizes synthetic
images, generated by an unconditional diffusion model, as the additional
positives to help the model learn from diverse positives. Through feature
interpolation in the diffusion model sampling process, we generate images with
distinct backgrounds yet similar semantic content to the anchor image. These
images are considered "hard" positives for the anchor image, and when included
as supplementary positives in the contrastive loss, they contribute to a
performance improvement of over 2% and 1% in linear evaluation compared to the
previous NNCLR and All4One methods across multiple benchmark datasets such as
CIFAR10, achieving state-of-the-art performance. On transfer learning benchmarks,
CLSP outperforms existing SSL frameworks on 6 out of 8 downstream datasets. We
believe CLSP establishes a valuable baseline for future SSL studies
incorporating synthetic data in the training process.
comment: 8 pages, conference
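The "supplementary positive" idea above can be sketched as an extra term in an
NT-Xent-style contrastive loss that pulls the anchor toward its diffusion-generated
hard positive. The weighting and the synthetic-image generation are assumptions,
not the CLSP code.

```python
# Sketch: an extra synthetic positive added to an NT-Xent-style loss
# (weighting and the synthetic-image generator are assumptions, not the CLSP code).
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """anchor, positive: (N, D) L2-normalized embeddings; other batch items act as negatives."""
    logits = anchor @ positive.T / temperature          # (N, N)
    labels = torch.arange(anchor.shape[0])
    return F.cross_entropy(logits, labels)

def clsp_loss(z_view1, z_view2, z_synth, alpha=0.5):
    """Standard two-view term plus a term pulling the anchor toward its synthetic positive."""
    z1, z2, zs = (F.normalize(z, dim=1) for z in (z_view1, z_view2, z_synth))
    return info_nce(z1, z2) + alpha * info_nce(z1, zs)

loss = clsp_loss(torch.randn(32, 128), torch.randn(32, 128), torch.randn(32, 128))
```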
♻ ☆ ARF-Plus: Controlling Perceptual Factors in Artistic Radiance Fields for 3D Scene Stylization WACV 2025
The radiance fields style transfer is an emerging field that has recently
gained popularity as a means of 3D scene stylization, thanks to the outstanding
performance of neural radiance fields in 3D reconstruction and view synthesis.
We highlight a research gap in radiance fields style transfer, the lack of
sufficient perceptual controllability, motivated by the existing concept in the
2D image style transfer. In this paper, we present ARF-Plus, a 3D neural style
transfer framework offering manageable control over perceptual factors, to
systematically explore the perceptual controllability in 3D scene stylization.
Four distinct types of controls - color preservation control, (style pattern)
scale control, spatial (selective stylization area) control, and depth
enhancement control - are proposed and integrated into this framework. Results
from real-world datasets, both quantitative and qualitative, show that the four
types of controls in our ARF-Plus framework successfully accomplish their
corresponding perceptual controls when stylizing 3D scenes. These techniques
work well for individual style inputs as well as for the simultaneous
application of multiple styles within a scene. This allows customized
modification of stylization effects and flexible merging of the strengths of
different styles, enabling the creation of novel and visually striking stylistic
effects on 3D scenes.
comment: Accepted at WACV 2025. The published version is available at
https://ieeexplore.ieee.org/document/10944114
♻ ☆ Variational Self-Supervised Learning NeurIPS 2025
We present Variational Self-Supervised Learning (VSSL), a novel framework
that combines variational inference with self-supervised learning to enable
efficient, decoder-free representation learning. Unlike traditional VAEs that
rely on input reconstruction via a decoder, VSSL symmetrically couples two
encoders with Gaussian outputs. A momentum-updated teacher network defines a
dynamic, data-dependent prior, while the student encoder produces an
approximate posterior from augmented views. The reconstruction term in the ELBO
is replaced with a cross-view denoising objective, preserving the analytical
tractability of Gaussian KL divergence. We further introduce cosine-based
formulations of KL and log-likelihood terms to enhance semantic alignment in
high-dimensional latent spaces. Experiments on CIFAR-10, CIFAR-100, and
ImageNet-100 show that VSSL achieves competitive or superior performance to
leading self-supervised methods, including BYOL and MoCo V3. VSSL offers a
scalable, probabilistically grounded approach to learning transferable
representations without generative reconstruction, bridging the gap between
variational modeling and modern self-supervised techniques.
comment: NeurIPS 2025 - SSL Workshop Submission
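A minimal sketch of the decoder-free objective described above: a closed-form KL
between the student's Gaussian posterior and the teacher's Gaussian prior, plus a
cross-view prediction term (here, a cosine distance to the stop-gradient teacher
mean). The weighting and the momentum update of the teacher are omitted; this is
an illustration under those assumptions, not the VSSL code.

```python
# Minimal sketch of a decoder-free variational SSL objective: Gaussian KL(student || teacher prior)
# plus a cross-view prediction term (weights and encoders are placeholders).
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over latent dims, averaged over batch."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=1).mean()

def vssl_loss(student_mu, student_logvar, teacher_mu, teacher_logvar, beta=0.1):
    # Cross-view term: the student's mean should predict the (stop-gradient) teacher mean.
    recon = 1.0 - F.cosine_similarity(student_mu, teacher_mu.detach(), dim=1).mean()
    return recon + beta * gaussian_kl(student_mu, student_logvar,
                                      teacher_mu.detach(), teacher_logvar.detach())

loss = vssl_loss(torch.randn(16, 64), torch.zeros(16, 64), torch.randn(16, 64), torch.zeros(16, 64))
```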
♻ ☆ Putting the Segment Anything Model to the Test with 3D Knee MRI - A Comparison with State-of-the-Art Performance BMVC 2024
Menisci are cartilaginous tissue found within the knee that contribute to
joint lubrication and weight dispersal. Damage to menisci can lead to onset and
progression of knee osteoarthritis (OA), a condition that is a leading cause of
disability, and for which there are few effective therapies. Accurate automated
segmentation of menisci would allow for earlier detection and treatment of
meniscal abnormalities, as well as shedding more light on the role the menisci
play in OA pathogenesis. Work in this area has mainly used variants of
convolutional networks, but there has been no attempt to utilise recent large
vision transformer segmentation models. The Segment Anything Model (SAM) is a
so-called foundation segmentation model, which has been found useful across a
range of different tasks due to the large volume of data used for training the
model. In this study, SAM was adapted to perform fully-automated segmentation
of menisci from 3D knee magnetic resonance images. A 3D U-Net was also trained
as a baseline. It was found that, when fine-tuning only the decoder, SAM was
unable to compete with 3D U-Net, achieving a Dice score of $0.81\pm0.03$,
compared to $0.87\pm0.03$, on a held-out test set. When fine-tuning SAM
end-to-end, a Dice score of $0.87\pm0.03$ was achieved. The performance of both
the end-to-end trained SAM configuration and the 3D U-Net were comparable to
the winning Dice score ($0.88\pm0.03$) in the IWOAI Knee MRI Segmentation
Challenge 2019. Performance in terms of the Hausdorff Distance showed that both
configurations of SAM were inferior to 3D U-Net in matching the meniscus
morphology. Results demonstrated that, despite its generalisability, SAM was
unable to outperform a basic 3D U-Net in meniscus segmentation, and may not be
suitable for similar 3D medical image segmentation tasks also involving fine
anatomical structures with low contrast and poorly-defined boundaries.
comment: Work accepted at BMVC 2024. Minor changes to the camera-ready version
since acceptance include a corrected running header and the addition of an
Acknowledgments section (including code availability)
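The Dice scores quoted above compare binary segmentation masks; a minimal
implementation for 3D volumes is shown below (the evaluation protocol details are
the study's own).

```python
# Dice similarity coefficient for binary 3D segmentation masks.
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """pred, target: boolean 3D arrays of the same shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

a = np.zeros((8, 64, 64), dtype=bool); a[:, 10:30, 10:30] = True
b = np.zeros_like(a); b[:, 12:32, 12:32] = True
print(round(float(dice_score(a, b)), 3))  # Dice of two shifted boxes, 0.81
```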
♻ ☆ FMNV: A Dataset of Media-Published News Videos for Fake News Detection
News media, particularly video-based platforms, have become deeply embedded
in daily life, concurrently amplifying risks of misinformation dissemination.
Consequently, multimodal fake news detection has garnered significant research
attention. However, existing datasets predominantly comprise user-generated
videos characterized by crude editing and limited public engagement, whereas
professionally crafted fake news videos disseminated by media outlets, which are
often politically or virally motivated, pose substantially greater societal harm. To
address this gap, we construct FMNV, a novel dataset exclusively composed of
news videos published by media organizations. Through empirical analysis of
existing datasets and our curated collection, we categorize fake news videos
into four distinct types. Building upon this taxonomy, we employ Large Language
Models (LLMs) to automatically generate deceptive content by manipulating
authentic media-published news videos. Furthermore, we propose FMNVD, a
baseline model featuring a dual-stream architecture integrating CLIP and Faster
R-CNN for video feature extraction, enhanced by co-attention mechanisms for
feature refinement and multimodal aggregation. Comparative experiments
demonstrate both the generalization capability of FMNV across multiple
baselines and the superior detection efficacy of FMNVD. This work establishes
critical benchmarks for detecting high-impact fake news in media ecosystems
while advancing methodologies for cross-modal inconsistency analysis.
♻ ☆ Continuous and complete liver vessel segmentation with graph-attention guided diffusion
Improving connectivity and completeness are the most challenging aspects of
liver vessel segmentation, especially for small vessels. These challenges
require both learning the continuous vessel geometry and focusing on small
vessel detection. However, current methods do not explicitly address these two
aspects and cannot generalize well when constrained by inconsistent
annotations. Here, we take advantage of the generalization of the diffusion
model and explicitly integrate connectivity and completeness in our
diffusion-based segmentation model. Specifically, we use a graph-attention
module that adds knowledge about vessel geometry. Additionally, we apply
graph attention at multiple scales, thus focusing on small liver vessels. Our
method outperforms five state-of-the-art medical segmentation methods on two
public datasets: 3D-ircadb-01 and LiVS.
comment: Second version
♻ ☆ Latent Representations for Visual Proprioception in Inexpensive Robots
Robotic manipulation requires explicit or implicit knowledge of the robot's
joint positions. Precise proprioception is standard in high-quality industrial
robots but is often unavailable in inexpensive robots operating in unstructured
environments. In this paper, we ask: to what extent can a fast, single-pass
regression architecture perform visual proprioception from a single external
camera image, available even in the simplest manipulation settings? We explore
several latent representations, including CNNs, VAEs, ViTs, and bags of
uncalibrated fiducial markers, using fine-tuning techniques adapted to the
limited data available. We evaluate the achievable accuracy through experiments
on an inexpensive 6-DoF robot.
♻ ☆ Disentangling Visual Transformers: Patch-level Interpretability for Image Classification CVPR 2025
Visual transformers have achieved remarkable performance in image
classification tasks, but this performance gain has come at the cost of
interpretability. One of the main obstacles to the interpretation of
transformers is the self-attention mechanism, which mixes visual information
across the whole image in a complex way. In this paper, we propose Hindered
Transformer (HiT), a novel interpretable-by-design architecture inspired by
visual transformers. Our proposed architecture rethinks the design of
transformers to better disentangle patch influences at the classification
stage. Ultimately, HiT can be interpreted as a linear combination of
patch-level information. We show that the advantages of our approach in terms
of explicability come with a reasonable trade-off in performance, making it an
attractive alternative for applications where interpretability is paramount.
comment: CVPR 2025 official version. Main manuscript + supplementary
♻ ☆ ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion
We introduce ObjectAdd, a training-free diffusion modification method to add
user-expected objects into a user-specified area. The motivation for ObjectAdd
stems from two observations: first, describing everything in one prompt can be
difficult, and second, users often need to add objects into the generated image.
To accommodate real-world use, our ObjectAdd maintains accurate image
consistency after adding objects with technical innovations in: (1)
embedding-level concatenation to ensure correct text embedding coalescence; (2)
object-driven layout control with latent and attention injection to ensure
objects are placed in the user-specified area; (3) prompted image inpainting in
an attention refocusing & object expansion fashion to ensure the rest of the
image stays the same. With a text-prompted image, our ObjectAdd allows users to
specify a box and an object, and achieves: (1) adding the object inside the box
area; (2) preserving the exact content outside the box area; (3) flawless
fusion between the two areas.
comment: 13 pages in total
♻ ☆ AgentsCoMerge: Large Language Model Empowered Collaborative Decision Making for Ramp Merging IEEE
Ramp merging is one of the main bottlenecks in traffic systems, commonly
causing traffic congestion, accidents, and severe carbon emissions. In order to
address this essential issue and enhance the safety and efficiency of connected
and autonomous vehicles (CAVs) at multi-lane merging zones, we propose a novel
collaborative decision-making framework, named AgentsCoMerge, to leverage large
language models (LLMs). Specifically, we first design a scene observation and
understanding module to allow an agent to capture the traffic environment. Then
we propose a hierarchical planning module to enable the agent to make decisions
and plan trajectories based on the observation and the agent's own state. In
addition, in order to facilitate collaboration among multiple agents, we
introduce a communication module to enable the surrounding agents to exchange
necessary information and coordinate their actions. Finally, we develop a
reinforcement reflection guided training paradigm to further enhance the
decision-making capability of the framework. Extensive experiments are
conducted to evaluate the performance of our proposed method, demonstrating its
superior efficiency and effectiveness for multi-agent collaborative
decision-making under various ramp merging scenarios.
comment: Accepted by IEEE Transactions on Mobile Computing (TMC)
♻ ☆ A New Graph Grammar Formalism for Robust Syntactic Pattern Recognition
I introduce a formalism for representing the syntax of recursively structured
graph-like patterns. It does not use production rules, like a conventional
graph grammar, but represents the syntactic structure in a more direct and
declarative way. The grammar and the pattern are both represented as networks,
and parsing is seen as the construction of a homomorphism from the pattern to
the grammar. The grammars can represent iterative, hierarchical and nested
recursive structure in more than one dimension.
This supports a highly parallel style of parsing, in which all aspects of
pattern recognition (feature detection, segmentation, parsing, filling in
missing symbols, top-down and bottom-up inference) are integrated into a single
process, to exploit the synergy between them.
The emphasis of this paper is on underlying theoretical issues, but I also
give some example runs to illustrate the error-tolerant parsing of complex
recursively structured patterns of 50-1000 symbols, involving variability in
geometric relationships, blurry and indistinct symbols, overlapping symbols,
cluttered images, and erased patches.
comment: 64 pages, 23 figures. Version 2: mathematical supplement added, 98
pages, 1 figure
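The parsing-as-homomorphism view described above can be illustrated with a tiny
backtracking search for a label-preserving, edge-preserving map from a pattern
graph to a grammar graph. This is only a conceptual illustration; the paper's
parser is parallel and error-tolerant, which this sketch is not.

```python
# Tiny illustration of parsing-as-homomorphism: find a label- and edge-preserving map
# from a pattern graph to a grammar graph (plain backtracking, no error tolerance).
def find_homomorphism(pat_nodes, pat_edges, gram_nodes, gram_edges, mapping=None):
    """pat_nodes/gram_nodes: {node: label}; pat_edges/gram_edges: set of (u, v) pairs."""
    mapping = mapping or {}
    if len(mapping) == len(pat_nodes):
        return mapping
    u = next(n for n in pat_nodes if n not in mapping)
    for g, label in gram_nodes.items():
        if label != pat_nodes[u]:
            continue                                      # labels must be preserved
        mapping[u] = g
        ok = all((mapping[a], mapping[b]) in gram_edges   # mapped edges must exist in the grammar
                 for a, b in pat_edges if a in mapping and b in mapping)
        if ok:
            result = find_homomorphism(pat_nodes, pat_edges, gram_nodes, gram_edges, mapping)
            if result:
                return result
        del mapping[u]
    return None

pattern = ({1: "stem", 2: "leaf"}, {(1, 2)})
grammar = ({"S": "stem", "L": "leaf"}, {("S", "L"), ("S", "S")})
print(find_homomorphism(*pattern, *grammar))   # {1: 'S', 2: 'L'}
```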
♻ ☆ Causal Disentanglement for Robust Long-tail Medical Image Generation
Counterfactual medical image generation effectively addresses data scarcity
and enhances the interpretability of medical images. However, due to the
complex and diverse pathological features of medical images and the imbalanced
class distribution in medical data, generating high-quality and diverse medical
images from limited data is significantly challenging. Additionally, it is
difficult to fully leverage the information in limited data, such as anatomical
structure information, and to generate more structurally stable medical images
while avoiding distortion or inconsistency. In this paper, in order to enhance the clinical
relevance of generated data and improve the interpretability of the model, we
propose a novel medical image generation framework, which generates independent
pathological and structural features based on causal disentanglement and
utilizes text-guided modeling of pathological features to regulate the
generation of counterfactual images. First, we achieve feature separation
through causal disentanglement and analyze the interactions between features.
Here, we introduce group supervision to ensure the independence of pathological
and identity features. Second, we leverage a diffusion model guided by
pathological findings to model pathological features, enabling the generation
of diverse counterfactual images. Meanwhile, we enhance accuracy by leveraging
a large language model to extract lesion severity and location from medical
reports. Additionally, we improve the performance of the latent diffusion model
on long-tailed categories through initial noise optimization.
♻ ☆ PhysFlow: Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation CVPR 2025
Realistic simulation of dynamic scenes requires accurately capturing diverse
material properties and modeling complex object interactions grounded in
physical principles. However, existing methods are constrained to basic
material types with limited predictable parameters, making them insufficient to
represent the complexity of real-world materials. We introduce PhysFlow, a
novel approach that leverages multi-modal foundation models and video diffusion
to achieve enhanced 4D dynamic scene simulation. Our method utilizes
multi-modal models to identify material types and initialize material
parameters through image queries, while simultaneously inferring 3D Gaussian
splats for detailed scene representation. We further refine these material
parameters using video diffusion with a differentiable Material Point Method
(MPM) and optical flow guidance rather than render loss or Score Distillation
Sampling (SDS) loss. This integrated framework enables accurate prediction and
realistic simulation of dynamic interactions in real-world scenarios, advancing
both accuracy and flexibility in physics-based simulations.
comment: CVPR 2025. Homepage: https://zhuomanliu.github.io/PhysFlow/
♻ ☆ Large-image Object Detection for Fine-grained Recognition of Punches Patterns in Medieval Panel Painting
Josh Bruegger, Diana Ioana Catana, Vanja Macovaz, Matias Valdenegro-Toro, Matthia Sabatelli, Marco Zullich
The attribution of the author of an art piece is typically a laborious manual
process, usually relying on subjective evaluations of expert figures. However,
there are some situations in which quantitative features of the artwork can
support these evaluations. The extraction of these features can sometimes be
automated, for instance, with the use of Machine Learning (ML) techniques. An
example of these features is represented by repeated, mechanically impressed
patterns, called punches, present chiefly in 13th and 14th-century panel
paintings from Tuscany. Previous research in art history showcased a strong
connection between the shapes of punches and specific artists or workshops,
suggesting the possibility of using these quantitative cues to support the
attribution. In the present work, we first collect a dataset of large-scale
images of these panel paintings. Then, using YOLOv10, a recent and popular
object detection model, we train an ML pipeline to perform object detection on
the punches contained in the images. Due to the large size of the images, the
detection procedure is split across multiple frames by adopting a
sliding-window approach with overlaps, after which the predictions are combined
for the whole image using a custom non-maximal suppression routine. Our results
indicate that art historians working in the field can reliably use our method
for the identification and extraction of punches.
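The tiling-and-merging step generalizes beyond this application; below is a
schematic of overlapping sliding-window coordinates and a simple IoU-based
non-maximal suppression over the pooled per-tile detections. Thresholds and the
detector call are placeholders, and the paper's suppression routine is custom.

```python
# Schematic of sliding-window inference over a large image with IoU-based merging of
# per-tile detections (the actual pipeline uses YOLOv10 and a custom suppression routine).
def tile_coords(width, height, tile=1024, overlap=256):
    """Top-left corners of overlapping tiles covering the full image."""
    step = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, step))
    ys = list(range(0, max(height - tile, 0) + 1, step))
    if xs[-1] != max(width - tile, 0):      # ensure the right/bottom borders are covered
        xs.append(max(width - tile, 0))
    if ys[-1] != max(height - tile, 0):
        ys.append(max(height - tile, 0))
    return [(x, y) for y in ys for x in xs]

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thr=0.5):
    """Greedy non-maximal suppression on (x1, y1, x2, y2) boxes in full-image coordinates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thr for j in keep):
            keep.append(i)
    return keep

tiles = tile_coords(3000, 2000)   # per-tile detections are offset by (x, y) before nms()
kept = nms([(0, 0, 10, 10), (2, 2, 12, 12), (50, 50, 60, 60)], [0.9, 0.8, 0.7])
```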
♻ ☆ Shifts in Doctors' Eye Movements Between Real and AI-Generated Medical Images
David C Wong, Bin Wang, Gorkem Durak, Marouane Tliba, Mohamed Amine Kerkouri, Aladine Chetouani, Ahmet Enis Cetin, Cagdas Topel, Nicolo Gennaro, Camila Vendrami, Tugce Agirlar Trabzonlu, Amir Ali Rahsepar, Laetitia Perronne, Matthew Antalek, Onural Ozturk, Gokcan Okur, Andrew C. Gordon, Ayis Pyrros, Frank H Miller, Amir A Borhani, Hatice Savas, Eric M. Hart, Elizabeth A Krupinski, Ulas Bagci
Eye-tracking analysis plays a vital role in medical imaging, providing key
insights into how radiologists visually interpret and diagnose clinical cases.
In this work, we first analyze radiologists' attention and agreement by
measuring the distribution of various eye-movement patterns, including saccade
direction, amplitude, and their joint distribution. These metrics help uncover
patterns in attention allocation and diagnostic strategies. Furthermore, we
investigate whether and how doctors' gaze behavior shifts when viewing
authentic (Real) versus deep-learning-generated (Fake) images. To achieve this,
we examine fixation bias maps, focusing on first, last, short, and longest
fixations independently, along with detailed saccade patterns, to quantify
differences in gaze distribution and visual saliency between authentic and
synthetic images.
comment: This paper was accepted at ETRA 2025 Japan
♻ ☆ Dynamic Pyramid Network for Efficient Multimodal Large Language Model
Hao Ai, Kunyi Wang, Zezhou Wang, Hao Lu, Jin Tian, Yaxin Luo, Peng Xing, Jen-Yuan Huang, Huaxia Li, Gen luo
Multimodal large language models (MLLMs) have demonstrated impressive
performance in various vision-language (VL) tasks, but their expensive
computations still limit their real-world application. To address this issue,
recent efforts aim to compress the visual features to save the computational
costs of MLLMs. However, direct visual compression methods, e.g. efficient
projectors, inevitably destroy the visual semantics in MLLMs, especially in
difficult samples. To overcome this shortcoming, we propose a novel dynamic
pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates MLLM as
a hierarchical structure where visual features are gradually compressed with
increasing depth. In this case, even with a high compression ratio,
fine-grained visual information can still be perceived in shallow layers. To
maximize the benefit of DPN, we further propose an innovative Dynamic Pooling
Experts (DPE) that can dynamically choose the optimal visual compression rate
according to input features. With this design, harder samples will be assigned
larger computations, thus preserving the model performance. To validate our
approach, we conduct extensive experiments on two popular MLLMs and ten
benchmarks. Experimental results show that DPN can save up to 56% average FLOPs
on LLaVA while further achieving +0.74% performance gains. Besides, the
generalization ability of DPN is also validated on the existing high-resolution
MLLM called LLaVA-HR. The source code will be released at
https://github.com/aihao2000/DPN-LLaVA.
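The dynamic pooling idea above can be sketched as a small gate that scores a few
candidate compression rates per sample, with the visual token sequence then
average-pooled at the selected rate. The candidate rates, the gate design, and the
hard selection below are illustrative assumptions (a trained system would need a
differentiable routing scheme), not the DPN implementation.

```python
# Sketch of per-sample dynamic visual-token pooling (candidate rates and the gate are
# illustrative assumptions, not the DPN implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicPooling(nn.Module):
    def __init__(self, dim, rates=(1, 2, 4)):
        super().__init__()
        self.rates = rates
        self.gate = nn.Linear(dim, len(rates))   # scores each candidate compression rate

    def forward(self, tokens):                   # tokens: (B, N, D) visual tokens
        # Hard argmax selection for clarity; training would use soft or straight-through routing.
        choice = self.gate(tokens.mean(dim=1)).argmax(dim=1)      # (B,)
        out = []
        for b in range(tokens.shape[0]):
            r = self.rates[choice[b]]
            out.append(F.avg_pool1d(tokens[b].T.unsqueeze(0), kernel_size=r, stride=r)[0].T)
        return out  # list of (N // r_b, D); harder samples can keep more tokens (r = 1)

pooled = DynamicPooling(dim=64)(torch.randn(2, 16, 64))
```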
♻ ☆ QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning ICRA 2025
Xinyang Tong, Pengxiang Ding, Yiguo Fan, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu
This paper addresses the inherent inference latency challenges associated
with deploying multimodal large language models (MLLM) in quadruped
vision-language-action (QUAR-VLA) tasks. Our investigation reveals that
conventional parameter reduction techniques ultimately impair the performance
of the language foundation model during the action instruction tuning phase,
making them unsuitable for this purpose. We introduce a novel latency-free
quadruped MLLM model, dubbed QUART-Online, designed to enhance inference
efficiency without degrading the performance of the language foundation model.
By incorporating Action Chunk Discretization (ACD), we compress the original
action representation space, mapping continuous action values onto a smaller
set of discrete representative vectors while preserving critical information.
Subsequently, we fine-tune the MLLM to integrate vision, language, and
compressed actions into a unified semantic space. Experimental results
demonstrate that QUART-Online operates in tandem with the existing MLLM system,
achieving real-time inference in sync with the underlying controller frequency,
significantly boosting the success rate across various tasks by 65%. Our
project page is https://quart-online.github.io.
comment: Accepted to ICRA 2025; Github page: https://quart-online.github.io
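Action Chunk Discretization as described can be sketched as quantizing short chunks
of continuous actions against a small codebook and feeding the resulting discrete
tokens to the MLLM. Chunk length, codebook size, and the use of k-means-style
centroids below are assumptions, not the QUART-Online design.

```python
# Sketch of Action Chunk Discretization: quantize short chunks of continuous actions
# against a learned codebook (centroid construction here is an assumption).
import numpy as np

def make_chunks(actions, chunk_len):
    """actions: (T, A) continuous actions -> (T // chunk_len, chunk_len * A) flattened chunks."""
    T = (actions.shape[0] // chunk_len) * chunk_len
    return actions[:T].reshape(-1, chunk_len * actions.shape[1])

def discretize(chunks, codebook):
    """Return the index of the nearest codebook vector for each chunk."""
    d = ((chunks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
actions = rng.normal(size=(120, 12))            # 120 timesteps of 12-DoF commands
chunks = make_chunks(actions, chunk_len=8)      # (15, 96)
codebook = chunks[rng.choice(len(chunks), 4, replace=False)]  # stand-in for learned centroids
tokens = discretize(chunks, codebook)           # discrete action tokens fed to the MLLM
```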
♻ ☆ Review of Demographic Fairness in Face Recognition
Demographic fairness in face recognition (FR) has emerged as a critical area
of research, given its impact on fairness, equity, and reliability across
diverse applications. As FR technologies are increasingly deployed globally,
disparities in performance across demographic groups -- such as race, ethnicity,
and gender -- have garnered significant attention. These biases not only
compromise the credibility of FR systems but also raise ethical concerns,
especially when these technologies are employed in sensitive domains. This
review consolidates extensive research efforts providing a comprehensive
overview of the multifaceted aspects of demographic fairness in FR.
We systematically examine the primary causes, datasets, assessment metrics,
and mitigation approaches associated with demographic disparities in FR. By
categorizing key contributions in these areas, this work provides a structured
approach to understanding and addressing the complexity of this issue. Finally,
we highlight current advancements and identify emerging challenges that need
further investigation. This article aims to provide researchers with a unified
perspective on the state-of-the-art while emphasizing the critical need for
equitable and trustworthy FR systems.
comment: under review
♻ ☆ 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer CVPR 2025
Current 3D Large Multimodal Models (3D LMMs) have shown tremendous potential
in 3D-vision-based dialogue and reasoning. However, how to further enhance 3D
LMMs to achieve fine-grained scene understanding and facilitate flexible
human-agent interaction remains a challenging problem. In this work, we
introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an
intelligent assistant in comprehending, reasoning, and interacting with the 3D
world. Unlike existing top-performing methods that rely on complicated
pipelines, such as offline multi-view feature extraction or additional
task-specific heads, 3D-LLaVA adopts a minimalist design with integrated
architecture and only takes point clouds as input. At the core of 3D-LLaVA is a
new Omni Superpoint Transformer (OST), which integrates three functionalities:
(1) a visual feature selector that converts and selects visual tokens, (2) a
visual prompt encoder that embeds interactive visual prompts into the visual
token space, and (3) a referring mask decoder that produces 3D masks based on
text description. This versatile OST is empowered by the hybrid pretraining to
obtain perception priors and leveraged as the visual connector that bridges the
3D data to the LLM. After performing unified instruction tuning, our 3D-LLaVA
reports impressive results on various benchmarks.
comment: Accepted by CVPR 2025
♻ ☆ Machine Learning-Based Automated Assessment of Intracorporeal Suturing in Laparoscopic Fundoplication
Shekhar Madhav Khairnar, Huu Phong Nguyen, Alexis Desir, Carla Holcomb, Daniel J. Scott, Ganesh Sankaranarayanan
Automated assessment of surgical skills using artificial intelligence (AI)
provides trainees with instantaneous feedback. After bimanual tool motions are
captured, derived kinematic metrics are reliable predictors of performance in
laparoscopic tasks. Implementing automated tool tracking requires
time-intensive human annotation. We developed AI-based tool tracking using the
Segment Anything Model (SAM) to eliminate the need for human annotators. Here,
we describe a study evaluating the usefulness of our tool tracking model in
automated assessment during a laparoscopic suturing task in the fundoplication
procedure. An automated tool tracking model was applied to recorded videos of
Nissen fundoplication on porcine bowel. Surgeons were grouped as novices
(PGY1-2) and experts (PGY3-5, attendings). The beginning and end of each
suturing step were segmented, and motions of the left and right tools were
extracted. A low-pass filter with a 24 Hz cut-off frequency removed noise.
Performance was assessed using supervised and unsupervised models, and an
ablation study compared results. Kinematic features--RMS velocity, RMS
acceleration, RMS jerk, total path length, and Bimanual Dexterity--were
extracted and analyzed using Logistic Regression, Random Forest, Support Vector
Classifier, and XGBoost. PCA was performed for feature reduction. For
unsupervised learning, a Denoising Autoencoder (DAE) model with classifiers,
such as a 1-D CNN and traditional models, was trained. Data were extracted for
28 participants (9 novices, 19 experts). Supervised learning with PCA and
Random Forest achieved an accuracy of 0.795 and an F1 score of 0.778. The
unsupervised 1-D CNN achieved superior results with an accuracy of 0.817 and an
F1 score of 0.806, eliminating the need for kinematic feature computation. We
demonstrated an AI model capable of automated performance classification,
independent of human annotation.
comment: 17 pages
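The kinematic features named above (24 Hz low-pass filtering, RMS velocity,
acceleration, jerk, and path length) can be computed from a tool-tip trajectory
roughly as below; the sampling rate, filter order, and units are assumptions, not
the authors' code.

```python
# Rough computation of the kinematic features named above from a tool-tip trajectory
# (sampling rate and filter order are assumptions; not the study's implementation).
import numpy as np
from scipy.signal import butter, filtfilt

def kinematic_features(xy, fs=60.0, cutoff=24.0):
    """xy: (T, 2) tool-tip positions (pixels or mm), sampled at fs Hz."""
    b, a = butter(4, cutoff, btype="low", fs=fs)       # 24 Hz low-pass removes tracking noise
    xy = filtfilt(b, a, xy, axis=0)
    vel = np.gradient(xy, 1.0 / fs, axis=0)
    acc = np.gradient(vel, 1.0 / fs, axis=0)
    jerk = np.gradient(acc, 1.0 / fs, axis=0)
    rms = lambda v: float(np.sqrt((np.linalg.norm(v, axis=1) ** 2).mean()))
    return {
        "rms_velocity": rms(vel),
        "rms_acceleration": rms(acc),
        "rms_jerk": rms(jerk),
        "path_length": float(np.linalg.norm(np.diff(xy, axis=0), axis=1).sum()),
    }

feats = kinematic_features(np.cumsum(np.random.randn(600, 2), axis=0))
```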
♻ ☆ Vidi: Large Multimodal Models for Video Understanding and Editing
Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu
Humans naturally share information with those they are connected to, and
video has become one of the dominant mediums for communication and expression
on the Internet. To support the creation of high-quality large-scale video
content, a modern pipeline requires a comprehensive understanding of both the
raw input materials (e.g., the unedited footage captured by cameras) and the
editing components (e.g., visual effects). In video editing scenarios, models
must process multiple modalities (e.g., vision, audio, text) with strong
background knowledge and handle flexible input lengths (e.g., hour-long raw
videos), which poses significant challenges for traditional models. In this
report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a
wide range of video understanding and editing scenarios. The first release focuses on
temporal retrieval, i.e., identifying the time ranges within the input videos
corresponding to a given text query, which plays a critical role in intelligent
editing. The model is capable of processing hour-long videos with strong
temporal understanding capability, e.g., retrieve time ranges for certain
queries. To support a comprehensive evaluation in real-world scenarios, we also
present the VUE-TR benchmark, which introduces five key advancements: 1) video
duration: significantly longer than videos in existing temporal retrieval
datasets; 2) audio support: includes audio-based queries; 3) query format:
diverse query lengths/formats; 4) annotation quality: ground-truth time ranges
are manually annotated; 5) evaluation metric: a refined IoU metric to support
evaluation over multiple time ranges. Remarkably, Vidi significantly
outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the
temporal retrieval task, indicating its superiority in video editing scenarios.
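One plausible reading of the "refined IoU metric ... over multiple time ranges" is
an IoU computed over the union of predicted and ground-truth interval sets; the
sketch below implements that reading, which is an assumption rather than the
benchmark's exact definition.

```python
# IoU over sets of time ranges (one plausible reading of the VUE-TR metric, not its exact definition).
def merge(intervals):
    """Merge overlapping [start, end] intervals."""
    out = []
    for s, e in sorted(intervals):
        if out and s <= out[-1][1]:
            out[-1][1] = max(out[-1][1], e)
        else:
            out.append([s, e])
    return out

def covered(intervals):
    return sum(e - s for s, e in intervals)

def multi_range_iou(pred, gt):
    pred, gt = merge(pred), merge(gt)
    inter = 0.0
    for ps, pe in pred:
        for gs, ge in gt:
            inter += max(0.0, min(pe, ge) - max(ps, gs))
    union = covered(pred) + covered(gt) - inter
    return inter / union if union > 0 else 0.0

print(multi_range_iou([(10, 20), (40, 55)], [(12, 22), (50, 60)]))  # about 0.41
```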
♻ ☆ OmniMamba4D: Spatio-temporal Mamba for longitudinal CT lesion segmentation IEEE
Justin Namuk Kim, Yiqiao Liu, Rajath Soans, Keith Persson, Sarah Halek, Michal Tomaszewski, Jianda Yuan, Gregory Goldmacher, Antong Chen
Accurate segmentation of longitudinal CT scans is important for monitoring
tumor progression and evaluating treatment responses. However, existing 3D
segmentation models solely focus on spatial information. To address this gap,
we propose OmniMamba4D, a novel segmentation model designed for 4D medical
images (3D images over time). OmniMamba4D utilizes a spatio-temporal
tetra-orientated Mamba block to effectively capture both spatial and temporal
features. Unlike traditional 3D models, which analyze single-time points,
OmniMamba4D processes 4D CT data, providing comprehensive spatio-temporal
information on lesion progression. Evaluated on an internal dataset comprising
3,252 CT scans, OmniMamba4D achieves a competitive Dice score of 0.682,
comparable to state-of-the-art (SOTA) models, while maintaining computational
efficiency and better detecting disappeared lesions. This work demonstrates a
new framework to leverage spatio-temporal information for longitudinal CT
lesion segmentation.
comment: Accepted at IEEE International Symposium on Biomedical Imaging (ISBI)
2025
♻ ☆ How Well Can Vision-Language Models Understand Humans' Intention? An Open-ended Theory of Mind Question Evaluation Benchmark AAAI25
Vision Language Models (VLMs) have demonstrated strong reasoning capabilities
in Visual Question Answering (VQA) tasks; however, their ability to perform
Theory of Mind (ToM) tasks, such as inferring human intentions, beliefs, and
mental states, remains underexplored. We propose an open-ended question
framework to evaluate VLMs' performance across diverse categories of ToM tasks.
We curated and annotated a benchmark dataset of 30 images and evaluated the
performance of four VLMs of varying sizes. Our results show that the GPT-4
model outperformed all the others, with only one smaller model, GPT-4o-mini,
achieving comparable performance. We observed that VLMs often struggle to infer
intentions in complex scenarios such as bullying or cheating. Our findings
reveal that smaller models can sometimes infer correct intentions despite
relying on incorrect visual cues. The dataset is available at
https://github.com/ximingwen/ToM-AAAI25-Multimodal.
comment: 4 pages, accepted by ToM@AAAI25
♻ ☆ Diffusion Models Are Real-Time Game Engines ICLR 2025
We present GameNGen, the first game engine powered entirely by a neural model
that also enables real-time interaction with a complex environment over long
trajectories at high quality. When trained on the classic game DOOM, GameNGen
extracts gameplay and uses it to generate a playable environment that can
interactively simulate new trajectories. GameNGen runs at 20 frames per second
on a single TPU and remains stable over extended multi-minute play sessions.
Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG
compression. Human raters are only slightly better than random chance at
distinguishing short clips of the game from clips of the simulation, even after
5 minutes of auto-regressive generation. GameNGen is trained in two phases: (1)
an RL-agent learns to play the game and the training sessions are recorded, and
(2) a diffusion model is trained to produce the next frame, conditioned on the
sequence of past frames and actions. Conditioning augmentations help ensure
stable auto-regressive generation over long trajectories, and decoder
fine-tuning improves the fidelity of visual details and text.
comment: ICLR 2025. Project page: https://gamengen.github.io/
♻ ☆ On the Generalizability of Foundation Models for Crop Type Mapping
Yi-Chia Chang, Adam J. Stewart, Favyen Bastani, Piper Wolters, Shreya Kannan, George R. Huber, Jingtong Wang, Arindam Banerjee
Foundation models pre-trained using self-supervised learning have shown
powerful transfer learning capabilities on various downstream tasks, including
language understanding, text generation, and image recognition. The Earth
observation (EO) field has produced several foundation models pre-trained
directly on multispectral satellite imagery for applications like precision
agriculture, wildfire and drought monitoring, and natural disaster response.
However, few studies have investigated the ability of these models to
generalize to new geographic locations, and potential concerns of geospatial
bias -- models trained on data-rich developed nations not transferring well to
data-scarce developing nations -- remain. We investigate the ability of popular
EO foundation models to transfer to new geographic regions in the agricultural
domain, where differences in farming practices and class imbalance make
transfer learning particularly challenging. We first select five crop
classification datasets across five continents, normalizing for dataset size
and harmonizing classes to focus on four major cereal grains: maize, soybean,
rice, and wheat. We then compare three popular foundation models, pre-trained
on SSL4EO-S12, SatlasPretrain, and ImageNet, using in-distribution (ID) and
out-of-distribution (OOD) evaluation. Experiments show that pre-trained weights
designed explicitly for Sentinel-2, such as SSL4EO-S12, outperform general
pre-trained weights like ImageNet. Furthermore, while only 100 labeled images
are sufficient for achieving high overall accuracy, 900 images are required to
achieve high average accuracy due to class imbalance. All harmonized datasets
and experimental code are open-source and available for download.
♻ ☆ V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations
Large Vision Language Models (LVLMs) excel in various vision-language tasks.
Yet, their robustness to visual variations in position, scale, orientation, and
context that objects in natural scenes inevitably exhibit due to changes in
viewpoint and environment remains largely underexplored. To bridge this gap, we
introduce V$^2$R-Bench, a comprehensive benchmark framework for evaluating
Visual Variation Robustness of LVLMs, which encompasses automated evaluation
dataset generation and principled metrics for thorough robustness assessment.
Through extensive evaluation on 21 LVLMs, we reveal a surprising vulnerability
to visual variations, in which even advanced models that excel at complex
vision-language tasks significantly underperform on simple tasks such as object
recognition. Interestingly, these models exhibit a distinct visual position
bias that contradicts theories of effective receptive fields, and demonstrate a
human-like visual acuity threshold. To identify the source of these
vulnerabilities, we present a systematic framework for component-level
analysis, featuring a novel visualization approach for aligned visual features.
Results show that these vulnerabilities stem from error accumulation in the
pipeline architecture and inadequate multimodal alignment. Complementary
experiments with synthetic data further demonstrate that these limitations are
fundamentally architectural deficiencies, underscoring the need for architectural
innovations in future LVLM designs.
♻ ☆ RSEND: Retinex-based Squeeze and Excitation Network with Dark Region Detection for Efficient Low Light Image Enhancement
Images captured under low-light scenarios often suffer from low quality.
Previous CNN-based deep learning methods often involve using Retinex theory.
Nevertheless, most of them cannot perform well in more complicated datasets
like LOL-v2 while consuming too much computational resources. Besides, some of
these methods require sophisticated training at different stages, making the
procedure even more time-consuming and tedious. In this paper, we propose a
more accurate, concise, and one-stage Retinex theory based framework, RSEND.
RSEND first divides the low-light image into the illumination map and
reflectance map, then captures the important details in the illumination map
and performs light enhancement. After this step, it refines the enhanced
gray-scale image and does element-wise matrix multiplication with the
reflectance map. By denoising the output of the previous step, it
obtains the final result. In all the steps, RSEND utilizes Squeeze and
Excitation network to better capture the details. Comprehensive quantitative
and qualitative experiments show that our Efficient Retinex model significantly
outperforms other CNN-based models, achieving a PSNR improvement ranging from
0.44 dB to 4.2 dB in different datasets and even outperforms transformer-based
models in the LOL-v2-real dataset.
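The two building blocks referenced above, a standard Squeeze-and-Excitation block
and the Retinex-style recombination (enhanced illumination multiplied element-wise
with reflectance), can be sketched as follows; layer sizes are illustrative, not
the RSEND configuration.

```python
# Sketch: a standard Squeeze-and-Excitation block and the Retinex-style recombination
# (illustrative layer sizes; not the RSEND configuration).
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: global-average 'squeeze' then a two-layer 'excitation' gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))         # (B, C) per-channel weights
        return x * w[:, :, None, None]

def recombine(illumination_enhanced, reflectance):
    """Retinex recombination: output = enhanced illumination * reflectance, element-wise."""
    return torch.clamp(illumination_enhanced * reflectance, 0.0, 1.0)

feats = SEBlock(16)(torch.randn(1, 16, 32, 32))                  # channel-reweighted features
img = recombine(torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32))
```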
♻ ☆ Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models
Diagnostic imaging relies on interpreting both images and radiology reports,
but the growing data volumes place significant pressure on medical experts,
yielding increased errors and workflow backlogs. Medical vision-language models
(med-VLMs) have emerged as a powerful framework to efficiently process
multimodal imaging data, particularly in chest X-ray (CXR) evaluations, albeit
their performance hinges on how well image and text representations are
aligned. Existing alignment methods, predominantly based on contrastive
learning, prioritize separation between disease classes over segregation of
fine-grained pathology attributes like location, size or severity, leading to
suboptimal representations. Here, we propose MedTrim (Meta-entity-driven
Triplet mining), a novel method that enhances image-text alignment through
multimodal triplet learning synergistically guided by disease class as well as
adjectival and directional pathology descriptors. Unlike common alignment
methods that separate broad disease classes, MedTrim leverages structured
meta-entity information to preserve subtle but clinically significant
intra-class variations. For this purpose, we first introduce an ontology-based
entity recognition module that extracts pathology-specific meta-entities from
CXR reports, as annotations on pathology attributes are rare in public
datasets. For refined sample selection in triplet mining, we then introduce a
novel score function that captures an aggregate measure of inter-sample
similarity based on disease classes and adjectival/directional descriptors.
Lastly, we introduce a multimodal triplet alignment objective for explicit
within- and cross-modal alignment between samples sharing detailed pathology
characteristics. Our demonstrations indicate that MedTrim improves performance
in downstream retrieval and classification tasks compared to state-of-the-art
alignment methods.
comment: 18 pages, 7 figures, 6 tables
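The meta-entity-guided triplet mining described above can be sketched as an
aggregate similarity score that mixes disease-class agreement with overlap of
adjectival/directional descriptors, with the highest-scoring non-identical sample
used as the positive in a standard triplet loss. The score weights and set-overlap
form below are assumptions; the paper defines its own score function.

```python
# Sketch of meta-entity-driven triplet selection feeding a standard triplet loss
# (score weights and overlap measure are assumptions, not MedTrim's exact score function).
import torch
import torch.nn.functional as F

def pair_score(meta_a, meta_b, w_class=1.0, w_attr=0.5):
    """meta_*: dicts with a disease 'classes' set and an 'attributes' set
    (adjectival/directional descriptors extracted from the report)."""
    class_sim = len(meta_a["classes"] & meta_b["classes"]) / max(len(meta_a["classes"] | meta_b["classes"]), 1)
    attr_sim = len(meta_a["attributes"] & meta_b["attributes"]) / max(len(meta_a["attributes"] | meta_b["attributes"]), 1)
    return w_class * class_sim + w_attr * attr_sim

def mine_triplet(anchor_idx, metas):
    """Positive = highest-scoring other sample; negative = lowest-scoring one."""
    scores = [(pair_score(metas[anchor_idx], m), i) for i, m in enumerate(metas) if i != anchor_idx]
    return max(scores)[1], min(scores)[1]

metas = [
    {"classes": {"effusion"}, "attributes": {"left", "small"}},
    {"classes": {"effusion"}, "attributes": {"left", "moderate"}},
    {"classes": {"pneumothorax"}, "attributes": {"right"}},
]
pos, neg = mine_triplet(0, metas)
emb = F.normalize(torch.randn(3, 128), dim=1)           # stand-in image/text embeddings
loss = F.triplet_margin_loss(emb[0:1], emb[pos:pos + 1], emb[neg:neg + 1], margin=0.2)
```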
♻ ☆ Marginalized Generalized IoU (MGIoU): A Unified Objective Function for Optimizing Any Convex Parametric Shapes
Optimizing the similarity between parametric shapes is crucial for numerous
computer vision tasks, where Intersection over Union (IoU) stands as the
canonical measure. However, existing optimization methods exhibit significant
shortcomings: regression-based losses like L1/L2 lack correlation with IoU,
IoU-based losses are unstable and limited to simple shapes, and task-specific
methods are computationally intensive and not generalizable across domains. As
a result, the current landscape of parametric shape objective functions has
become scattered, with each domain proposing distinct IoU approximations. To
address this, we unify the parametric shape optimization objective functions by
introducing Marginalized Generalized IoU (MGIoU), a novel loss function that
overcomes these challenges by projecting structured convex shapes onto their
unique shape Normals to compute one-dimensional normalized GIoU. MGIoU offers a
simple, efficient, fully differentiable approximation strongly correlated with
IoU. We then extend MGIoU to MGIoU+ that supports optimizing unstructured
convex shapes. Together, MGIoU and MGIoU+ unify parametric shape optimization
across diverse applications. Experiments on standard benchmarks demonstrate
that MGIoU and MGIoU+ consistently outperform existing losses while reducing
loss computation latency by 10-40x. Additionally, MGIoU and MGIoU+ satisfy
metric properties and scale-invariance, ensuring robustness as an objective
function. We further propose MGIoU- for minimizing overlaps in tasks like
collision-free trajectory prediction. Code is available at
https://ldtho.github.io/MGIoU
comment: 8 pages
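The projection idea can be sketched for 2D convex polygons: project both shapes
onto each edge normal of the target, compute a one-dimensional GIoU per axis, and
average. The choice of normal set and the aggregation below are illustrative; the
paper's exact formulation may differ.

```python
# Sketch of marginalizing 1D GIoU over a convex polygon's edge normals
# (aggregation details are illustrative; see the paper for the exact formulation).
import numpy as np

def giou_1d(a, b):
    """1D GIoU for intervals a = (lo, hi), b = (lo, hi)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    hull = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union - (hull - union) / hull

def edge_normals(poly):
    """Unit normals of a convex polygon's edges; poly: (N, 2) vertices in order."""
    edges = np.roll(poly, -1, axis=0) - poly
    normals = np.stack([-edges[:, 1], edges[:, 0]], axis=1)
    return normals / np.linalg.norm(normals, axis=1, keepdims=True)

def mgiou(poly_pred, poly_gt):
    normals = edge_normals(poly_gt)                # project onto the target's shape normals
    vals = []
    for n in normals:
        p, g = poly_pred @ n, poly_gt @ n
        vals.append(giou_1d((p.min(), p.max()), (g.min(), g.max())))
    return float(np.mean(vals))                    # in [-1, 1]; 1.0 when projections coincide

square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
print(mgiou(square + 0.25, square))                # partially overlapping squares, 0.6
```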
♻ ☆ Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
Recent advancements in text-to-video models such as Sora, Gen-3, MovieGen,
and CogVideoX are pushing the boundaries of synthetic video generation, with
adoption seen in fields like robotics, autonomous driving, and entertainment.
As these models become prevalent, various metrics and benchmarks have emerged
to evaluate the quality of the generated videos. However, these metrics
emphasize visual quality and smoothness, neglecting temporal fidelity and
text-to-video alignment, which are crucial for safety-critical applications. To
address this gap, we introduce NeuS-V, a novel synthetic video evaluation
metric that rigorously assesses text-to-video alignment using neuro-symbolic
formal verification techniques. Our approach first converts the prompt into a
formally defined Temporal Logic (TL) specification and translates the generated
video into an automaton representation. Then, it evaluates the text-to-video
alignment by formally checking the video automaton against the TL
specification. Furthermore, we present a dataset of temporally extended prompts
to evaluate state-of-the-art video generation models against our benchmark. We
find that NeuS-V demonstrates over 5x higher correlation with human
evaluations compared to existing metrics. Our evaluation further reveals
that current video generation models perform poorly on these temporally complex
prompts, highlighting the need for future work in improving text-to-video
generation capabilities.
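A toy version of checking a temporal property against per-frame scores: convert
per-frame detector confidences into Boolean propositions and evaluate a simple
"p eventually holds, and q holds at some later frame" specification along the
frame sequence. NeuS-V itself builds a formal Temporal Logic specification and
checks it against a video automaton; the schematic below only illustrates the
idea, with thresholds and propositions as assumptions.

```python
# Toy schematic of checking a sequencing property against per-frame proposition scores;
# NeuS-V uses formal TL specifications and an automaton-based check, not this shortcut.
def holds(scores, threshold=0.5):
    """Convert per-frame confidence scores for a proposition into Booleans."""
    return [s >= threshold for s in scores]

def eventually_then(p_scores, q_scores, threshold=0.5):
    """True if p holds at some frame and q holds at a strictly later frame."""
    p, q = holds(p_scores, threshold), holds(q_scores, threshold)
    for t, p_t in enumerate(p):
        if p_t and any(q[t + 1:]):
            return True
    return False

# e.g. prompt "a cup is placed on the table, then it is picked up":
p = [0.1, 0.8, 0.9, 0.2, 0.1]    # per-frame score for "cup on table"
q = [0.0, 0.1, 0.2, 0.7, 0.9]    # per-frame score for "cup picked up"
print(eventually_then(p, q))     # True
```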