Computer Vision and Pattern Recognition 204
★ Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets
While one commonly trains large diffusion models by collecting datasets on
target downstream tasks, it is often desired to align and finetune pretrained
diffusion models on some reward functions that are either designed by experts
or learned from small-scale datasets. Existing methods for finetuning diffusion
models typically suffer from a lack of diversity in generated samples, a lack
of prior preservation, and/or slow convergence. Inspired by recent
successes in generative flow networks (GFlowNets), a class of probabilistic
models that sample with the unnormalized density of a reward function, we
propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as
$\nabla$-GFlowNet), the first GFlowNet method that leverages the rich signal in
reward gradients, together with an objective called $\nabla$-DB and its
variant, residual $\nabla$-DB, designed for prior-preserving diffusion alignment.
We show that our proposed method achieves fast yet diversity- and
prior-preserving alignment of Stable Diffusion, a large-scale text-conditioned
image diffusion model, on different realistic reward functions.
comment: Technical Report (35 pages, 31 figures)
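As background for the $\nabla$-DB objective named above, the sketch below shows the standard GFlowNet detailed-balance (DB) residual in log space, the objective family that $\nabla$-DB plausibly builds on; the gradient-informed terms that distinguish $\nabla$-DB and residual $\nabla$-DB are defined in the paper and are not reproduced here (all tensor names are illustrative).

import torch

def db_residual(log_flow_s, log_pf, log_flow_s_next, log_pb):
    # Detailed balance: F(s) * P_F(s'|s) = F(s') * P_B(s|s'), enforced as a
    # squared residual in log space for a single transition s -> s'.
    return (log_flow_s + log_pf - log_flow_s_next - log_pb) ** 2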
☆ Video Motion Transfer with Diffusion Transformers
We propose DiTFlow, a method for transferring the motion of a reference video
to a newly synthesized one, designed specifically for Diffusion Transformers
(DiT). We first process the reference video with a pre-trained DiT to analyze
cross-frame attention maps and extract a patch-wise motion signal called the
Attention Motion Flow (AMF). We guide the latent denoising process in an
optimization-based, training-free manner by optimizing latents with our AMF
loss to generate videos reproducing the motion of the reference one. We also
apply our optimization strategy to transformer positional embeddings, granting
us a boost in zero-shot motion transfer capabilities. We evaluate DiTFlow
against recently published methods, outperforming all across multiple metrics
and human evaluation.
comment: Project page: https://ditflow.github.io/
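To make the optimization-based guidance concrete, here is a minimal sketch of the training-free loop described above; extract_amf and denoise_step are hypothetical stand-ins for DiTFlow's model-specific AMF extraction and denoising routines, and the loss is a plain MSE rather than the paper's exact AMF loss.

import torch
import torch.nn.functional as F

def guide_latents(latents, ref_amf, extract_amf, denoise_step,
                  n_opt_steps=5, lr=0.1):
    """Nudge the current latents so their attention-derived motion signal
    (AMF) matches that of the reference video, then continue denoising."""
    latents = latents.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([latents], lr=lr)
    for _ in range(n_opt_steps):
        amf = extract_amf(latents)            # patch-wise motion signal
        loss = F.mse_loss(amf, ref_amf)       # match the reference motion
        opt.zero_grad()
        loss.backward()
        opt.step()
    # resume the usual denoising step with the updated latents
    return denoise_step(latents.detach())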
☆ UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, Hui Ding, Zhe Lin, Hengshuang Zhao
We introduce UniReal, a unified framework designed to address various image
generation and editing tasks. Existing solutions often vary by tasks, yet share
fundamental principles: preserving consistency between inputs and outputs while
capturing visual variations. Inspired by recent video generation models that
effectively balance consistency and variation across frames, we propose a
unifying approach that treats image-level tasks as discontinuous video
generation. Specifically, we treat varying numbers of input and output images
as frames, enabling seamless support for tasks such as image generation,
editing, customization, and composition. Although the framework is designed for
image-level tasks, we leverage videos as a scalable source of universal supervision.
UniReal learns world dynamics from large-scale videos, demonstrating advanced
capability in handling shadows, reflections, pose variation, and object
interaction, while also exhibiting emergent capability for novel applications.
comment: webpage: https://xavierchen34.github.io/UniReal-Page/
☆ From Slow Bidirectional to Fast Causal Video Generators
Current video diffusion models achieve impressive generation quality but
struggle in interactive applications due to bidirectional attention
dependencies. The generation of a single frame requires the model to process
the entire sequence, including the future. We address this limitation by
adapting a pretrained bidirectional diffusion transformer to a causal
transformer that generates frames on-the-fly. To further reduce latency, we
extend distribution matching distillation (DMD) to videos, distilling a
50-step diffusion model into a 4-step generator. To enable stable and
high-quality distillation, we introduce a student initialization scheme based
on the teacher's
ODE trajectories, as well as an asymmetric distillation strategy that
supervises a causal student model with a bidirectional teacher. This approach
effectively mitigates error accumulation in autoregressive generation, allowing
long-duration video synthesis despite training on short clips. Our model
supports fast streaming generation of high-quality videos at 9.4 FPS on a
single GPU thanks to KV caching. Our approach also enables streaming
video-to-video translation, image-to-video, and dynamic prompting in a
zero-shot manner. We will release the code based on an open-source model in the
future.
comment: Project Page: https://causvid.github.io/
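A rough sketch of the streaming, KV-cached generation loop implied above is given below; model.init_cache, model.denoise, and model.update_cache are hypothetical placeholders for a distilled causal generator's interfaces, assuming a 4-step denoiser per frame.

import torch

def stream_frames(model, first_frame_latent, n_frames, n_denoise_steps=4):
    """Generate frames on-the-fly: each new frame is denoised in a few steps
    while attending to cached keys/values of all previously emitted frames."""
    kv_cache = model.init_cache()              # past-frame keys/values
    yield first_frame_latent
    frame = first_frame_latent
    for _ in range(n_frames - 1):
        latent = torch.randn_like(frame)       # start the next frame from noise
        for step in range(n_denoise_steps):    # few-step distilled denoiser
            latent = model.denoise(latent, step, kv_cache=kv_cache)
        kv_cache = model.update_cache(kv_cache, latent)
        frame = latent
        yield latent                           # frames are emitted as they finish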
☆ PETALface: Parameter Efficient Transfer Learning for Low-resolution Face Recognition WACV 2025
Pre-training on large-scale datasets and utilizing margin-based loss
functions have been highly successful in training models for high-resolution
face recognition. However, these models struggle with low-resolution face
datasets, in which the faces lack the facial attributes necessary for
distinguishing different faces. Full fine-tuning on low-resolution datasets, a
naive method for adapting the model, yields inferior performance due to
catastrophic forgetting of pre-trained knowledge. Additionally, the domain
difference between high-resolution (HR) gallery images and low-resolution (LR)
probe images in low-resolution datasets makes it difficult for a single
fine-tuned model to adapt to both gallery and probe images. To this end, we
propose PETALface, a Parameter-Efficient Transfer Learning approach for
low-resolution face recognition. Through PETALface, we attempt to solve both
the aforementioned problems. (1) We solve catastrophic forgetting by leveraging
the power of parameter-efficient fine-tuning (PEFT). (2) We introduce two
low-rank adaptation modules to the backbone, with weights adjusted based on the
input image quality to account for the difference in quality for the gallery
and probe images. To the best of our knowledge, PETALface is the first work
leveraging the power of PEFT for low-resolution face recognition. Extensive
experiments demonstrate that the proposed method outperforms full fine-tuning
on low-resolution datasets while preserving performance on high-resolution and
mixed-quality datasets, all while using only 0.48% of the parameters. Code:
https://kartik-3004.github.io/PETALface/
comment: Accepted to WACV 2025. Project Page:
https://kartik-3004.github.io/PETALface/
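A minimal sketch of the quality-gated dual low-rank adapter idea is shown below, assuming a scalar image-quality score in [0, 1]; the module structure and gating are illustrative assumptions, not PETALface's actual implementation.

import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """A frozen linear layer with two low-rank adapters whose mixing weight
    comes from an external image-quality score (illustrative only)."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # keep pre-trained weights frozen
        d_in, d_out = base.in_features, base.out_features
        self.hr_down = nn.Linear(d_in, rank, bias=False)   # high-res branch
        self.hr_up = nn.Linear(rank, d_out, bias=False)
        self.lr_down = nn.Linear(d_in, rank, bias=False)   # low-res branch
        self.lr_up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.hr_up.weight)      # adapters start as identity
        nn.init.zeros_(self.lr_up.weight)
        self.scale = alpha / rank

    def forward(self, x, quality: float):
        # quality in [0, 1]: 1 -> HR gallery image, 0 -> LR probe image
        hr = self.hr_up(self.hr_down(x))
        lr = self.lr_up(self.lr_down(x))
        return self.base(x) + self.scale * (quality * hr + (1 - quality) * lr)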
☆ From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos NeurIPS 2024
Matthew Wallingford, Anand Bhattad, Aditya Kusupati, Vivek Ramanujan, Matt Deitke, Sham Kakade, Aniruddha Kembhavi, Roozbeh Mottaghi, Wei-Chiu Ma, Ali Farhadi
Three-dimensional (3D) understanding of objects and scenes plays a key role in
humans' ability to interact with the world and has been an active area of
research in computer vision, graphics, and robotics. Large-scale synthetic and
object-centric 3D datasets have been shown to be effective in training models that
have 3D understanding of objects. However, applying a similar approach to
real-world objects and scenes is difficult due to a lack of large-scale data.
Videos are a potential source for real-world 3D data, but finding diverse yet
corresponding views of the same content has proven difficult at scale.
Furthermore, standard videos come with fixed viewpoints, determined at the time
of capture. This restricts the ability to access scenes from a variety of more
diverse and potentially useful perspectives. We argue that large-scale 360
videos can address these limitations by providing scalable corresponding frames
from diverse views. In this paper, we introduce 360-1M, a 360 video dataset,
and a process for efficiently finding corresponding frames from diverse
viewpoints at scale. We train our diffusion-based model, Odin, on 360-1M.
Empowered by the largest real-world, multi-view dataset to date, Odin is able
to freely generate novel views of real-world scenes. Unlike previous methods,
Odin can move the camera through the environment, enabling the model to infer
the geometry and layout of the scene. Additionally, we show improved
performance on standard novel view synthesis and 3D reconstruction benchmarks.
comment: NeurIPS 2024. For project page, see
https://mattwallingford.github.io/ODIN
☆ BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Sara Pieri, Saeed Yahya Alseiari, Shanavas Cholakkal, Khaled Aldahmani, Fahad Khan, Rao Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal
This paper introduces BiMediX2, a bilingual (Arabic-English) Bio-Medical
EXpert Large Multimodal Model (LMM) with a unified architecture that integrates
text and visual modalities, enabling advanced image understanding and medical
applications. BiMediX2 leverages the Llama3.1 architecture and integrates text
and visual capabilities to facilitate seamless interactions in both English and
Arabic, supporting text-based inputs and multi-turn conversations involving
medical images. The model is trained on an extensive bilingual healthcare
dataset consisting of 1.6M samples of diverse medical interactions for both
text and image modalities, mixed in Arabic and English. We also propose the
first bilingual GPT-4o based medical LMM benchmark named BiMed-MBench. BiMediX2
is benchmarked on both text-based and image-based tasks, achieving
state-of-the-art performance across several medical benchmarks. It outperforms
recent state-of-the-art models in medical LLM evaluation benchmarks. Our model
also sets a new benchmark in multimodal medical evaluations with over 9%
improvement in English and over 20% in Arabic evaluations. Additionally, it
surpasses GPT-4 by around 9% in UPHILL factual accuracy evaluations and excels
in various medical Visual Question Answering, Report Generation, and Report
Summarization tasks. The project page, including source code and the trained
model, is available at https://github.com/mbzuai-oryx/BiMediX2.
☆ Test-time Correction with Human Feedback: An Online 3D Detection System via Visual Prompting
This paper introduces the Test-time Correction (TTC) system, a novel online 3D
detection system designed for online correction of test-time errors via human
feedback, aiming to guarantee the safety of deployed autonomous driving systems.
Unlike well-studied offline 3D detectors frozen at inference, TTC explores the
capability of instant online error rectification. By leveraging user feedback
via interactive prompts on a frame, e.g., a simple click or a drawn box, TTC
could immediately update the corresponding detection results for future
streaming inputs, even though the model is deployed with fixed parameters. This
enables autonomous driving systems to adapt to new scenarios immediately and
decrease deployment risks reliably without additional expensive training. To
achieve such a TTC system, we equip existing 3D detectors with an Online
Adapter (OA) module, a prompt-driven query generator for online correction. At
the core of the OA module are visual prompts: images of missed objects of
interest that guide the corresponding detection and subsequent tracking. These
visual prompts, collected from objects missed during online inference, are
maintained in a visual prompt buffer for continuous error correction in subsequent
frames. By doing so, TTC consistently detects previously missed objects and
immediately lowers driving risks. It achieves reliable, versatile, and adaptive
driving autonomy. Extensive experiments demonstrate significant gains in instant
error rectification over pre-trained 3D detectors, even in challenging
scenarios with limited labels, zero-shot detection, and adverse conditions. We
hope this work will inspire the community to investigate online rectification
systems for autonomous driving post-deployment. Code will be publicly shared.
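The visual prompt buffer described above might look roughly like the following sketch, assuming numpy image frames and a detector exposing a hypothetical encode_visual_prompt method; the real OA module is model-specific.

from collections import deque

class VisualPromptBuffer:
    """Store crops of user-indicated missed objects so they keep guiding
    detection in subsequent frames (illustrative sketch)."""
    def __init__(self, max_prompts=50):
        self.prompts = deque(maxlen=max_prompts)

    def add_from_feedback(self, frame, box):
        """frame: (H, W, 3) numpy array; box: (x1, y1, x2, y2) from a user
        click or drawn box. Crop the missed object and store it."""
        x1, y1, x2, y2 = box
        self.prompts.append(frame[y1:y2, x1:x2].copy())

    def queries_for(self, detector, frame):
        """Turn buffered crops into extra detection queries for this frame."""
        return [detector.encode_visual_prompt(p, frame) for p in self.prompts]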
☆ Learning Visual Generative Priors without Text
Shuailei Ma, Kecheng Zheng, Ying Wei, Wei Wu, Fan Lu, Yifei Zhang, Chen-wei Xie, Jiapeng Zhu, Yujun Shen
Although text-to-image (T2I) models have recently thrived as visual
generative priors, their reliance on high-quality text-image pairs makes
scaling up expensive. We argue that grasping the cross-modality alignment is
not a necessity for a sound visual generative prior, whose focus should be on
texture modeling. Such a philosophy inspires us to study image-to-image (I2I)
generation, where models can learn from in-the-wild images in a self-supervised
manner. We first develop a pure vision-based training framework, Lumos, and
confirm the feasibility and the scalability of learning I2I models. We then
find that, as an upstream task of T2I, our I2I model serves as a more
foundational visual prior and achieves on-par or better performance than
existing T2I models using only 1/10 text-image pairs for fine-tuning. We
further demonstrate the superiority of I2I priors over T2I priors on some
text-irrelevant visual generative tasks, like image-to-3D and image-to-video.
☆ Make-A-Texture: Fast Shape-Aware Texture Generation in 3 Seconds WACV 2025
Xiaoyu Xiang, Liat Sless Gorelik, Yuchen Fan, Omri Armstrong, Forrest Iandola, Yilei Li, Ita Lifshitz, Rakesh Ranjan
We present Make-A-Texture, a new framework that efficiently synthesizes
high-resolution texture maps from textual prompts for given 3D geometries. Our
approach progressively generates textures that are consistent across multiple
viewpoints with a depth-aware inpainting diffusion model, in an optimized
sequence of viewpoints determined by an automatic view selection algorithm.
A significant feature of our method is its remarkable efficiency, achieving a
full texture generation within an end-to-end runtime of just 3.07 seconds on a
single NVIDIA H100 GPU, significantly outperforming existing methods. Such an
acceleration is achieved by optimizations in the diffusion model and a
specialized backprojection method. Moreover, our method reduces artifacts in
the backprojection phase by selectively masking out non-frontal faces and
internal faces of open-surfaced objects.
Experimental results demonstrate that Make-A-Texture matches or exceeds the
quality of other state-of-the-art methods. Our work significantly improves the
applicability and practicality of texture generation models for real-world 3D
content creation, including interactive creation and text-guided texture
editing.
comment: Accepted to WACV 2025
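The non-frontal-face masking mentioned above can be illustrated with a small sketch, assuming per-face unit normals and a unit view direction; the threshold is illustrative, and the handling of internal faces of open surfaces is omitted.

import numpy as np

def frontal_face_mask(face_normals, view_dir, cos_threshold=0.3):
    """face_normals: (F, 3) unit normals; view_dir: (3,) unit vector pointing
    from the surface toward the camera. Returns a boolean mask of faces to
    keep during backprojection."""
    cos_angle = face_normals @ view_dir
    return cos_angle > cos_threshold       # keep only sufficiently frontal faces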
☆ Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation
Jingxi Chen, Brandon Y. Feng, Haoming Cai, Tianfu Wang, Levi Burner, Dehao Yuan, Cornelia Fermuller, Christopher A. Metzler, Yiannis Aloimonos
Video Frame Interpolation aims to recover realistic missing frames between
observed frames, generating a high-frame-rate video from a low-frame-rate
video. However, without additional guidance, the large motion between frames
makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI)
addresses this challenge by using sparse, high-temporal-resolution event
measurements as motion guidance. This guidance allows EVFI methods to
significantly outperform frame-only methods. However, to date, EVFI methods
have relied on a limited set of paired event-frame training data, severely
limiting their performance and generalization capabilities. In this work, we
overcome the limited data challenge by adapting pre-trained video diffusion
models trained on internet-scale datasets to EVFI. We experimentally validate
our approach on real-world EVFI datasets, including a new one that we
introduce. Our method outperforms existing approaches and generalizes far
better across cameras.
☆ SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, Di Zhang
Recent advancements in video diffusion models have shown exceptional
abilities in simulating real-world dynamics and maintaining 3D consistency.
This progress inspires us to investigate the potential of these models to
ensure dynamic consistency across various viewpoints, a highly desirable
feature for applications such as virtual filming. Unlike existing methods
focused on multi-view generation of single objects for 4D reconstruction, our
interest lies in generating open-world videos from arbitrary viewpoints,
incorporating 6 DoF camera poses. To achieve this, we propose a plug-and-play
module that enhances a pre-trained text-to-video model for multi-camera video
generation, ensuring consistent content across different viewpoints.
Specifically, we introduce a multi-view synchronization module to maintain
appearance and geometry consistency across these viewpoints. Given the scarcity
of high-quality training data, we design a hybrid training scheme that
leverages multi-camera images and monocular videos to supplement Unreal
Engine-rendered multi-camera videos. Furthermore, our method enables intriguing
extensions, such as re-rendering a video from novel viewpoints. We also release
a multi-view synchronized video dataset, named SynCamVideo-Dataset. Project
page: https://jianhongbai.github.io/SynCamMaster/.
comment: Project page: https://jianhongbai.github.io/SynCamMaster/
☆ 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, Dahua Lin
This paper aims to manipulate multi-entity 3D motions in video generation.
Previous methods on controllable video generation primarily leverage 2D control
signals to manipulate object motions and have achieved remarkable synthesis
results. However, 2D control signals are inherently limited in expressing the
3D nature of object motions. To overcome this problem, we introduce
3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D
space, given user-desired 6DoF pose (location and rotation) sequences of
entities. At the core of our approach is a plug-and-play 3D-motion grounded
object injector that fuses multiple input entities with their respective 3D
trajectories through a gated self-attention mechanism. In addition, we exploit
an injector architecture to preserve the video diffusion prior, which is
crucial for generalization ability. To mitigate video quality degradation, we
introduce a domain adaptor during training and employ an annealed sampling
strategy during inference. To address the lack of suitable training data, we
construct a 360-Motion Dataset, which first correlates collected 3D human and
animal assets with GPT-generated trajectories and then captures their motion
with 12 evenly surrounding cameras on diverse 3D UE platforms. Extensive experiments
show that 3DTrajMaster sets a new state-of-the-art in both accuracy and
generalization for controlling multi-entity 3D motions. Project page:
http://fuxiao0719.github.io/projects/3dtrajmaster
comment: Project Page & Code & Data:
http://fuxiao0719.github.io/projects/3dtrajmaster
☆ SAT: Spatial Aptitude Training for Multimodal Language Models
Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, Kate Saenko
Spatial perception is a fundamental component of intelligence. While many
studies highlight that large multimodal language models (MLMs) struggle to
reason about space, they only test for static spatial reasoning, such as
categorizing the relative positions of objects. Meanwhile, real-world
deployment requires dynamic capabilities like perspective-taking and egocentric
action recognition. As a roadmap to improving spatial intelligence, we
introduce SAT, Spatial Aptitude Training, which goes beyond static relative
object position questions to more dynamic tasks. SAT contains 218K
question-answer pairs for 22K synthetic scenes across a training and testing
set. Generated using a photo-realistic physics engine, our dataset can be
arbitrarily scaled and easily extended to new actions, scenes, and 3D assets.
We find that even MLMs that perform relatively well on static questions
struggle to accurately answer dynamic spatial questions. Further, we show that
SAT instruction-tuning data improves not only dynamic spatial reasoning on SAT,
but also zero-shot performance on existing real-image spatial benchmarks:
$23\%$ on CVBench, $8\%$ on the harder BLINK benchmark, and $18\%$ on VSR. When
instruction-tuned on SAT, our 13B model matches larger proprietary MLMs like
GPT4-V and Gemini-3-1.0 in spatial reasoning. Our data/code is available at
http://arijitray1993.github.io/SAT/ .
comment: Project webpage: http://arijitray1993.github.io/SAT/
☆ PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation
Audio-driven talking face generation is a challenging task in digital
communication. Despite significant progress in the area, most existing methods
concentrate on audio-lip synchronization, often overlooking aspects such as
visual quality, customization, and generalization that are crucial to producing
realistic talking faces. To address these limitations, we introduce a novel,
customizable one-shot audio-driven talking face generation framework, named
PortraitTalk. Our proposed method utilizes a latent diffusion framework
consisting of two main components: IdentityNet and AnimateNet. IdentityNet is
designed to preserve identity features consistently across the generated video
frames, while AnimateNet aims to enhance temporal coherence and motion
consistency. This framework also integrates an audio input with the reference
images, thereby reducing the reliance on reference-style videos prevalent in
existing approaches. A key innovation of PortraitTalk is the incorporation of
text prompts through decoupled cross-attention mechanisms, which significantly
expands creative control over the generated videos. Through extensive
experiments, including a newly developed evaluation metric, our model
demonstrates superior performance over the state-of-the-art methods, setting a
new standard for the generation of customizable realistic talking faces
suitable for real-world applications.
☆ On Motion Blur and Deblurring in Visual Place Recognition
Visual Place Recognition (VPR) in mobile robotics enables robots to localize
themselves by recognizing previously visited locations using visual data. While
the reliability of VPR methods has been extensively studied under conditions
such as changes in illumination, season, weather and viewpoint, the impact of
motion blur is relatively unexplored despite its relevance not only in rapid
motion scenarios but also in low-light conditions where longer exposure times
are necessary. Similarly, the role of image deblurring in enhancing VPR
performance under motion blur has received limited attention so far. This paper
bridges these gaps by introducing a new benchmark designed to evaluate VPR
performance under the influence of motion blur and image deblurring. The
benchmark includes three datasets that encompass a wide range of motion blur
intensities, providing a comprehensive platform for analysis. Experimental
results with several well-established VPR and image deblurring methods provide
new insights into the effects of motion blur and the potential improvements
achieved through deblurring. Building on these findings, the paper proposes
adaptive deblurring strategies for VPR, designed to effectively manage motion
blur in dynamic, real-world scenarios.
☆ Multi-Shot Character Consistency for Text-to-Video Generation
Text-to-video models have made significant strides in generating short video
clips from textual descriptions. Yet, a significant challenge remains:
generating several video shots of the same characters, preserving their
identity without hurting video quality, dynamics, and responsiveness to text
prompts. We present Video Storyboarding, a training-free method to enable
pretrained text-to-video models to generate multiple shots with consistent
characters, by sharing features between them. Our key insight is that
self-attention query features (Q) encode both motion and identity. When
features are shared, this creates a hard-to-avoid trade-off between preserving
character identity and keeping videos dynamic. To address this issue, we introduce a
novel query injection strategy that balances identity preservation and natural
motion retention. This approach improves upon naive consistency techniques
applied to videos, which often struggle to maintain this delicate equilibrium.
Our experiments demonstrate significant improvements in character consistency
across scenes while maintaining high-quality motion and text alignment. These
results offer insights into critical stages of video generation and the
interplay of structure and motion in video diffusion models.
comment: Project page:
https://research.nvidia.com/labs/par/video_storyboarding
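One hedged way to picture the query injection idea is the blend below, where self-attention queries from a reference shot are mixed into the current shot's queries with a schedule that decays over denoising steps; the schedule and blend weights are illustrative assumptions, not the paper's exact strategy.

import torch

def inject_queries(q_current, q_reference, step, total_steps, max_blend=0.8):
    """Blend reference-shot queries into the current shot's queries.
    Inject more strongly early in denoising (when identity/structure forms)
    and fade out later so per-shot motion is preserved."""
    blend = max_blend * (1.0 - step / max(total_steps - 1, 1))
    return (1.0 - blend) * q_current + blend * q_reference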
☆ LoRA3D: Low-Rank Self-Calibration of 3D Geometric Foundation Models
Emerging 3D geometric foundation models, such as DUSt3R, offer a promising
approach for in-the-wild 3D vision tasks. However, due to the high-dimensional
nature of the problem space and scarcity of high-quality 3D data, these
pre-trained models still struggle to generalize to many challenging
circumstances, such as limited view overlap or low lighting. To address this,
we propose LoRA3D, an efficient self-calibration pipeline to
$\textit{specialize}$ the pre-trained models to target scenes using their own
multi-view predictions. Taking sparse RGB images as input, we leverage robust
optimization techniques to refine multi-view predictions and align them into a
global coordinate frame. In particular, we incorporate prediction confidence
into the geometric optimization process, automatically re-weighting the
confidence to better reflect point estimation accuracy. We use the calibrated
confidence to generate high-quality pseudo labels for the calibrating views and
use low-rank adaptation (LoRA) to fine-tune the models on the pseudo-labeled
data. Our method does not require any external priors or manual labels. It
completes the self-calibration process on a $\textbf{single standard GPU within
just 5 minutes}$. Each low-rank adapter requires only $\textbf{18MB}$ of
storage. We evaluated our method on $\textbf{more than 160 scenes}$ from the
Replica, TUM and Waymo Open datasets, achieving up to $\textbf{88\% performance
improvement}$ on 3D reconstruction, multi-view pose estimation and novel-view
rendering.
☆ StyleMaster: Stylize Your Video with Artistic Generation and Translation
Style control has been popular in video generation models. Existing methods
often generate videos far from the given style, cause content leakage, and
struggle to transfer one video to the desired style. Our first observation is
that the style extraction stage matters, whereas existing methods emphasize
global style but ignore local textures. In order to bring texture features
while preventing content leakage, we filter content-related patches while
retaining style ones based on prompt-patch similarity; for global style
extraction, we generate a paired style dataset through model illusion to
facilitate contrastive learning, which greatly enhances the absolute style
consistency. Moreover, to fill in the image-to-video gap, we train a
lightweight motion adapter on still videos, which implicitly enhances
stylization extent, and enables our image-trained model to be seamlessly
applied to videos. Benefiting from these efforts, our approach, StyleMaster, not
only achieves significant improvement in both style resemblance and temporal
coherence, but also can easily generalize to video style transfer with a gray
tile ControlNet. Extensive experiments and visualizations demonstrate that
StyleMaster significantly outperforms competitors, effectively generating
high-quality stylized videos that align with textual content and closely
resemble the style of reference images. Our project page is at
https://zixuan-ye.github.io/stylemaster
comment: Project webpage available at https://zixuan-ye.github.io/stylemaster
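A hedged sketch of the prompt-patch filtering step is given below: reference-image patches whose embeddings align most with the content prompt are dropped, keeping the rest as style tokens. The encoders, similarity measure, and keep ratio are assumptions, not StyleMaster's exact procedure.

import torch
import torch.nn.functional as F

def select_style_patches(patch_embeds, prompt_embed, keep_ratio=0.5):
    """patch_embeds: (N, D) patch features; prompt_embed: (D,) text feature.
    Returns the patches least aligned with the prompt, i.e. those most likely
    to carry style rather than content."""
    sim = F.cosine_similarity(patch_embeds, prompt_embed[None, :], dim=-1)
    k = max(1, int(keep_ratio * patch_embeds.shape[0]))
    idx = torch.topk(-sim, k).indices      # lowest prompt similarity first
    return patch_embeds[idx]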
☆ Image Retrieval with Intra-Sweep Representation Learning for Neck Ultrasound Scanning Guidance
Purpose: Intraoperative ultrasound (US) can enhance real-time visualization
in transoral robotic surgery. The surgeon creates a mental map with a
pre-operative scan. Then, a surgical assistant performs freehand US scanning
during the surgery while the surgeon operates at the remote surgical console.
Communicating the target scanning plane in the surgeon's mental map is
difficult. Automatic image retrieval can help match intraoperative images to
preoperative scans, guiding the assistant to adjust the US probe toward the
target plane. Methods: We propose a self-supervised contrastive learning
approach to match intraoperative US views to a preoperative image database. We
introduce a novel contrastive learning strategy that leverages intra-sweep
similarity and US probe location to improve feature encoding. Additionally, our
model incorporates a flexible threshold to reject unsatisfactory matches.
Results: Our method achieves 92.30% retrieval accuracy on simulated data and
outperforms state-of-the-art temporal-based contrastive learning approaches.
Our ablation study demonstrates that using probe location in the optimization
goal improves image representation, suggesting that semantic information can be
extracted from probe location. We also present our approach on real patient
data to show the feasibility of the proposed US probe localization system
despite tissue deformation from tongue retraction. Conclusion: Our contrastive
learning method, which utilizes intra-sweep similarity and US probe location,
enhances US image representation learning. We also demonstrate the feasibility
of using our image retrieval method to provide neck US localization on real
patient US after tongue retraction.
comment: 12 pages, 5 figures
☆ GASP: Gaussian Avatars with Synthetic Priors
Jack Saunders, Charlie Hewitt, Yanan Jian, Marek Kowalski, Tadas Baltrusaitis, Yiye Chen, Darren Cosker, Virginia Estellers, Nicholas Gyde, Vinay P. Namboodiri, Benjamin E Lundell
Gaussian Splatting has changed the game for real-time photo-realistic
rendering. One of the most popular applications of Gaussian Splatting is to
create animatable avatars, known as Gaussian Avatars. Recent works have pushed
the boundaries of quality and rendering efficiency but suffer from two main
limitations. Either they require expensive multi-camera rigs to produce avatars
with free-view rendering, or they can be trained with a single camera but only
rendered at high quality from this fixed viewpoint. An ideal model would be
trained using a short monocular video or image from available hardware, such as
a webcam, and rendered from any view. To this end, we propose GASP: Gaussian
Avatars with Synthetic Priors. To overcome the limitations of existing
datasets, we exploit the pixel-perfect nature of synthetic data to train a
Gaussian Avatar prior. By fitting this prior model to a single photo or video
and fine-tuning it, we get a high-quality Gaussian Avatar, which supports
360$^\circ$ rendering. Our prior is only required for fitting, not inference,
enabling real-time application. Through our method, we obtain high-quality,
animatable avatars from limited data, which can be animated and rendered at
70 fps on commercial hardware. See our project page
(https://microsoft.github.io/GASP/) for results.
comment: Project page: https://microsoft.github.io/GASP/
☆ SKIPNet: Spatial Attention Skip Connections for Enhanced Brain Tumor Classification
Early detection of brain tumors through magnetic resonance imaging (MRI) is
essential for timely treatment, yet access to diagnostic facilities remains
limited in remote areas. Gliomas, the most common primary brain tumors, arise
from the carcinogenesis of glial cells in the brain and spinal cord, with
glioblastoma patients having a median survival time of less than 14 months. MRI
serves as a non-invasive and effective method for tumor detection, but manual
segmentation of brain MRI scans has traditionally been a labor-intensive task
for neuroradiologists. Recent advancements in computer-aided design (CAD),
machine learning (ML), and deep learning (DL) offer promising solutions for
automating this process. This study proposes an automated deep learning model
for brain tumor detection and classification using MRI data. The model,
incorporating spatial attention, achieved 96.90% accuracy, enhancing the
aggregation of contextual information for better pattern recognition.
Experimental results demonstrate that the proposed approach outperforms
baseline models, highlighting its robustness and potential for advancing
automated MRI-based brain tumor analysis.
☆ STIV: Scalable Text and Image Conditioned Video Generation
Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, Cha Chen, Yiran Fei, Yifan Jiang, Lezhi Li, Yizhou Sun, Kai-Wei Chang, Yinfei Yang
The field of video generation has made remarkable advancements, yet there
remains a pressing need for a clear, systematic recipe that can guide the
development of robust and scalable models. In this work, we present a
comprehensive study that systematically explores the interplay of model
architectures, training recipes, and data curation strategies, culminating in a
simple and scalable text-image-conditioned video generation method, named STIV.
Our framework integrates image condition into a Diffusion Transformer (DiT)
through frame replacement, while incorporating text conditioning via a joint
image-text conditional classifier-free guidance. This design enables STIV to
perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks
simultaneously. Additionally, STIV can be easily extended to various
applications, such as video prediction, frame interpolation, multi-view
generation, and long video generation. With comprehensive ablation studies
on T2I, T2V, and TI2V, STIV demonstrates strong performance despite its simple
design. An 8.7B model at 512 resolution achieves 83.1 on VBench T2V,
surpassing both leading open- and closed-source models like CogVideoX-5B, Pika,
Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result
of 90.1 on the VBench I2V task at 512 resolution. By providing a transparent and
extensible recipe for building cutting-edge video generation models, we aim to
empower future research and accelerate progress toward more versatile and
reliable video generation solutions.
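As an illustration of joint image-text classifier-free guidance, the sketch below combines unconditional, image-conditioned, and image-plus-text-conditioned noise predictions with two guidance scales; this is one standard two-condition formulation, and STIV's exact weighting may differ.

def joint_cfg(eps_uncond, eps_img, eps_img_txt, w_img=1.5, w_txt=7.5):
    """eps_*: noise predictions with no condition, image condition only, and
    image plus text conditions. Guidance scales are illustrative."""
    return (eps_uncond
            + w_img * (eps_img - eps_uncond)      # push toward the image condition
            + w_txt * (eps_img_txt - eps_img))    # then toward the text condition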
☆ ObjCtrl-2.5D: Training-free Object Control with Camera Poses
This study aims to achieve more precise and versatile object control in
image-to-video (I2V) generation. Current methods typically represent the
spatial movement of target objects with 2D trajectories, which often fail to
capture user intention and frequently produce unnatural results. To enhance
control, we present ObjCtrl-2.5D, a training-free object control approach that
uses a 3D trajectory, extended from a 2D trajectory with depth information, as
a control signal. By modeling object movement as camera movement, ObjCtrl-2.5D
represents the 3D trajectory as a sequence of camera poses, enabling object
motion control using an existing camera motion control I2V generation model
(CMC-I2V) without training. To adapt the CMC-I2V model originally designed for
global motion control to handle local object motion, we introduce a module to
isolate the target object from the background, enabling independent local
control. In addition, we devise an effective way to achieve more accurate
object control by sharing low-frequency warped latent within the object's
region across frames. Extensive experiments demonstrate that ObjCtrl-2.5D
significantly improves object control accuracy compared to training-free
methods and offers more diverse control capabilities than training-based
approaches using 2D trajectories, enabling complex effects like object
rotation. Code and results are available at
https://wzhouxiff.github.io/projects/ObjCtrl-2.5D/.
comment: Project Page: https://wzhouxiff.github.io/projects/ObjCtrl-2.5D/
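The lifting of a 2D trajectory into camera poses can be sketched as below, assuming known camera intrinsics and per-point depths; object motion is expressed as an opposite camera translation with identity rotation for simplicity, which only approximates ObjCtrl-2.5D's actual pose construction.

import numpy as np

def trajectory_to_poses(traj_2d, depths, K):
    """traj_2d: (T, 2) pixel coordinates; depths: (T,) depths; K: (3, 3)
    intrinsics. Returns a list of 4x4 camera-to-world poses relative to the
    first frame: the camera center translates opposite to the object's
    displacement, with identity rotation kept for simplicity."""
    K_inv = np.linalg.inv(K)
    pts = [depths[t] * (K_inv @ np.array([u, v, 1.0]))
           for t, (u, v) in enumerate(traj_2d)]   # back-project each 2D point
    poses = []
    for t in range(len(pts)):
        T = np.eye(4)
        T[:3, 3] = -(pts[t] - pts[0])   # camera moves opposite to the object
        poses.append(T)
    return poses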
☆ ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer
Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun
The recent surge of interest in comprehensive multimodal models has
necessitated the unification of diverse modalities. However, the unification
suffers from disparate methodologies. Continuous visual generation necessitates
the full-sequence diffusion-based approach, despite its divergence from the
autoregressive modeling in the text domain. We posit that autoregressive
modeling, i.e., predicting the future based on past deterministic experience,
remains crucial in developing both a visual generation model and a potential
unified multimodal model. In this paper, we explore an interpolation between
autoregressive modeling and full-sequence diffusion to model visual
information. At its core, we present ACDiT, an Autoregressive blockwise
Conditional Diffusion Transformer, where the block size of diffusion, i.e., the
size of autoregressive units, can be flexibly adjusted to interpolate between
token-wise autoregression and full-sequence diffusion. ACDiT is easy to
implement, as simple as creating a Skip-Causal Attention Mask (SCAM) during
training. During inference, the process iterates between diffusion denoising
and autoregressive decoding that can make full use of KV-Cache. We verify the
effectiveness of ACDiT on image and video generation tasks. We also demonstrate
that, benefiting from autoregressive modeling, ACDiT can be seamlessly used in
visual understanding tasks despite being trained on the diffusion objective.
The analysis of the trade-off between autoregressive modeling and diffusion
demonstrates the potential of ACDiT to be used in long-horizon visual
generation tasks. These strengths make it promising as the backbone of future
unified models.
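A minimal sketch of a blockwise-causal attention mask, the basic pattern that interpolates between token-wise autoregression and full-sequence diffusion, is shown below; the paper's Skip-Causal Attention Mask (SCAM) additionally handles paired noisy and clean blocks, which is omitted here.

import torch

def blockwise_causal_mask(n_tokens, block_size):
    """allowed[i, j] is True when token i may attend to token j: full
    attention within a block, causal attention across blocks."""
    block_id = torch.arange(n_tokens) // block_size
    allowed = block_id[:, None] >= block_id[None, :]
    return allowed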
☆ GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning
Yicheng Wang, Zhikang Zhang, Jue Wang, David Fan, Zhenlin Xu, Linda Liu, Xiang Hao, Vimal Bhat, Xinyu Li
In various video-language learning tasks, the challenge of achieving
cross-modality alignment with multi-grained data persists. We propose a method
to tackle this challenge from two crucial perspectives: data and modeling.
Given the absence of a multi-grained video-text pretraining dataset, we
introduce a Granularity EXpansion (GEX) method with Integration and Compression
operations to expand the granularity of a single-grained dataset. To better
model multi-grained data, we introduce an Iterative Approximation Module (IAM),
which embeds multi-grained videos and texts into a unified, low-dimensional
semantic space while preserving essential information for cross-modal
alignment. Furthermore, GEXIA is highly scalable with no restrictions on the
number of video-text granularities for alignment. We evaluate our work on three
categories of video tasks across seven benchmark datasets, showcasing
state-of-the-art or comparable performance. Remarkably, our model excels in
tasks involving long-form video understanding, even though the pretraining
dataset only contains short video clips.
☆ SimVS: Simulating World Inconsistencies for Robust View Synthesis
Alex Trevithick, Roni Paiss, Philipp Henzler, Dor Verbin, Rundi Wu, Hadi Alzayer, Ruiqi Gao, Ben Poole, Jonathan T. Barron, Aleksander Holynski, Ravi Ramamoorthi, Pratul P. Srinivasan
Novel-view synthesis techniques achieve impressive results for static scenes
but struggle when faced with the inconsistencies inherent to casual capture
settings: varying illumination, scene motion, and other unintended effects that
are difficult to model explicitly. We present an approach for leveraging
generative video models to simulate the inconsistencies in the world that can
occur during capture. We use this process, along with existing multi-view
datasets, to create synthetic data for training a multi-view harmonization
network that is able to reconcile inconsistent observations into a consistent
3D scene. We demonstrate that our world-simulation strategy significantly
outperforms traditional augmentation methods in handling real-world scene
variations, thereby enabling highly accurate static 3D reconstructions in the
presence of a variety of challenging inconsistencies. Project page:
https://alextrevithick.github.io/simvs
comment: Project page: https://alextrevithick.github.io/simvs
☆ Leveraging Content and Context Cues for Low-Light Image Enhancement IEEE
Low-light conditions have an adverse impact on machine cognition, limiting
the performance of computer vision systems in real life. Since low-light data
is limited and difficult to annotate, we focus on image processing to enhance
low-light images and improve the performance of any downstream task model,
instead of fine-tuning each of the models which can be prohibitively expensive.
We propose to improve the existing zero-reference low-light enhancement by
leveraging the CLIP model to capture image prior and for semantic guidance.
Specifically, we propose a data augmentation strategy, based on image sampling,
to learn an image prior via prompt learning without any need for paired or
unpaired normal-light data. Next, we propose a semantic
guidance strategy that maximally takes advantage of existing low-light
annotation by introducing both content and context cues about the image
training patches. We experimentally show, in a qualitative study, that the
proposed prior and semantic guidance help to improve the overall image contrast
and hue, as well as improve background-foreground discrimination, resulting in
reduced over-saturation and noise over-amplification, common in related
zero-reference methods. As we target machine cognition, rather than relying on
an assumed correlation between human perception and downstream task
performance, we conduct an ablation study and a comparison with related
zero-reference methods in terms of task-based performance across many
low-light datasets, including image classification and object and face
detection, showing the effectiveness of our proposed method.
comment: Accepted to the IEEE Transactions on Multimedia
☆ DriveMM: All-in-One Large Multimodal Model for Autonomous Driving
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension
and interpretation capabilities in Autonomous Driving (AD) by incorporating
large language models. Despite the advancements, current data-driven AD
approaches tend to concentrate on a single dataset and specific tasks,
neglecting their overall capabilities and ability to generalize. To bridge
these gaps, we propose DriveMM, a general large multimodal model designed to
process diverse data inputs, such as images and multi-view videos, while
performing a broad spectrum of AD tasks, including perception, prediction, and
planning. Initially, the model undergoes curriculum pre-training to process
varied visual signals and perform basic visual comprehension and perception
tasks. Subsequently, we augment and standardize various AD-related datasets to
fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To
assess the general capabilities and generalization ability, we conduct
evaluations on six public benchmarks and undertake zero-shot transfer on an
unseen dataset, where DriveMM achieves state-of-the-art performance across all
tasks. We hope DriveMM will serve as a promising solution for future end-to-end
autonomous driving applications in the real world.
☆ RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models
Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, Pavlo Molchanov
Agglomerative models have recently emerged as a powerful approach to training
vision foundation models, leveraging multi-teacher distillation from existing
models such as CLIP, DINO, and SAM. This strategy enables the efficient
creation of robust models, combining the strengths of individual teachers while
significantly reducing computational and resource demands. In this paper, we
thoroughly analyze state-of-the-art agglomerative models, identifying critical
challenges including resolution mode shifts, teacher imbalance, idiosyncratic
teacher artifacts, and an excessive number of output tokens. To address these
issues, we propose several novel solutions: multi-resolution training, mosaic
augmentation, and improved balancing of teacher loss functions. Specifically,
in the context of Vision Language Models, we introduce a token compression
technique to maintain high-resolution information within a fixed token count.
We release our top-performing models, available in multiple scales (-B, -L, -H,
and -g), alongside inference code and pretrained weights.
☆ BATIS: Bootstrapping, Autonomous Testing, and Initialization System for Quantum Dot Devices
Tyler J. Kovach, Daniel Schug, M. A. Wolfe, E. R. MacQuarrie, Patrick J. Walsh, Jared Benson, Mark Friesen, M. A. Eriksson, Justyna P. Zwolak
Semiconductor quantum dot (QD) devices have become central to advancements in
spin-based quantum computing. As the complexity of QD devices grows, manual
tuning becomes increasingly infeasible, necessitating robust and scalable
autotuning solutions. Tuning large arrays of QD qubits depends on efficient
choices of automated protocols. Here, we introduce a bootstrapping, autonomous
testing, and initialization system (BATIS), an automated framework designed to
streamline QD device testing and initialization. BATIS navigates
high-dimensional gate voltage spaces, automating essential steps such as
leakage testing and gate characterization. The current channel formation
protocol follows a novel and scalable approach that requires a single
measurement regardless of the number of channels. Demonstrated at 1.3 K on a
quad-QD Si/Si$_x$Ge$_{1-x}$ device, BATIS eliminates the need for deep
cryogenic environments during initial device diagnostics, significantly
enhancing scalability and reducing setup times. By requiring minimal prior
knowledge of the device architecture, BATIS represents a platform-agnostic
solution, adaptable to various QD systems, which bridges a critical gap in QD
autotuning.
comment: 10 pages, 3 figures
☆ FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models NeurIPS 2024
Tong Wu, Yinghao Xu, Ryan Po, Mengchen Zhang, Guandao Yang, Jiaqi Wang, Ziwei Liu, Dahua Lin, Gordon Wetzstein
Recent advances in text-to-image generation have enabled the creation of
high-quality images with diverse applications. However, accurately describing
desired visual attributes can be challenging, especially for non-experts in art
and photography. An intuitive solution involves adopting favorable attributes
from the source images. Current methods attempt to distill identity and style
from source images. However, "style" is a broad concept that includes texture,
color, and artistic elements, but does not cover other important attributes
such as lighting and dynamics. Additionally, a simplified "style" adaptation
prevents combining multiple attributes from different sources into one
generated image. In this work, we formulate a more effective approach to
decompose the aesthetics of a picture into specific visual attributes, allowing
users to apply characteristics such as lighting, texture, and dynamics from
different images. To achieve this goal, we construct, to the best of our
knowledge, the first fine-grained visual attribute dataset (FiVA). The FiVA
dataset features a well-organized taxonomy of visual attributes and includes
around 1M high-quality generated images with visual attribute annotations.
Leveraging this dataset, we propose a fine-grained visual attribute adaptation
framework (FiVA-Adapter), which decouples and adapts visual attributes from one
or more source images into a generated one. This approach enhances
user-friendly customization, allowing users to selectively apply desired
attributes to create images that meet their unique preferences and specific
content requirements.
comment: NeurIPS 2024 (Datasets and Benchmarks Track); Project page:
https://fiva-dataset.github.io/
☆ Proc-GS: Procedural Building Generation for City Assembly with 3D Gaussians
Yixuan Li, Xingjian Ran, Linning Xu, Tao Lu, Mulin Yu, Zhenzhi Wang, Yuanbo Xiangli, Dahua Lin, Bo Dai
Buildings are primary components of cities, often featuring repeated elements
such as windows and doors. Traditional 3D building asset creation is
labor-intensive and requires specialized skills to develop design rules. Recent
generative models for building creation often overlook these patterns, leading
to low visual fidelity and limited scalability. Drawing inspiration from
procedural modeling techniques used in the gaming and visual effects industry,
our method, Proc-GS, integrates procedural code into the 3D Gaussian Splatting
(3D-GS) framework, leveraging the advantages of both worlds: high-fidelity
rendering and efficient asset management. By manipulating procedural code,
we can streamline this process and generate an infinite variety of buildings.
This integration significantly reduces model size by utilizing shared
foundational assets, enabling scalable generation with precise control over
building assembly. We showcase the potential for expansive cityscape generation
while maintaining high rendering fidelity and precise control on both real and
synthetic cases.
comment: Project page: https://city-super.github.io/procgs/
☆ Analytical-Heuristic Modeling and Optimization for Low-Light Image Enhancement
Low-light image enhancement remains an open problem, and the new wave of
artificial intelligence is at the center of this problem. This work describes
the use of genetic algorithms for optimizing analytical models that can improve
the visualization of images with poor light. Genetic algorithms are part of
metaheuristic approaches, which proved helpful in solving challenging
optimization tasks. We propose two analytical methods combined with
optimization reasoning to approach a solution to the physical and computational
aspects of transforming dark images into visible ones. The experiments
demonstrate that the proposed approach ranks at the top among 26
state-of-the-art algorithms in the LOL benchmark. The results show evidence
that a simple genetic algorithm combined with analytical reasoning can defeat
the current mainstream in a challenging computer vision task through controlled
experiments and objective comparisons. This work opens interesting new research
avenues for the swarm and evolutionary computation community and others
interested in analytical and heuristic reasoning.
comment: 26 pages, 6 figures, 6 tables, 34 references
☆ TraSCE: Trajectory Steering for Concept Erasure
Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji
Recent advancements in text-to-image diffusion models have brought them to
the public spotlight, becoming widely accessible and embraced by everyday
users. However, these models have been shown to generate harmful content such
as not-safe-for-work (NSFW) images. While approaches have been proposed to
erase such abstract concepts from the models, jail-breaking techniques have
succeeded in bypassing such safety measures. In this paper, we propose TraSCE,
an approach to guide the diffusion trajectory away from generating harmful
content. Our approach is based on negative prompting, but as we show in this
paper, conventional negative prompting is not a complete solution and can
easily be bypassed in some corner cases. To address this issue, we first
propose a modification of conventional negative prompting. Furthermore, we
introduce a localized loss-based guidance that enhances the modified negative
prompting technique by steering the diffusion trajectory. We demonstrate that
our proposed method achieves state-of-the-art results on various benchmarks in
removing harmful content, including ones proposed by red teams, as well as in
erasing artistic styles and objects. Our proposed approach does not require any
training, weight modifications, or training data (neither images nor prompts), making
it easier for model owners to erase new concepts.
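For context, the sketch below shows conventional negative prompting, where the negative-prompt prediction replaces the unconditional one in classifier-free guidance; TraSCE's modification of this rule and its localized loss-based guidance are described in the paper and are not reproduced here.

def negative_prompt_cfg(eps_neg, eps_cond, guidance_scale=7.5):
    """eps_neg: prediction conditioned on the negative prompt (e.g., the
    concept to avoid); eps_cond: prediction conditioned on the user prompt.
    Conventional negative prompting substitutes eps_neg for the unconditional
    prediction in standard classifier-free guidance."""
    return eps_neg + guidance_scale * (eps_cond - eps_neg)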
☆ Bayesian Data Augmentation and Training for Perception DNN in Autonomous Aerial Vehicles
Ashik E Rasul, Humaira Tasnim, Hyung-Jin Yoon, Ayoosh Bansal, Duo Wang, Naira Hovakimyan, Lui Sha, Petros Voulgaris
Learning-based solutions have enabled incredible capabilities for autonomous
systems. Autonomous vehicles, both aerial and ground, rely on DNNs for various
integral tasks, including perception. The efficacy of supervised learning
solutions hinges on the quality of the training data. Discrepancies between
training data and operating conditions result in faults that can lead to
catastrophic incidents. However, collecting vast amounts of context-sensitive
data, with broad coverage of possible operating environments, is prohibitively
difficult. Synthetic data generation techniques for DNNs allow for the easy
exploration of diverse scenarios. However, synthetic data generation solutions
for aerial vehicles are still lacking.
This work presents a data augmentation framework for aerial vehicle
perception training, leveraging photorealistic simulation integrated with
high-fidelity vehicle dynamics. Safe landing is a crucial challenge in the
development of autonomous air taxis; therefore, the landing maneuver is chosen
as the focus of this work. With repeated simulations of landing in varying
scenarios, we assess the landing performance of a VTOL-type UAV and gather
valuable data. The landing performance is used as the objective function to
optimize the DNN through retraining. Given the high computational cost of DNN
retraining, we incorporated Bayesian Optimization in our framework that
systematically explores the data augmentation parameter space to retrain the
best-performing models. The framework allowed us to identify high-performing
data augmentation parameters that are consistently effective across different
landing scenarios. Utilizing the capabilities of this data augmentation
framework, we obtained a robust perception model. The model consistently
improved the perception-based landing success rate by at least 20% under
different lighting and weather conditions.
comment: To be published in AIAA SciTech 2025 Forum
☆ OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He
Document content extraction is crucial in computer vision, especially for
meeting the high-quality data needs of large language models (LLMs) and
retrieval-augmented generation (RAG) technologies. However, current document
parsing methods suffer from significant limitations in terms of diversity and
comprehensive evaluation. To address these challenges, we introduce
OmniDocBench, a novel multi-source benchmark designed to advance automated
document content extraction. OmniDocBench includes a meticulously curated and
annotated high-quality evaluation dataset comprising nine diverse document
types, such as academic papers, textbooks, and slides, among others. Our benchmark
provides a flexible and comprehensive evaluation framework with 19 layout
category labels and 14 attribute labels, enabling multi-level assessments
across entire datasets, individual modules, or specific data types. Using
OmniDocBench, we perform an exhaustive comparative analysis of existing modular
pipelines and multimodal end-to-end methods, highlighting their limitations in
handling document diversity and ensuring fair evaluation. OmniDocBench
establishes a robust, diverse, and fair evaluation standard for the document
content extraction field, offering crucial insights for future advancements and
fostering the development of document parsing technologies. The code and
dataset are available at https://github.com/opendatalab/OmniDocBench.
☆ PVP: Polar Representation Boost for 3D Semantic Occupancy Prediction
Recently, polar coordinate-based representations have shown promise for 3D
perceptual tasks. Compared to Cartesian methods, polar grids provide a viable
alternative, offering better detail preservation in nearby spaces while
covering larger areas. However, they face feature distortion due to non-uniform
division. To address these issues, we introduce the Polar Voxel Occupancy
Predictor (PVP), a novel 3D multi-modal predictor that operates in polar
coordinates. PVP features two key design elements to overcome distortion: a
Global Represent Propagation (GRP) module that integrates global spatial data
into 3D volumes, and a Plane Decomposed Convolution (PD-Conv) that simplifies
3D distortions into 2D convolutions. These innovations enable PVP to outperform
existing methods, achieving significant improvements in mIoU and IoU metrics on
the OpenOccupancy dataset.
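As a concrete illustration of the polar representation discussed above, the sketch below bins Cartesian lidar points into (radius, azimuth, height) voxels. The bin counts and ranges are illustrative assumptions, not PVP's configuration.
```python
# Minimal sketch of polar voxelization: Cartesian lidar points are binned into
# (radius, azimuth, height) cells. Bin counts and ranges are illustrative, not
# the configuration used by PVP.
import numpy as np

def polar_voxelize(points, r_max=50.0, z_min=-3.0, z_max=3.0,
                   n_r=128, n_theta=180, n_z=16):
    """Return integer (r, theta, z) voxel indices for each point."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)                       # in [-pi, pi]

    r_idx = np.clip((r / r_max * n_r).astype(int), 0, n_r - 1)
    t_idx = np.clip(((theta + np.pi) / (2 * np.pi) * n_theta).astype(int),
                    0, n_theta - 1)
    z_idx = np.clip(((z - z_min) / (z_max - z_min) * n_z).astype(int),
                    0, n_z - 1)
    return np.stack([r_idx, t_idx, z_idx], axis=1)

points = np.random.uniform(-40, 40, size=(1000, 3))
voxels = polar_voxelize(points)
print(voxels.shape, voxels.min(0), voxels.max(0))
```
Note that cells grow with radius, which is exactly the non-uniform division the abstract says causes feature distortion.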
☆ ViewDelta: Text-Prompted Change Detection in Unaligned Images
Detecting changes between images is a fundamental problem in computer vision
with broad applications in situational awareness, infrastructure assessment,
environment monitoring, and industrial automation. Existing supervised models
are typically limited to detecting specific types of changes, necessitating
retraining for new tasks. To address these limitations with a single approach,
we propose a novel change detection method that is the first to utilize
unaligned images and textual prompts to output a binary segmentation of changes
relevant to user-provided text. Our architecture not only enables flexible
detection across diverse change detection use cases, but also yields
state-of-the-art performance on established benchmarks. Additionally, we
release an accompanying dataset comprising 100,311 pairs of images with text
prompts and the corresponding change detection labels. We demonstrate the
effectiveness of our method both quantitatively and qualitatively on datasets
with a wide variety of viewpoints in indoor, outdoor, street level, synthetic,
and satellite images.
☆ Faster and Better 3D Splatting via Group Training
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel
view synthesis, demonstrating remarkable capability in high-fidelity scene
reconstruction through its Gaussian primitive representations. However, the
computational overhead induced by the massive number of primitives poses a
significant bottleneck to training efficiency. To overcome this challenge, we
propose Group Training, a simple yet effective strategy that organizes Gaussian
primitives into manageable groups, optimizing training efficiency and improving
rendering quality. This approach shows universal compatibility with existing
3DGS frameworks, including vanilla 3DGS and Mip-Splatting, consistently
achieving accelerated training while maintaining superior synthesis quality.
Extensive experiments reveal that our straightforward Group Training strategy
achieves up to 30% faster convergence and improved rendering quality across
diverse scenarios.
☆ RFL: Simplifying Chemical Structure Recognition with Ring-Free Language AAAI 2025
Qikai Chang, Mingjun Chen, Changpeng Pi, Pengfei Hu, Zhenrong Zhang, Jiefeng Ma, Jun Du, Baocai Yin, Jinshui Hu
The primary objective of Optical Chemical Structure Recognition is to
translate chemical structure images into corresponding markup sequences.
However, the complex two-dimensional structures of molecules, particularly
those with rings and multiple branches, present significant challenges for
current end-to-end methods to learn one-dimensional markup directly. To
overcome this limitation, we propose a novel Ring-Free Language (RFL), which
utilizes a divide-and-conquer strategy to describe chemical structures in a
hierarchical form. RFL allows complex molecular structures to be decomposed
into multiple parts, ensuring both uniqueness and conciseness while enhancing
readability. This approach significantly reduces the learning difficulty for
recognition models. Leveraging RFL, we propose a universal Molecular Skeleton
Decoder (MSD), which comprises a skeleton generation module that progressively
predicts the molecular skeleton and individual rings, along with a branch
classification module for predicting branch information. Experimental results
demonstrate that the proposed RFL and MSD can be applied to various mainstream
methods, achieving superior performance compared to state-of-the-art approaches
in both printed and handwritten scenarios. The code is available at
https://github.com/JingMog/RFL-MSD.
comment: 9 pages, 6 figures. Accepted by AAAI 2025
☆ Motion Artifact Removal in Pixel-Frequency Domain via Alternate Masks and Diffusion Model
Motion artifacts present in magnetic resonance imaging (MRI) can seriously
interfere with clinical diagnosis. Removing motion artifacts is a
straightforward solution and has been extensively studied. However, recent
works still rely heavily on paired data, and perturbations in
\textit{k}-space (frequency domain) are not well considered, which limits their
applicability in clinical settings. To address these issues, we propose a novel
unsupervised purification method which leverages pixel-frequency information of
noisy MRI images to guide a pre-trained diffusion model to recover clean MRI
images. Specifically, considering that motion artifacts are mainly concentrated
in high-frequency components in \textit{k}-space, we utilize the low-frequency
components as the guide to ensure correct tissue textures. Additionally, given
that high-frequency and pixel information are helpful for recovering shape and
detail textures, we design alternate complementary masks to simultaneously
destroy the artifact structure and exploit useful information. Quantitative
experiments are performed on datasets from different tissues and show that our
method achieves superior performance on several metrics. Qualitative
evaluations with radiologists also show that our method provides better
clinical feedback. Our code is available at https://github.com/medcx/PFAD.
comment: 12 pages, 8 figures
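To make the k-space guidance concrete, here is a minimal sketch of a low-frequency guide plus a pair of complementary masks. The circular low-pass radius and the checkerboard pattern are illustrative assumptions, not the paper's exact design.
```python
# Minimal sketch of the k-space idea described above: keep a low-frequency
# region of k-space as guidance (tissue structure) and build a pair of
# complementary masks for alternate application. The low-pass radius and
# checkerboard pattern are illustrative choices, not the paper's exact masks.
import numpy as np

def lowfreq_guide(image, keep_radius=0.15):
    """Return the low-frequency reconstruction of a 2D image and its mask."""
    k = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    lowpass = dist <= keep_radius * min(h, w)
    return np.real(np.fft.ifft2(np.fft.ifftshift(k * lowpass))), lowpass

def complementary_masks(shape):
    """Two alternating masks whose union covers every k-space entry."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    mask_a = (yy + xx) % 2 == 0
    return mask_a, ~mask_a

image = np.random.rand(128, 128)
guide, lowpass = lowfreq_guide(image)
mask_a, mask_b = complementary_masks(image.shape)
assert np.all(mask_a | mask_b)
print(guide.shape, lowpass.sum(), mask_a.sum(), mask_b.sum())
```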
☆ DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Story visualization, the task of creating visual narratives from textual
descriptions, has seen progress with text-to-image generation models. However,
these models often lack effective control over character appearances and
interactions, particularly in multi-character scenes. To address these
limitations, we propose a new task: \textbf{customized manga generation} and
introduce \textbf{DiffSensei}, an innovative framework specifically designed
for generating manga with dynamic multi-character control. DiffSensei
integrates a diffusion-based image generator with a multimodal large language
model (MLLM) that acts as a text-compatible identity adapter. Our approach
employs masked cross-attention to seamlessly incorporate character features,
enabling precise layout control without direct pixel transfer. Additionally,
the MLLM-based adapter adjusts character features to align with panel-specific
text cues, allowing flexible adjustments in character expressions, poses, and
actions. We also introduce \textbf{MangaZero}, a large-scale dataset tailored
to this task, containing 43,264 manga pages and 427,147 annotated panels,
supporting the visualization of varied character interactions and movements
across sequential frames. Extensive experiments demonstrate that DiffSensei
outperforms existing models, marking a significant advancement in manga
generation by enabling text-adaptable character customization. The project page
is https://jianzongwu.github.io/projects/diffsensei/.
comment: The project page is https://jianzongwu.github.io/projects/diffsensei/
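The masked cross-attention mentioned above can be sketched as follows: image-latent queries attend to per-character identity tokens, with a binary layout mask restricting which characters each spatial position may use. Shapes and the mask construction are illustrative assumptions, not DiffSensei's implementation.
```python
# Minimal sketch of masked cross-attention: latent spatial queries attend to
# per-character feature tokens, and a binary layout mask restricts each query
# position to the characters allowed in its region. Illustrative only.
import torch

def masked_cross_attention(queries, char_tokens, layout_mask):
    """queries: (B, N, D)   latent spatial tokens
    char_tokens: (B, M, D)  identity tokens from an adapter
    layout_mask: (B, N, M)  1 where position n may attend to character m
    """
    d = queries.shape[-1]
    scores = torch.einsum("bnd,bmd->bnm", queries, char_tokens) / d ** 0.5
    scores = scores.masked_fill(layout_mask == 0, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    attn = torch.nan_to_num(attn)          # rows with no allowed character
    return torch.einsum("bnm,bmd->bnd", attn, char_tokens)

B, N, M, D = 2, 64, 3, 32
q = torch.randn(B, N, D)
c = torch.randn(B, M, D)
mask = (torch.rand(B, N, M) > 0.5).float()
out = masked_cross_attention(q, c, mask)
print(out.shape)  # torch.Size([2, 64, 32])
```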
☆ Multimodal Contextualized Support for Enhancing Video Retrieval System
Current video retrieval systems, especially those used in competitions,
primarily focus on querying individual keyframes or images rather than encoding
an entire clip or video segment. However, queries often describe an action or
event over a series of frames, not a specific image. This results in
insufficient information when analyzing a single frame, leading to less
accurate query results. Moreover, extracting embeddings solely from images
(keyframes) does not provide enough information for models to encode
higher-level, more abstract insights inferred from the video. These models tend
to only describe the objects present in the frame, lacking a deeper
understanding. In this work, we propose a system that integrates the latest
methodologies and introduces a novel pipeline that extracts multimodal data and
incorporates information from multiple frames within a video. This enables the
model to abstract higher-level information that captures latent meanings,
focusing on what can be inferred from the video clip rather than on object
detection in a single image.
comment: 9 pages, 4 figures
☆ Mobile Video Diffusion
Video diffusion models have achieved impressive realism and controllability
but are limited by high computational demands, restricting their use on mobile
devices. This paper introduces the first mobile-optimized video diffusion
model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD),
we reduce memory and computational cost by reducing the frame resolution,
incorporating multi-scale temporal representations, and introducing two novel
pruning schemes to reduce the number of channels and temporal blocks.
Furthermore, we employ adversarial finetuning to reduce the denoising to a
single step. Our model, coined as MobileVD, is 523x more efficient (1817.2 vs.
4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents
for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are
available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/
☆ Unlocking the Potential of Reverse Distillation for Anomaly Detection AAAI 2025
Knowledge Distillation (KD) is a promising approach for unsupervised Anomaly
Detection (AD). However, the student network's over-generalization often
diminishes the crucial representation differences between teacher and student
in anomalous regions, leading to detection failures. To address this problem,
the widely accepted Reverse Distillation (RD) paradigm designs an asymmetric
teacher-student pair, using an encoder as the teacher and a decoder as the student. Yet,
the design of RD does not ensure that the teacher encoder effectively
distinguishes between normal and abnormal features or that the student decoder
generates anomaly-free features. Additionally, the absence of skip connections
results in a loss of fine details during feature reconstruction. To address
these issues, we propose RD with Expert, which introduces a novel
Expert-Teacher-Student network for simultaneous distillation of both the
teacher encoder and student decoder. The added expert network enhances the
student's ability to generate normal features and optimizes the teacher's
differentiation between normal and abnormal features, reducing missed
detections. Additionally, Guided Information Injection is designed to filter
and transfer features from teacher to student, improving detail reconstruction
and minimizing false positives. Experiments on several benchmarks prove that
our method outperforms existing unsupervised AD methods under RD paradigm,
fully unlocking RD's potential.
comment: 18 pages, 14 figures, AAAI 2025
☆ Making the Flow Glow -- Robot Perception under Severe Lighting Conditions using Normalizing Flow Gradients
Modern robotic perception is highly dependent on neural networks. It is well
known that neural network-based perception can be unreliable in real-world
deployment, especially in difficult imaging conditions. Out-of-distribution
detection is commonly proposed as a solution for ensuring reliability in
real-world deployment. Previous work has shown that normalizing flow models can
be used for out-of-distribution detection to improve reliability of robotic
perception tasks. Specifically, camera parameters can be optimized with respect
to the likelihood output from a normalizing flow, which allows a perception
system to adapt to difficult vision scenarios. With this work we propose to use
the absolute gradient values from a normalizing flow, which allows the
perception system to optimize local regions rather than the whole image. By
setting up a tabletop picking experiment with exceptionally difficult lighting
conditions, we show that our method achieves a 60% higher success rate for an
object detection task compared to previous methods.
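The local-region idea above can be sketched as follows: score pixels by the absolute gradient of a density model's log-likelihood, then adjust a camera-like exposure gain on the most sensitive region. The standard-normal "flow" and the exposure model are stand-in assumptions, not the paper's trained flow or camera interface.
```python
# Minimal sketch: use the magnitude of a density model's gradient with respect
# to the image to pick a local region, then optimize an exposure gain to raise
# that region's likelihood. The quadratic log-prob stands in for a trained
# normalizing flow; the scalar gain stands in for real camera parameters.
import torch

def flow_log_prob(x):
    # Stand-in for normalizing_flow.log_prob(x).
    return -0.5 * (x ** 2).sum()

raw = torch.rand(1, 3, 32, 32)                 # captured image (fixed)
gain = torch.tensor(1.0, requires_grad=True)   # camera exposure parameter
opt = torch.optim.Adam([gain], lr=0.05)

# Score pixels by the absolute gradient of the log-likelihood.
img = (gain * raw).detach().requires_grad_(True)
flow_log_prob(img).backward()
saliency = img.grad.abs().flatten()
top = saliency.topk(256).indices               # most likelihood-sensitive entries

for _ in range(20):
    opt.zero_grad()
    loss = -flow_log_prob((gain * raw).flatten()[top])  # local region only
    loss.backward()
    opt.step()
print("optimized gain:", gain.item())
```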
☆ ReCap: Better Gaussian Relighting with Cross-Environment Captures
Accurate relighting of 3D objects in diverse unseen environments is crucial for
realistic virtual object placement. Due to the albedo-lighting ambiguity,
existing methods often fall short in producing faithful relights. Without
proper constraints, observed training views can be explained by numerous
combinations of lighting and material attributes, lacking physical
correspondence with the actual environment maps used for relighting. In this
work, we present ReCap, which treats cross-environment captures as a multi-task
target to provide the missing supervision that cuts through the entanglement.
Specifically, ReCap jointly optimizes multiple lighting representations that
share a common set of material attributes. This naturally harmonizes a coherent
set of lighting representations around the mutual material attributes,
exploiting commonalities and differences across varied object appearances. Such
coherence enables physically sound lighting reconstruction and robust material
estimation - both essential for accurate relighting. Together with a
streamlined shading function and effective post-processing, ReCap outperforms
the leading competitor by 3.4 dB in PSNR on an expanded relighting benchmark.
☆ Deep Joint Unrolling for Deblurring and Low-Light Image Enhancement (JUDE)
Low-light and blurring issues are prevalent when capturing photos at night,
often due to the use of long exposure to address dim environments. Addressing
these joint problems can be challenging and error-prone if an end-to-end model
is trained without incorporating an appropriate physical model. In this paper,
we introduce JUDE, a Deep Joint Unrolling for Deblurring and Low-Light Image
Enhancement, inspired by the image physical model. Based on Retinex theory and
the blurring model, the low-light blurry input is iteratively deblurred and
decomposed, producing sharp low-light reflectance and illuminance through an
unrolling mechanism. Additionally, we incorporate various modules to estimate
the initial blur kernel, enhance brightness, and eliminate noise in the final
image. Comprehensive experiments on LOL-Blur and Real-LOL-Blur demonstrate that
our method outperforms existing techniques both quantitatively and
qualitatively.
comment: 10 pages
☆ KneeXNeT: An Ensemble-Based Approach for Knee Radiographic Evaluation
Knee osteoarthritis (OA) is the most common joint disorder and a leading
cause of disability. Diagnosing OA severity typically requires expert
assessment of X-ray images and is commonly based on the Kellgren-Lawrence
grading system, a time-intensive process. This study aimed to develop an
automated deep learning model to classify knee OA severity, reducing the need
for expert evaluation. First, we evaluated ten state-of-the-art deep learning
models, achieving a top accuracy of 0.69 with individual models. To address
class imbalance, we employed weighted sampling, improving accuracy to 0.70. We
further applied Smooth-GradCAM++ to visualize decision-influencing regions,
enhancing the explainability of the best-performing model. Finally, we
developed ensemble models using majority voting and a shallow neural network.
Our ensemble model, KneeXNet, achieved the highest accuracy of 0.72,
demonstrating its potential as an automated tool for knee OA assessment.
comment: 10 pages, 5 figures, accepted by MICAD 2024
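The two ensembling strategies mentioned above can be sketched as follows; model count, class count (KL grades 0-4), and layer sizes are illustrative assumptions.
```python
# Minimal sketch of the two ensembles described above: hard majority voting
# over per-model predictions, and a shallow network stacked on concatenated
# class probabilities. Sizes are illustrative, not KneeXNet's configuration.
import numpy as np
import torch
import torch.nn as nn

n_models, n_samples, n_classes = 5, 8, 5   # e.g. KL grades 0-4

# Hard majority vote over each model's argmax prediction.
probs = np.random.dirichlet(np.ones(n_classes), size=(n_models, n_samples))
votes = probs.argmax(axis=-1)                              # (n_models, n_samples)
majority = np.array([np.bincount(votes[:, i], minlength=n_classes).argmax()
                     for i in range(n_samples)])
print("majority-vote predictions:", majority)

# Shallow network over concatenated probabilities (learned ensemble).
stacker = nn.Sequential(
    nn.Linear(n_models * n_classes, 32),
    nn.ReLU(),
    nn.Linear(32, n_classes),
)
features = torch.tensor(probs.transpose(1, 0, 2).reshape(n_samples, -1),
                        dtype=torch.float32)
print("stacker logits shape:", stacker(features).shape)
```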
☆ Hallucination Elimination and Semantic Enhancement Framework for Vision-Language Models in Traffic Scenarios
Large vision-language models (LVLMs) have demonstrated remarkable
capabilities in multimodal understanding and generation tasks. However, these
models occasionally generate hallucinatory texts, resulting in descriptions
that seem reasonable but do not correspond to the image. This phenomenon can
lead to wrong driving decisions of the autonomous driving system. To address
this challenge, this paper proposes HCOENet, a plug-and-play chain-of-thought
correction method designed to eliminate object hallucinations and generate
enhanced descriptions for critical objects overlooked in the initial response.
Specifically, HCOENet employs a cross-checking mechanism to filter entities and
directly extracts critical objects from the given image, enriching the
descriptive text. Experimental results on the POPE benchmark demonstrate that
HCOENet improves the F1-score of the Mini-InternVL-4B and mPLUG-Owl3 models by
12.58% and 4.28%, respectively. Additionally, qualitative results using images
collected in an open campus scene further highlight the practical applicability of
the proposed method. Compared with the GPT-4o model, HCOENet achieves
comparable descriptive performance while significantly reducing costs. Finally,
two novel semantic understanding datasets, CODA_desc and nuScenes_desc, are
created for traffic scenarios to support future research. The codes and
datasets are publicly available at https://github.com/fjq-tongji/HCOENet.
☆ FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
Though Rectified Flows (ReFlows) with distillation offer a promising way for
fast sampling, their fast inversion, which transforms images back to structured
noise for recovery and subsequent editing, remains unsolved. This paper introduces
FireFlow, a simple yet effective zero-shot approach that inherits the startling
capacity of ReFlow-based models (such as FLUX) in generation while extending
its capabilities to accurate inversion and editing in $8$ steps. We first
demonstrate that a carefully designed numerical solver is pivotal for ReFlow
inversion, enabling accurate inversion and reconstruction with the precision of
a second-order solver while maintaining the practical efficiency of a
first-order Euler method. This solver achieves a $3\times$ runtime speedup
compared to state-of-the-art ReFlow inversion and editing techniques, while
delivering smaller reconstruction errors and superior editing results in a
training-free mode. The code is available at
$\href{https://github.com/HolmesShuan/FireFlow}{this URL}$.
comment: technical report
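For intuition, the sketch below contrasts a first-order Euler step with a second-order midpoint step for a rectified-flow ODE x' = v(x, t). The linear velocity field is a stand-in for the learned ReFlow model, and this is a generic second-order step, not necessarily FireFlow's exact solver.
```python
# Minimal sketch of first- vs. second-order integration for a rectified-flow
# ODE of the kind the abstract refers to. The velocity field is a stand-in for
# the learned model; this is not necessarily FireFlow's solver.
import torch

def velocity(x, t):
    # Stand-in for the learned velocity network v_theta(x, t).
    return -x * (1.0 - t)

def euler_step(x, t, dt):
    return x + dt * velocity(x, t)

def midpoint_step(x, t, dt):
    """Second-order accurate: evaluate the velocity at the interval midpoint."""
    x_mid = x + 0.5 * dt * velocity(x, t)
    return x + dt * velocity(x_mid, t + 0.5 * dt)

x = torch.randn(4, 3, 8, 8)
t, dt = 0.0, 1.0 / 8          # e.g. 8 inversion steps
print((euler_step(x, t, dt) - midpoint_step(x, t, dt)).abs().max())
```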
☆ Stealthy and Robust Backdoor Attack against 3D Point Clouds through Additional Point Features
Recently, 3D backdoor attacks have posed a substantial threat to 3D Deep
Neural Networks (3D DNNs) designed for 3D point clouds, which are extensively
deployed in various security-critical applications. Although the existing 3D
backdoor attacks achieved high attack performance, they remain vulnerable to
preprocessing-based defenses (e.g., outlier removal and rotation augmentation)
and are prone to detection by human inspection. In pursuit of a more
challenging-to-defend and stealthy 3D backdoor attack, this paper introduces
the Stealthy and Robust Backdoor Attack (SRBA), which ensures robustness and
stealthiness through intentional design considerations. The key insight of our
attack involves applying a uniform shift to the additional point features of
point clouds (e.g., reflection intensity) widely utilized as part of inputs for
3D DNNs as the trigger. Without altering the geometric information of the point
clouds, our attack ensures visual consistency between poisoned and benign
samples, and demonstrates robustness against preprocessing-based defenses. In
addition, to automate our attack, we employ Bayesian Optimization (BO) to
identify the suitable trigger. Extensive experiments suggest that SRBA achieves
an attack success rate (ASR) exceeding 94% in all cases, and significantly
outperforms previous SOTA methods when multiple preprocessing operations are
applied during training.
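The trigger idea above amounts to shifting an auxiliary feature channel while leaving geometry untouched; a minimal sketch follows. The shift value and column layout are illustrative assumptions, not the trigger found by the paper's Bayesian Optimization.
```python
# Minimal sketch of the trigger described above: apply a uniform shift to an
# auxiliary point feature (here, the reflectance/intensity channel) while
# leaving xyz coordinates untouched. Shift value and layout are illustrative.
import numpy as np

def apply_intensity_trigger(points, shift=0.05):
    """points: (N, 4) array with columns [x, y, z, intensity]."""
    poisoned = points.copy()
    poisoned[:, 3] = np.clip(poisoned[:, 3] + shift, 0.0, 1.0)
    return poisoned

cloud = np.concatenate([np.random.randn(1024, 3),
                        np.random.rand(1024, 1)], axis=1)
poisoned = apply_intensity_trigger(cloud)
print(np.abs(poisoned[:, :3] - cloud[:, :3]).max(),   # geometry unchanged: 0.0
      np.abs(poisoned[:, 3] - cloud[:, 3]).max())     # only intensity shifted
```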
☆ Enhancing 3D Object Detection in Autonomous Vehicles Based on Synthetic Virtual Environment Analysis
Vladislav Li, Ilias Siniosoglou, Thomai Karamitsou, Anastasios Lytos, Ioannis D. Moscholios, Sotirios K. Goudos, Jyoti S. Banerjee, Panagiotis Sarigiannidi, Vasileios Argyriou
Autonomous Vehicles (AVs) use natural images and videos as input to
understand the real world by overlaying and inferring digital elements,
facilitating proactive detection in an effort to assure safety. A crucial
aspect of this process is real-time, accurate object recognition through
automatic scene analysis. While traditional methods primarily concentrate on 2D
object detection, exploring 3D object detection, which involves projecting 3D
bounding boxes into the three-dimensional environment, holds significance and
can be notably enhanced using the AR ecosystem. This study examines an AI
model's ability to deduce 3D bounding boxes in the context of real-time scene
analysis, evaluating the model's performance and processing time in the virtual
domain before application to AVs. This work also employs a synthetic dataset
that includes artificially generated images mimicking various environmental,
lighting, and spatiotemporal states. The evaluation is oriented toward images
featuring objects in diverse weather conditions, captured with varying camera
settings. These variations pose more challenging detection and recognition
scenarios, and the results show that competitive performance can still be
achieved under most of the tested conditions.
☆ EDGE: Unknown-aware Multi-label Learning by Energy Distribution Gap Expansion AAAI 2025
Multi-label Out-Of-Distribution (OOD) detection aims to discriminate the OOD
samples from the multi-label In-Distribution (ID) ones. Compared with its
multiclass counterpart, it is crucial to model the joint information among
classes. To this end, JointEnergy, which is a representative multi-label OOD
inference criterion, summarizes the logits of all the classes. However, we find
that JointEnergy can produce an imbalance problem in OOD detection, especially
when the model lacks enough discrimination ability. Specifically, we find that
the samples only related to minority classes tend to be classified as OOD
samples due to the ambiguous energy decision boundary. Besides, imbalanced
multi-label learning methods, originally designed for ID data, are not
suitable for OOD detection scenarios and can even produce a serious negative
transfer effect. In this paper, we resort to auxiliary outlier exposure (OE)
and propose an unknown-aware multi-label learning framework to reshape the
uncertainty energy space layout. In this framework, the energy score is
separately optimized for tail ID samples and unknown samples, and the energy
distribution gap between them is expanded, such that the tail ID samples can
have a significantly larger energy score than the OOD ones. What's more, a
simple yet effective measure is designed to select more informative OE
datasets. Finally, comprehensive experimental results on multiple multi-label
and OOD datasets reveal the effectiveness of the proposed method.
comment: 9 pages, 5 figures, accepted by AAAI 2025
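For reference, the JointEnergy criterion that the abstract builds on sums per-class energies log(1 + exp(logit_c)); a minimal sketch follows, using the standard formulation from the multi-label OOD literature rather than the paper's reshaped variant.
```python
# Minimal sketch of the JointEnergy score for multi-label OOD detection:
# sum over classes of log(1 + exp(logit_c)). Higher scores indicate ID-like
# inputs; thresholding separates ID from OOD samples.
import torch
import torch.nn.functional as F

def joint_energy(logits):
    """logits: (B, C) multi-label logits."""
    return F.softplus(logits).sum(dim=-1)     # softplus(z) = log(1 + exp(z))

logits_id = torch.randn(4, 10) + 2.0          # confident multi-label sample
logits_ood = torch.randn(4, 10) - 2.0         # low-confidence sample
print(joint_energy(logits_id), joint_energy(logits_ood))
```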
☆ ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery
Recently, 3D Gaussian Splatting (3D-GS) has prevailed in novel view
synthesis, achieving high fidelity and efficiency. However, it often struggles
to capture rich details and complete geometry. Our analysis highlights a key
limitation of 3D-GS caused by the fixed threshold in densification, which
balances geometry coverage against detail recovery as the threshold varies. To
address this, we introduce a novel densification method, residual split, which
adds a downscaled Gaussian as a residual. Our approach is capable of adaptively
retrieving details and complementing missing geometry while enabling
progressive refinement. To further support this method, we propose a pipeline
named ResGS. Specifically, we integrate a Gaussian image pyramid for
progressive supervision and implement a selection scheme that prioritizes the
densification of coarse Gaussians over time. Extensive experiments demonstrate
that our method achieves SOTA rendering quality. Consistent performance
improvements can be achieved by applying our residual split on various 3D-GS
variants, underscoring its versatility and potential for broader adoption in
3D-GS-based frameworks.
☆ Stereo Hand-Object Reconstruction for Human-to-Robot Handover
Jointly estimating hand and object shape ensures the success of the robot
grasp in human-to-robot handovers. However, relying on hand-crafted prior
knowledge about the geometric structure of the object fails when generalising
to unseen objects, and depth sensors fail to detect transparent objects such as
drinking glasses. In this work, we propose a stereo-based method for
hand-object reconstruction that combines single-view reconstructions
probabilistically to form a coherent stereo reconstruction. We learn 3D shape
priors from a large synthetic hand-object dataset to ensure that our method is
generalisable, and use RGB inputs instead of depth as RGB can better capture
transparent objects. We show that our method achieves a lower object Chamfer
distance compared to existing RGB based hand-object reconstruction methods on
single view and stereo settings. We process the reconstructed hand-object shape
with a projection-based outlier removal step and use the output to guide a
human-to-robot handover pipeline with wide-baseline stereo RGB cameras. Our
hand-object reconstruction enables a robot to successfully receive a diverse
range of household objects from the human.
comment: 8 pages, 9 figures, 1 table
☆ Manta: Enhancing Mamba for Few-Shot Action Recognition of Long Sub-Sequence AAAI 2025
Wenbo Huang, Jinghui Zhang, Guang Li, Lei Zhang, Shuoyuan Wang, Fang Dong, Jiahui Jin, Takahiro Ogawa, Miki Haseyama
In few-shot action recognition~(FSAR), long sub-sequences of video naturally
express entire actions more effectively. However, the computational complexity
of mainstream Transformer-based methods limits their application. Recent Mamba
demonstrates efficiency in modeling long sequences, but directly applying Mamba
to FSAR overlooks the importance of local feature modeling and alignment.
Moreover, long sub-sequences within the same class accumulate intra-class
variance, which adversely impacts FSAR performance. To solve these challenges,
we propose a \underline{\textbf{M}}atryoshka M\underline{\textbf{A}}mba and
Co\underline{\textbf{N}}tras\underline{\textbf{T}}ive
Le\underline{\textbf{A}}rning framework~(\textbf{Manta}). Firstly, the
Matryoshka Mamba introduces multiple Inner Modules to enhance local feature
representation, rather than directly modeling global features. An Outer Module
captures temporal dependencies between these local features for implicit
temporal alignment. Secondly, a hybrid contrastive learning paradigm, combining
both supervised and unsupervised methods, is designed to mitigate the negative
effects of intra-class variance accumulation. The Matryoshka Mamba and the
hybrid contrastive learning paradigm operate in parallel branches within Manta,
enhancing Mamba for FSAR of long sub-sequences. Manta achieves new
state-of-the-art performance on prominent benchmarks, including SSv2, Kinetics,
UCF101, and HMDB51. Extensive empirical studies prove that Manta significantly
improves FSAR of long sub-sequences from multiple perspectives. The code is
released at https://github.com/wenbohuang1002/Manta.
comment: Accepted by AAAI 2025
☆ BENet: A Cross-domain Robust Network for Detecting Face Forgeries via Bias Expansion and Latent-space Attention
Weihua Liu, Jianhua Qiu, Said Boumaraf, Chaochao lin, Pan liyuan, Lin Li, Mohammed Bennamoun, Naoufel Werghi
In response to the growing threat of deepfake technology, we introduce BENet,
a Cross-Domain Robust Bias Expansion Network. BENet enhances the detection of
fake faces by addressing limitations in current detectors related to variations
across different types of fake face generation techniques, where
``cross-domain" refers to the diverse range of these deepfakes, each considered
a separate domain. BENet's core feature is a bias expansion module based on
autoencoders. This module maintains genuine facial features while enhancing
differences in fake reconstructions, creating a reliable bias for detecting
fake faces across various deepfake domains. We also introduce a Latent-Space
Attention (LSA) module to capture inconsistencies related to fake faces at
different scales, ensuring robust defense against advanced deepfake techniques.
The enriched LSA feature maps are multiplied with the expanded bias to create a
versatile feature space optimized for subtle forgery detection. To improve
its ability to detect fake faces from unknown sources, BENet integrates a
cross-domain detector module that enhances recognition accuracy by verifying
the facial domain during inference. We train our network end-to-end with a
novel bias expansion loss, adopted for the first time in face forgery
detection. Extensive experiments covering both intra- and cross-dataset
settings demonstrate BENet's superiority over current state-of-the-art solutions.
☆ DSFEC: Efficient and Deployable Deep Radar Object Detection
Deploying radar object detection models on resource-constrained edge devices
like the Raspberry Pi poses significant challenges due to the large model size
and the Pi's limited computational power and memory. In this
work, we explore the efficiency of Depthwise Separable Convolutions in radar
object detection networks and integrate them into our model. Additionally, we
introduce a novel Feature Enhancement and Compression (FEC) module to the
PointPillars feature encoder to further improve the model performance. With
these innovations, we propose the DSFEC-L model and its two versions, which
outperform the baseline (23.9 mAP on the Car class, 20.72 GFLOPs) on the
nuScenes dataset: (1) an efficient DSFEC-M model with a 14.6% performance
improvement and a 60% reduction in GFLOPs, and (2) a deployable DSFEC-S model
with a 3.76% performance improvement and a remarkable 78.5% reduction in
GFLOPs. Despite
marginal performance gains, our deployable model achieves an impressive 74.5%
reduction in runtime on the Raspberry Pi compared to the baseline.
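A depthwise separable convolution block of the kind swapped into the backbone above can be sketched as follows; channel sizes are illustrative assumptions, not the DSFEC architecture.
```python
# Minimal sketch of a depthwise separable convolution: a per-channel 3x3
# (depthwise) convolution followed by a 1x1 pointwise convolution, which cuts
# parameters and FLOPs relative to a standard convolution.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch,
                                   bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 64, 128, 128)
block = DepthwiseSeparableConv(64, 128)
standard = nn.Conv2d(64, 128, 3, padding=1, bias=False)
print(block(x).shape)
print("separable params:", sum(p.numel() for p in block.parameters()),
      "vs standard conv:", sum(p.numel() for p in standard.parameters()))
```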
☆ Explainability of Deep Learning-Based Plant Disease Classifiers Through Automated Concept Identification
While deep learning has significantly advanced automatic plant disease
detection through image-based classification, improving model explainability
remains crucial for reliable disease detection. In this study, we apply the
Automated Concept-based Explanation (ACE) method to plant disease
classification using the widely adopted InceptionV3 model and the PlantVillage
dataset. ACE automatically identifies the visual concepts found in the image
data and provides insights about the critical features influencing the model
predictions. This approach reveals both effective disease-related patterns and
incidental biases, such as those from background or lighting that can
compromise model robustness. Through systematic experiments, ACE helped us to
identify relevant features and pinpoint areas for targeted model improvement.
Our findings demonstrate the potential of ACE to improve the explainability of
plant disease classification based on deep learning, which is essential for
producing transparent tools for plant disease management in agriculture.
☆ Learning Self-Supervised Audio-Visual Representations for Sound Recommendations
We propose a novel self-supervised approach for learning audio and visual
representations from unlabeled videos, based on their correspondence. The
approach uses an attention mechanism to learn the relative importance of
convolutional features extracted at different resolutions from the audio and
visual streams and uses the attention features to encode the audio and visual
input based on their correspondence. We evaluated the representations learned
by the model to classify audio-visual correlation as well as to recommend sound
effects for visual scenes. Our results show that the representations generated
by the attention model improves the correlation accuracy compared to the
baseline, by 18% and the recommendation accuracy by 10% for VGG-Sound, which is
a public video dataset. Additionally, audio-visual representations learned by
training the attention model with cross-modal contrastive learning further
improves the recommendation performance, based on our evaluation using
VGG-Sound and a more challenging dataset consisting of gameplay video
recordings.
comment: Published in the Proceedings of the International Symposium on Visual
Computing, 2021 https://dl.acm.org/doi/10.1007/978-3-030-90436-4_10
☆ Benchmarking Vision-Based Object Tracking for USVs in Complex Maritime Environments IEEE
Vision-based target tracking is crucial for unmanned surface vehicles (USVs)
to perform tasks such as inspection, monitoring, and surveillance. However,
real-time tracking in complex maritime environments is challenging due to
dynamic camera movement, low visibility, and scale variation. Typically, object
detection methods combined with filtering techniques are commonly used for
tracking, but they often lack robustness, particularly in the presence of
camera motion and missed detections. Although advanced tracking methods have
been proposed recently, their application in maritime scenarios is limited. To
address this gap, this study proposes a vision-guided object-tracking framework
for USVs, integrating state-of-the-art tracking algorithms with low-level
control systems to enable precise tracking in dynamic maritime environments. We
benchmarked the performance of seven distinct trackers, developed using
advanced deep learning techniques such as Siamese Networks and Transformers, by
evaluating them on both simulated and real-world maritime datasets. In
addition, we evaluated the robustness of various control algorithms in
conjunction with these tracking systems. The proposed framework was validated
through simulations and real-world sea experiments, demonstrating its
effectiveness in handling dynamic maritime conditions. The results show that
SeqTrack, a Transformer-based tracker, performed best in adverse conditions,
such as dust storms. Among the control algorithms evaluated, the linear
quadratic regulator controller (LQR) demonstrated the most robust and smooth
control, allowing for stable tracking of the USV.
comment: submitted to IEEE Access
☆ Post-Training Non-Uniform Quantization for Convolutional Neural Networks
Despite the success of CNN models on a variety of image classification and
segmentation tasks, their extensive computational and storage demands pose
considerable challenges for real-world deployment on resource constrained
devices. Quantization is one technique that aims to alleviate these large
storage requirements and speed up the inference process by reducing the
precision of model parameters to lower-bit representations. In this paper, we
introduce a novel post-training quantization method for model weights. Our
method finds optimal clipping thresholds and scaling factors along with
mathematical guarantees that our method minimizes quantization noise. Empirical
results on real-world datasets demonstrate that our quantization scheme
significantly reduces model size and computational requirements while
preserving model accuracy.
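To make "optimal clipping thresholds and scaling factors" concrete, the sketch below sweeps symmetric clipping thresholds for a weight tensor and keeps the one minimizing quantization MSE. The paper derives its thresholds analytically with guarantees; this grid search only illustrates the objective being minimized.
```python
# Minimal sketch of clipping-threshold search for post-training quantization:
# uniform symmetric quantization with a swept clipping value, scored by MSE.
import numpy as np

def quantize(w, clip, n_bits=4):
    """Uniform symmetric quantization of w clipped to [-clip, clip]."""
    n_levels = 2 ** (n_bits - 1) - 1
    scale = clip / n_levels
    q = np.clip(np.round(w / scale), -n_levels, n_levels)
    return q * scale

def best_clip(w, n_bits=4, n_candidates=50):
    max_abs = np.abs(w).max()
    candidates = np.linspace(0.1 * max_abs, max_abs, n_candidates)
    errors = [np.mean((w - quantize(w, c, n_bits)) ** 2) for c in candidates]
    return candidates[int(np.argmin(errors))], min(errors)

weights = np.random.randn(4096) * 0.05        # toy weight tensor
clip, mse = best_clip(weights)
print(f"best clipping threshold: {clip:.4f}, quantization MSE: {mse:.2e}")
```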
☆ Enhanced MRI Representation via Cross-series Masking
Magnetic resonance imaging (MRI) is indispensable for diagnosing and planning
treatment in various medical conditions due to its ability to produce
multi-series images that reveal different tissue characteristics. However,
integrating these diverse series to form a coherent analysis presents
significant challenges, such as differing spatial resolutions and contrast
patterns, while requiring extensive annotated data, which is scarce in
clinical practice. To address these issues, we introduce a novel Cross-Series
Masking (CSM) Strategy for effectively learning MRI representation in a
self-supervised manner. Specifically, CSM commences by randomly sampling a
subset of regions and series, which are then strategically masked. In the
training process, the cross-series representation is learned by utilizing the
unmasked data to reconstruct the masked portions. This process not only
integrates information across different series but also facilitates the ability
to model both intra-series and inter-series correlations and complementarities.
With the learned representation, the downstream tasks like segmentation and
classification are also enhanced. Taking brain tissue segmentation, breast
tumor benign/malignant classification, and prostate cancer diagnosis as
examples, our method achieves state-of-the-art performance on both public and
in-house datasets.
☆ LOGen: Toward Lidar Object Generation by Point Diffusion
A common strategy to improve lidar segmentation results on rare semantic
classes consists of pasting objects from one lidar scene into another. While
this augments the quantity of instances seen at training time and varies their
context, the instances fundamentally remain the same. In this work, we explore
how to enhance instance diversity using a lidar object generator. We introduce
a novel diffusion-based method to produce lidar point clouds of dataset
objects, including reflectance, and with an extensive control of the generation
via conditioning information. Our experiments on nuScenes show the quality of
our object generations measured with new 3D metrics developed to suit lidar
objects.
comment: Project web page: https://nerminsamet.github.io/logen/
☆ Label up: Learning Pulmonary Embolism Segmentation from Image Level Annotation through Model Explainability
Pulmonary Embolisms (PE) are a leading cause of cardiovascular death.
Computed tomographic pulmonary angiography (CTPA) stands as the gold standard
for diagnosing pulmonary embolisms (PE) and there has been a lot of interest in
developing AI-based models for assisting in PE diagnosis. Performance of these
algorithms has been hindered by the scarcity of annotated data, especially
those with fine-grained delineation of the thromboembolic burden. In this paper
we attempt to address this issue by introducing a weakly supervised learning
pipeline, that leverages model explainability to generate fine-grained (pixel
level) masks for embolisms starting from more coarse-grained (binary, image
level) PE annotations. Furthermore, we show that training models using the
automatically generated pixel annotations yields good PE localization
performance. We demonstrate the effectiveness of our pipeline on the
large-scale, multi-center RSPECT augmented dataset for PE detection and
localization.
☆ CADSpotting: Robust Panoptic Symbol Spotting on Large-Scale CAD Drawings
Jiazuo Mu, Fuyi Yang, Yanshun Zhang, Junxiong Zhang, Yongjian Luo, Lan Xu, Yujiao Shi, Jingyi Yu, Yingliang Zhang
We introduce CADSpotting, an efficient method for panoptic symbol spotting in
large-scale architectural CAD drawings. Existing approaches struggle with the
diversity of symbols, scale variations, and overlapping elements in CAD
designs. CADSpotting overcomes these challenges by representing each primitive
with dense points instead of a single primitive point, described by essential
attributes like coordinates and color. Building upon a unified 3D point cloud
model for joint semantic, instance, and panoptic segmentation, CADSpotting
learns robust feature representations. To enable accurate segmentation in
large, complex drawings, we further propose a novel Sliding Window Aggregation
(SWA) technique, combining weighted voting and Non-Maximum Suppression (NMS).
Moreover, we introduce a large-scale CAD dataset named LS-CAD to support our
experiments. Each floorplan in LS-CAD has an average coverage of 1,000 square
meters (versus 100 square meters in existing datasets), providing a valuable
benchmark for symbol spotting research. Experimental results on FloorPlanCAD
and LS-CAD datasets demonstrate that CADSpotting outperforms existing methods,
showcasing its robustness and scalability for real-world CAD applications.
☆ StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization
Story visualization has gained increasing attention in artificial
intelligence. However, existing methods still struggle with maintaining a
balance between character identity preservation and text-semantics alignment,
largely due to a lack of detailed semantic modeling of the story scene. To
tackle this challenge, we propose a novel knowledge graph, namely Character
Graph (\textbf{CG}), which comprehensively represents various story-related
knowledge, including the characters, the attributes related to characters, and
the relationship between characters. We then introduce StoryWeaver, an image
generator that achieves Customization via Character Graph (\textbf{C-CG}),
capable of consistent story visualization with rich text semantics. To further
improve the multi-character generation performance, we incorporate
knowledge-enhanced spatial guidance (\textbf{KE-SG}) into StoryWeaver to
precisely inject character semantics into generation. To validate the
effectiveness of our proposed method, extensive experiments are conducted using
a new benchmark called TBC-Bench. The experiments confirm that our StoryWeaver
excels not only in creating vivid visual story plots but also in accurately
conveying character identities across various scenarios with considerable
storage efficiency, \emph{e.g.}, achieving an average increase of +9.03\%
DINO-I and +13.44\% CLIP-T. Furthermore, ablation experiments are conducted to
verify the superiority of the proposed module. Codes and datasets are released
at https://github.com/Aria-Zhangjl/StoryWeaver.
☆ PRM: Photometric Stereo based Large Reconstruction Model
We propose PRM, a novel photometric stereo based large reconstruction model
to reconstruct high-quality meshes with fine-grained local details. Unlike
previous large reconstruction models that prepare images under fixed and simple
lighting as both input and supervision, PRM renders photometric stereo images
by varying materials and lighting for this purpose, which not only improves the
precise local details by providing rich photometric cues but also increases the
model's robustness to variations in the appearance of input images. To offer
enhanced flexibility in image rendering, we incorporate a real-time
physically-based rendering (PBR) method and mesh rasterization for online
image rendering. Moreover, by employing an explicit mesh as our 3D
representation, PRM ensures the application of differentiable PBR, which
supports the utilization of multiple photometric supervisions and better models
the specular color for high-quality geometry optimization. Our PRM leverages
photometric stereo images to achieve high-quality reconstructions with
fine-grained local details, even amidst sophisticated image appearances.
Extensive experiments demonstrate that PRM significantly outperforms other
models.
comment: https://wenhangge.github.io/PRM/
☆ ITPNet: Towards Instantaneous Trajectory Prediction for Autonomous Driving
Trajectory prediction of agents is crucial for the safety of autonomous
vehicles, whereas previous approaches usually rely on sufficiently
long-observed trajectory to predict the future trajectory of the agents.
However, in real-world scenarios, it is not realistic to collect adequate
observed locations for moving agents, leading to the collapse of most
prediction models. For instance, when a moving car suddenly appears and is very
close to an autonomous vehicle because of the obstruction, it is quite
necessary for the autonomous vehicle to quickly and accurately predict the
future trajectories of the car with limited observed trajectory locations. In
light of this, we focus on investigating the task of instantaneous trajectory
prediction, i.e., two observed locations are available during inference. To
this end, we propose a general and plug-and-play instantaneous trajectory
prediction approach, called ITPNet. Specifically, we propose a backward
forecasting mechanism to reversely predict the latent feature representations
of unobserved historical trajectories of the agent based on its two observed
locations and then leverage them as complementary information for future
trajectory prediction. Meanwhile, due to the inevitable existence of noise and
redundancy in the predicted latent feature representations, we further devise a
Noise Redundancy Reduction Former, aiming to filter out noise and redundancy
from unobserved trajectories and integrate the filtered features and observed
features into a compact query for future trajectory predictions. In essence,
ITPNet can be naturally compatible with existing trajectory prediction models,
enabling them to gracefully handle the case of instantaneous trajectory
prediction. Extensive experiments on the Argoverse and nuScenes datasets
demonstrate that ITPNet outperforms the baselines and show its efficacy with
different trajectory prediction models.
☆ Efficient 3D Recognition with Event-driven Spike Sparse Convolution AAAI 2025
Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D
spatio-temporal features. Point clouds are sparse 3D spatial data, which
suggests that SNNs should be well-suited for processing them. However, when
applying SNNs to point clouds, they often exhibit limited performance and fewer
application scenarios. We attribute this to inappropriate preprocessing and
feature extraction methods. To address this issue, we first introduce the Spike
Voxel Coding (SVC) scheme, which encodes the 3D point clouds into a sparse
spike train space, reducing the storage requirements and saving time on point
cloud preprocessing. Then, we propose a Spike Sparse Convolution (SSC) model
for efficiently extracting 3D sparse point cloud features. Combining SVC and
SSC, we design an efficient 3D SNN backbone (E-3DSNN), which is friendly to
neuromorphic hardware. For instance, SSC can be implemented on neuromorphic
chips with only minor modifications to the addressing function of vanilla spike
convolution. Experiments on ModelNet40, KITTI, and Semantic KITTI datasets
demonstrate that E-3DSNN achieves state-of-the-art (SOTA) results with
remarkable efficiency. Notably, our E-3DSNN (1.87M) obtained 91.7\% top-1
accuracy on ModelNet40, surpassing the current best SNN baselines (14.3M) by
3.0\%. To our best knowledge, it is the first direct training 3D SNN backbone
that can simultaneously handle various 3D computer vision tasks (e.g.,
classification, detection, and segmentation) with an event-driven nature. Code
is available: https://github.com/bollossom/E-3DSNN/.
comment: Accepted by AAAI 2025
☆ Fusion Embedding for Pose-Guided Person Image Synthesis with Diffusion Model
Pose-Guided Person Image Synthesis (PGPIS) aims to synthesize high-quality
person images corresponding to target poses while preserving the appearance of
the source image. Recently, PGPIS methods that use diffusion models have
achieved competitive performance. Most approaches involve extracting
representations of the target pose and source image and learning their
relationships in the generative model's training process. This approach makes
it difficult to learn the semantic relationships between the input and target
images and complicates the model structure needed to enhance generation
results. To address these issues, we propose Fusion embedding for PGPIS using a
Diffusion Model (FPDM). Inspired by the successful application of pre-trained
CLIP models in text-to-image diffusion models, our method consists of two
stages. The first stage involves training the fusion embedding of the source
image and target pose to align with the target image's embedding. In the second
stage, the generative model uses this fusion embedding as a condition to
generate the target image. We applied the proposed method to the benchmark
datasets DeepFashion and RWTH-PHOENIX-Weather 2014T, and conducted both
quantitative and qualitative evaluations, demonstrating state-of-the-art (SOTA)
performance. An ablation study of the model structure showed that even a model
using only the second stage achieved performance close to the other PGPIS SOTA
models. The code is available at https://github.com/dhlee-work/FPDM.
☆ CoMA: Compositional Human Motion Generation with Multi-modal Agents
Shanlin Sun, Gabriel De Araujo, Jiaqi Xu, Shenghan Zhou, Hanwen Zhang, Ziheng Huang, Chenyu You, Xiaohui Xie
3D human motion generation has seen substantial advancement in recent years.
While state-of-the-art approaches have improved performance significantly, they
still struggle with complex and detailed motions unseen in training data,
largely due to the scarcity of motion datasets and the prohibitive cost of
generating new training examples. To address these challenges, we introduce
CoMA, an agent-based solution for complex human motion generation, editing, and
comprehension. CoMA leverages multiple collaborative agents powered by large
language and vision models, alongside a mask transformer-based motion generator
featuring body part-specific encoders and codebooks for fine-grained control.
Our framework enables generation of both short and long motion sequences with
detailed instructions, text-guided motion editing, and self-correction for
improved quality. Evaluations on the HumanML3D dataset demonstrate competitive
performance against state-of-the-art methods. Additionally, we create a set of
context-rich, compositional, and long text prompts, where user studies show our
method significantly outperforms existing approaches.
comment: Project Page: https://gabrie-l.github.io/coma-page/
☆ FaceX: Understanding Face Attribute Classifiers through Summary Model Explanations
EXplainable Artificial Intelligence (XAI) approaches are widely applied for
identifying fairness issues in Artificial Intelligence (AI) systems. However,
in the context of facial analysis, existing XAI approaches, such as pixel
attribution methods, explain individual images only. Assessing the overall
behavior of a model therefore requires labor-intensive manual inspection of a
very large number of instances, leaving the human to form a general impression
of the model's behavior from individual outputs. Addressing this limitation, we introduce
FaceX, the first method that provides a comprehensive understanding of face
attribute classifiers through summary model explanations. Specifically, FaceX
leverages the presence of distinct regions across all facial images to compute
a region-level aggregation of model activations, allowing for the visualization
of the model's region attribution across 19 predefined regions of interest in
facial images, such as hair, ears, or skin. Beyond spatial explanations, FaceX
enhances interpretability by visualizing specific image patches with the
highest impact on the model's decisions for each facial region within a test
benchmark. Through extensive evaluation in various experimental setups,
including scenarios with or without intentional biases and mitigation efforts
on four benchmarks, namely CelebA, FairFace, CelebAMask-HQ, and Racial Faces in
the Wild, FaceX demonstrates high effectiveness in identifying the models'
biases.
☆ Compression of Large-Scale 3D Point Clouds Based on Joint Optimization of Point Sampling and Feature Extraction
Large-scale 3D point clouds (LS3DPC) obtained by LiDAR scanners require huge
storage space and transmission bandwidth due to a large amount of data. The
existing methods of LS3DPC compression separately perform rule-based point
sampling and learnable feature extraction, and hence achieve limited
compression performance. In this paper, we propose a fully end-to-end training
framework for LS3DPC compression where the point sampling and the feature
extraction are jointly optimized in terms of the rate and distortion losses. To
this end, we first make the point sampling module trainable such that an
optimal position of the downsampled point is estimated via aggregation with
learnable weights. We also develop a reliable point reconstruction scheme that
adaptively aggregates the expanded candidate points to refine the positions of
upsampled points. Experimental results evaluated on the SemanticKITTI and
nuScenes datasets show that the proposed method achieves significantly higher
compression ratios compared with the existing state-of-the-art methods.
comment: 10 pages, 10 figures, 1 table
☆ EventSplat: 3D Gaussian Splatting from Moving Event Cameras for Real-time Rendering
We introduce a method for using event camera data in novel view synthesis via
Gaussian Splatting. Event cameras offer exceptional temporal resolution and a
high dynamic range. Leveraging these capabilities allows us to effectively
address the novel view synthesis challenge in the presence of fast camera
motion. For initialization of the optimization process, our approach uses prior
knowledge encoded in an event-to-video model. We also use spline interpolation
for obtaining high quality poses along the event camera trajectory. This
enhances the reconstruction quality from fast-moving cameras while overcoming
the computational limitations traditionally associated with event-based Neural
Radiance Field (NeRF) methods. Our experimental evaluation demonstrates that
our results achieve higher visual fidelity and better performance than existing
event-based NeRF approaches while being an order of magnitude faster to render.
☆ Image Classification Using Singular Value Decomposition and Optimization
This study investigates the applicability of Singular Value Decomposition for
the image classification of specific breeds of cats and dogs using fur color as
the primary identifying feature. Sequential Quadratic Programming (SQP) is
employed to construct optimally weighted templates. The proposed method
achieves 69% accuracy using the Frobenius norm at rank 10. The results
partially validate the assumption that dominant features, such as fur color,
can be effectively captured through low-rank approximations. However, the
accuracy suggests that additional features or methods may be required for more
robust classification, highlighting the trade-off between simplicity and
performance in resource-constrained environments.
comment: 10 pages, 7 figures
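The rank-k SVD approximation underlying the study above can be sketched in a few lines; image size, rank, and the nearest-template comparison are illustrative assumptions.
```python
# Minimal sketch of rank-k SVD approximation and Frobenius-norm comparison
# against class templates, of the kind the study above relies on.
import numpy as np

def rank_k_approx(image, k=10):
    u, s, vt = np.linalg.svd(image, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

image = np.random.rand(64, 64)                # stand-in for a pet image
approx = rank_k_approx(image, k=10)
templates = {"breed_a": np.random.rand(64, 64),
             "breed_b": np.random.rand(64, 64)}
scores = {name: np.linalg.norm(approx - t, ord="fro")
          for name, t in templates.items()}
print("rank-10 reconstruction error:",
      np.linalg.norm(image - approx, ord="fro"))
print("nearest template:", min(scores, key=scores.get))
```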
☆ Backdoor Attacks against No-Reference Image Quality Assessment Models via A Scalable Trigger AAAI 2025
No-Reference Image Quality Assessment (NR-IQA), responsible for assessing the
quality of a single input image without using any reference, plays a critical
role in evaluating and optimizing computer vision systems, e.g., low-light
enhancement. Recent research indicates that NR-IQA models are susceptible to
adversarial attacks, which can significantly alter predicted scores with
visually imperceptible perturbations. Despite revealing vulnerabilities, these
attack methods have limitations, including high computational demands,
untargeted manipulation, limited practical utility in white-box scenarios, and
reduced effectiveness in black-box scenarios. To address these challenges, we
shift our focus to another significant threat and present a novel
poisoning-based backdoor attack against NR-IQA (BAIQA), allowing the attacker
to manipulate the IQA model's output to any desired target value by simply
adjusting a scaling coefficient $\alpha$ for the trigger. We propose to inject
the trigger in the discrete cosine transform (DCT) domain to improve the local
invariance of the trigger for countering trigger diminishment in NR-IQA models
due to widely adopted data augmentations. Furthermore, the universal
adversarial perturbations (UAP) in the DCT space are designed as the trigger,
to increase IQA model susceptibility to manipulation and improve attack
effectiveness. In addition to the heuristic method for poison-label BAIQA
(P-BAIQA), we explore the design of clean-label BAIQA (C-BAIQA), focusing on
$\alpha$ sampling and image data refinement, driven by theoretical insights we
reveal. Extensive experiments on diverse datasets and various NR-IQA models
demonstrate the effectiveness of our attacks. Code will be released at
https://github.com/yuyi-sd/BAIQA.
comment: Accept by AAAI 2025
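The DCT-domain injection with a scaling coefficient alpha can be sketched as follows; the random perturbation below stands in for the learned UAP trigger, and the perturbed block location is an illustrative assumption.
```python
# Minimal sketch of DCT-domain trigger injection: a fixed perturbation is added
# in the DCT space of the image and scaled by a coefficient alpha. The random
# trigger stands in for the learned UAP trigger from the paper.
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
trigger = rng.standard_normal((32, 32))       # stand-in for the UAP trigger

def poison(image, alpha=0.05):
    """image: (H, W) grayscale in [0, 1]; returns the poisoned image."""
    coeffs = dctn(image, norm="ortho")
    coeffs[:32, :32] += alpha * trigger       # perturb a low-frequency block
    return np.clip(idctn(coeffs, norm="ortho"), 0.0, 1.0)

clean = rng.random((224, 224))
poisoned = poison(clean, alpha=0.05)
print("max pixel change:", np.abs(poisoned - clean).max())
```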
☆ A Generative Victim Model for Segmentation
We find that the well-trained victim models (VMs), against which the attacks
are generated, serve as fundamental prerequisites for adversarial attacks, i.e.
a segmentation VM is needed to generate attacks for segmentation. In this
context, the victim model is assumed to be robust to achieve effective
adversarial perturbation generation. Instead of focusing on improving the
robustness of the task-specific victim models, we shift our attention to image
generation. From an image generation perspective, we derive a novel VM for
segmentation, aiming to generate adversarial perturbations for segmentation
tasks without requiring models explicitly designed for image segmentation. Our
approach to adversarial attack generation diverges from conventional white-box
or black-box attacks, offering a fresh outlook on adversarial attack
strategies. Experiments show that our attack method is able to generate
effective adversarial attacks with good transferability.
☆ QuantFormer: Learning to Quantize for Neural Activity Forecasting in Mouse Visual Cortex
Salvatore Calcagno, Isaak Kavasidis, Simone Palazzo, Marco Brondi, Luca Sità, Giacomo Turri, Daniela Giordano, Vladimir R. Kostic, Tommaso Fellin, Massimiliano Pontil, Concetto Spampinato
Understanding complex animal behaviors hinges on deciphering the neural
activity patterns within brain circuits, making the ability to forecast neural
activity crucial for developing predictive models of brain dynamics. This
capability holds immense value for neuroscience, particularly in applications
such as real-time optogenetic interventions. While traditional encoding and
decoding methods have been used to map external variables to neural activity
and vice versa, they focus on interpreting past data. In contrast, neural
forecasting aims to predict future neural activity, presenting a unique and
challenging task due to the spatiotemporal sparsity and complex dependencies of
neural signals. Existing transformer-based forecasting methods, while effective
in many domains, struggle to capture the distinctiveness of neural signals
characterized by spatiotemporal sparsity and intricate dependencies. To address
this challenge, we here introduce QuantFormer, a transformer-based model
specifically designed for forecasting neural activity from two-photon calcium
imaging data. Unlike conventional regression-based approaches,
QuantFormer reframes the forecasting task as a classification problem via
dynamic signal quantization, enabling more effective learning of sparse neural
activation patterns. Additionally, QuantFormer tackles the challenge of
analyzing multivariate signals from an arbitrary number of neurons by
incorporating neuron-specific tokens, allowing scalability across diverse
neuronal populations. Trained with unsupervised quantization on the Allen
dataset, QuantFormer sets a new benchmark in forecasting mouse visual cortex
activity. It demonstrates robust performance and generalization across various
stimuli and individuals, paving the way for a foundational model in neural
signal prediction.
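A minimal sketch of the quantization idea, assuming simple per-trace quantile bins
(QuantFormer's learned, dynamic quantizer is not reproduced): a continuous calcium
trace becomes a sequence of discrete tokens, so forecasting can be trained with
cross-entropy instead of regression.

# Sketch of recasting neural-activity forecasting as classification via quantization.
import numpy as np

def quantize_signal(trace, n_bins=64):
    """Map a continuous calcium trace to discrete tokens using per-trace quantile bins."""
    edges = np.quantile(trace, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    tokens = np.digitize(trace, edges)          # integers in [0, n_bins - 1]
    return tokens, edges

def dequantize(tokens, edges):
    """Recover approximate values as bin midpoints (coarse inverse for evaluation)."""
    centers = np.concatenate([[edges[0]], (edges[:-1] + edges[1:]) / 2, [edges[-1]]])
    return centers[tokens]

# A forecaster can then predict token ids for future time steps with a cross-entropy loss.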
☆ Deep Lidar-guided Image Deblurring
The rise of portable Lidar instruments, including their adoption in
smartphones, opens the door to novel computational imaging techniques. Being an
active sensing instrument, Lidar can provide complementary data to passive
optical sensors, particularly in situations like low-light imaging where motion
blur can affect photos. In this paper, we study if the depth information
provided by mobile Lidar sensors is useful for the task of image deblurring and
how to integrate it with a general approach that transforms any
state-of-the-art neural deblurring model into a depth-aware one. To achieve
this, we developed a universal adapter structure that efficiently preprocesses
the depth information to modulate image features with depth features.
Additionally, we applied a continual learning strategy to pretrained
encoder-decoder models, enabling them to incorporate depth information as an
additional input with minimal extra data requirements. We demonstrate that
utilizing true depth information can significantly boost the effectiveness of
deblurring algorithms, as validated on a dataset with real-world depth data
captured by a smartphone Lidar.
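A minimal sketch of one way such a depth-aware adapter could modulate image features
with depth features; the concrete architecture below is an assumption, not the paper's
design.

# Sketch of a depth adapter that predicts per-channel scale and shift from a depth map.
import torch
import torch.nn as nn

class DepthAdapter(nn.Module):
    """Modulate deblurring features with parameters predicted from Lidar depth."""
    def __init__(self, feat_channels, hidden=32):
        super().__init__()
        self.depth_net = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2 * feat_channels, 3, padding=1),
        )

    def forward(self, feats, depth):
        # Resize depth to the feature resolution, then predict modulation parameters.
        depth = nn.functional.interpolate(depth, size=feats.shape[-2:],
                                          mode="bilinear", align_corners=False)
        gamma, beta = self.depth_net(depth).chunk(2, dim=1)
        return feats * (1 + gamma) + beta

# Usage sketch: feats = backbone(blurry); feats = adapter(feats, lidar_depth); out = decoder(feats)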
☆ DFREC: DeepFake Identity Recovery Based on Identity-aware Masked Autoencoder
Recent advances in deepfake forensics have primarily focused on improving the
classification accuracy and generalization performance. Despite enormous
progress in detection accuracy across a wide variety of forgery algorithms,
existing algorithms lack intuitive interpretability and identity traceability
to help with forensic investigation. In this paper, we introduce a novel
DeepFake Identity Recovery scheme (DFREC) to fill this gap. DFREC aims to
recover the pair of source and target faces from a deepfake image to facilitate
deepfake identity tracing and reduce the risk of deepfake attack. It comprises
three key components: an Identity Segmentation Module (ISM), a Source Identity
Reconstruction Module (SIRM), and a Target Identity Reconstruction Module
(TIRM). The ISM segments the input face into distinct source and target face
information, and the SIRM reconstructs the source face and extracts latent
target identity features with the segmented source information. The background
context and latent target identity features are synergetically fused by a
Masked Autoencoder in the TIRM to reconstruct the target face. We evaluate
DFREC on six different high-fidelity face-swapping attacks on FaceForensics++,
CelebaMegaFS and FFHQ-E4S datasets, which demonstrate its superior recovery
performance over state-of-the-art deepfake recovery algorithms. In addition,
DFREC is the only scheme that can recover both pristine source and target faces
directly from the forgery image with high fidelity.
☆ Modeling Dual-Exposure Quad-Bayer Patterns for Joint Denoising and Deblurring IEEE
Image degradation caused by noise and blur remains a persistent challenge in
imaging systems, stemming from limitations in both hardware and methodology.
Single-image solutions face an inherent tradeoff between noise reduction and
motion blur. While short exposures can capture clear motion, they suffer from
noise amplification. Long exposures reduce noise but introduce blur.
Learning-based single-image enhancers tend to be over-smooth due to the limited
information. Multi-image solutions using burst mode avoid this tradeoff by
capturing more spatial-temporal information but often struggle with
misalignment from camera/scene motion. To address these limitations, we propose
a physical-model-based image restoration approach leveraging a novel
dual-exposure Quad-Bayer pattern sensor. By capturing pairs of short and long
exposures at the same starting point but with varying durations, this method
integrates complementary noise-blur information within a single image. We
further introduce a Quad-Bayer synthesis method (B2QB) to simulate sensor data
from Bayer patterns to facilitate training. Based on this dual-exposure sensor
model, we design a hierarchical convolutional neural network called QRNet to
recover high-quality RGB images. The network incorporates input enhancement
blocks and multi-level feature extraction to improve restoration quality.
Experiments demonstrate superior performance over state-of-the-art deblurring
and denoising methods on both synthetic and real-world datasets. The code,
model, and datasets are publicly available at
https://github.com/zhaoyuzhi/QRNet.
comment: accepted by IEEE Transactions on Image Processing (TIP)
☆ CapGen: An Environment-Adaptive Generator of Adversarial Patches
Adversarial patches, often used to provide physical stealth protection for
critical assets and assess perception algorithm robustness, usually neglect the
need for visual harmony with the background environment, making them easily
noticeable. Moreover, existing methods primarily concentrate on improving
attack performance, disregarding the intricate dynamics of adversarial patch
elements. In this work, we introduce the Camouflaged Adversarial Pattern
Generator (CAPGen), a novel approach that leverages specific base colors from
the surrounding environment to produce patches that seamlessly blend with their
background for superior visual stealthiness while maintaining robust
adversarial performance. We delve into the influence of both patterns (i.e.,
color-agnostic texture information) and colors on the effectiveness of attacks
facilitated by patches, discovering that patterns exert a more pronounced
effect on performance than colors. Based on these findings, we propose a rapid
generation strategy for adversarial patches. This involves updating the colors
of high-performance adversarial patches to align with those of the new
environment, ensuring visual stealthiness without compromising adversarial
impact. This paper is the first to comprehensively examine the roles played by
patterns and colors in the context of adversarial patches.
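As a concrete reading of the rapid recoloring strategy, the sketch below snaps each
pixel of a high-performing patch to its nearest environment base color; the base
colors and the nearest-neighbor rule are illustrative assumptions rather than
CAPGen's actual procedure.

# Sketch: keep a patch's pattern but map its colors onto environment base colors.
import numpy as np

def recolor_patch(patch, base_colors):
    """patch: HxWx3 float array in [0,1]; base_colors: Kx3 array of environment colors."""
    flat = patch.reshape(-1, 3)
    # Assign every pixel to its nearest environment base color (Euclidean distance in RGB).
    d = np.linalg.norm(flat[:, None, :] - base_colors[None, :, :], axis=-1)
    assignment = d.argmin(axis=1)
    return base_colors[assignment].reshape(patch.shape)

# Usage sketch (base colors sampled from the new background):
# base_colors = np.array([[0.35, 0.40, 0.20], [0.55, 0.50, 0.35], [0.20, 0.25, 0.15]])
# camo_patch = recolor_patch(adversarial_patch, base_colors)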
☆ Buster: Incorporating Backdoor Attacks into Text Encoder to Mitigate NSFW Content Generation
In the digital age, the proliferation of deep learning models has led to
significant concerns about the generation of Not Safe for Work (NSFW) content.
Existing defense methods primarily involve model fine-tuning and post-hoc
content moderation. However, these approaches often lack scalability in
eliminating harmful content, degrade the quality of benign image generation, or
incur high inference costs. To tackle these challenges, we propose an
innovative framework called \textbf{Buster}, which injects backdoor attacks
into the text encoder to prevent NSFW content generation. Specifically, Buster
leverages deep semantic information rather than explicit prompts as triggers,
redirecting NSFW prompts towards targeted benign prompts. This approach
demonstrates exceptional resilience and scalability in mitigating NSFW content.
Remarkably, Buster fine-tunes the text encoder of Text-to-Image models within
just five minutes, showcasing high efficiency. Our extensive experiments reveal
that Buster outperforms all other baselines, achieving superior NSFW content
removal rate while preserving the quality of harmless images.
☆ Driving with InternVL: Outstanding Champion in the Track on Driving with Language of the Autonomous Grand Challenge at CVPR 2024
This technical report describes the methods we employed for the Driving with
Language track of the CVPR 2024 Autonomous Grand Challenge. We utilized a
powerful open-source multimodal model, InternVL-1.5, and conducted a
full-parameter fine-tuning on the competition dataset, DriveLM-nuScenes. To
effectively handle the multi-view images of nuScenes and seamlessly inherit
InternVL's outstanding multimodal understanding capabilities, we formatted and
concatenated the multi-view images in a specific manner. This ensured that the
final model could meet the specific requirements of the competition task while
leveraging InternVL's powerful image understanding capabilities. Meanwhile, we
designed a simple automatic annotation strategy that converts the center points
of objects in DriveLM-nuScenes into corresponding bounding boxes. As a result,
our single model achieved a score of 0.6002 on the final leaderboard.
☆ ArtFormer: Controllable Generation of Diverse 3D Articulated Objects
This paper presents a novel framework for modeling and conditional generation
of 3D articulated objects. Troubled by flexibility-quality tradeoffs, existing
methods are often limited to using predefined structures or retrieving shapes
from static datasets. To address these challenges, we parameterize an
articulated object as a tree of tokens and employ a transformer to generate
both the object's high-level geometry code and its kinematic relations.
Subsequently, each sub-part's geometry is further decoded using a
signed-distance-function (SDF) shape prior, facilitating the synthesis of
high-quality 3D shapes. Our approach enables the generation of diverse objects
with high-quality geometry and varying number of parts. Comprehensive
experiments on conditional generation from text descriptions demonstrate the
effectiveness and flexibility of our method.
comment: impl. repo: https://github.com/ShuYuMo2003/ArtFormer
☆ Repetitive Action Counting with Hybrid Temporal Relation Modeling IEEE
Repetitive Action Counting (RAC) aims to count the number of repetitive
actions occurring in videos. In the real world, repetitive actions have great
diversity and bring numerous challenges (e.g., viewpoint changes, non-uniform
periods, and action interruptions). Existing methods based on the temporal
self-similarity matrix (TSSM) for RAC are trapped in the bottleneck of
insufficiently capturing action periods when applied to complicated daily
videos.
To tackle this issue, we propose a novel method named Hybrid Temporal Relation
Modeling Network (HTRM-Net) to build diverse TSSM for RAC. The HTRM-Net mainly
consists of three key components: bi-modal temporal self-similarity matrix
modeling, random matrix dropping, and local temporal context modeling.
Specifically, we construct temporal self-similarity matrices by bi-modal
(self-attention and dual-softmax) operations, yielding diverse matrix
representations from the combination of row-wise and column-wise correlations.
To further enhance matrix representations, we propose incorporating a random
matrix dropping module to guide channel-wise learning of the matrix explicitly.
After that, we inject the local temporal context of video frames and the
learned matrix into temporal correlation modeling, which can make the model
robust enough to cope with error-prone situations, such as action interruption.
Finally, a multi-scale matrix fusion module is designed to aggregate temporal
correlations adaptively in multi-scale matrices. Extensive experiments across
intra- and cross-datasets demonstrate that the proposed method not only
outperforms current state-of-the-art methods but also exhibits robust
capabilities in accurately counting repetitive actions in unseen action
categories. Notably, our method surpasses the classical TransRAC method by
20.04\% in MAE and 22.76\% in OBO.
comment: To be published in IEEE Transactions on Multimedia
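A minimal sketch of building bi-modal temporal self-similarity matrices from
per-frame embeddings, one attention-style (row-wise softmax) and one via dual-softmax
(row-wise times column-wise); the exact formulation in HTRM-Net may differ.

# Sketch of bi-modal TSSM construction from frame embeddings.
import torch
import torch.nn.functional as F

def bimodal_tssm(frame_feats):
    """frame_feats: (T, D) per-frame embeddings -> two (T, T) self-similarity matrices."""
    sim = frame_feats @ frame_feats.t() / frame_feats.shape[-1] ** 0.5
    attn_tssm = F.softmax(sim, dim=-1)                         # row-wise, attention-style
    dual_tssm = F.softmax(sim, dim=0) * F.softmax(sim, dim=1)  # dual-softmax: row x column
    return torch.stack([attn_tssm, dual_tssm], dim=0)          # (2, T, T), fed to later modules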
☆ Deep Non-rigid Structure-from-Motion Revisited: Canonicalization and Sequence Modeling
Non-Rigid Structure-from-Motion (NRSfM) is a classic 3D vision problem, where
a 2D sequence is taken as input to estimate the corresponding 3D sequence.
Recently, the deep neural networks have greatly advanced the task of NRSfM.
However, existing deep NRSfM methods still have limitations in handling the
inherent sequence property and motion ambiguity associated with the NRSfM
problem. In this paper, we revisit deep NRSfM from two perspectives to address
the limitations of current deep NRSfM methods: (1) canonicalization and (2)
sequence modeling. We propose an easy-to-implement per-sequence
canonicalization method as opposed to the previous per-dataset canonicalization
approaches. With this in mind, we propose a sequence modeling method that
combines temporal information and a subspace constraint. As a result, we
achieve a more effective NRSfM reconstruction pipeline than previous
efforts. The effectiveness of our method is verified by testing the
sequence-to-sequence deep NRSfM pipeline with corresponding regularization
modules on several commonly used datasets.
comment: 9 pages main text, 7 pages appendix
☆ Moderating the Generalization of Score-based Generative Model
Score-based Generative Models (SGMs) have demonstrated remarkable
generalization abilities, e.g. generating unseen, but natural data. However,
the greater the generalization power, the more likely the unintended
generalization, and the more dangerous the abuse. Research on moderated
generalization in SGMs remains limited. To fill this gap, we first examine the
current 'gold standard' in Machine Unlearning (MU), i.e., re-training the model
after removing the undesirable training data, and find it does not work in
SGMs. Further analysis of score functions reveals that the MU 'gold standard'
does not alter the original score function, which explains its ineffectiveness.
Based on this insight, we propose the first Moderated Score-based Generative
Model (MSGM), which introduces a novel score adjustment strategy that redirects
the score function away from undesirable data during the continuous-time
stochastic differential equation process. Extensive experimental results
demonstrate that MSGM significantly reduces the likelihood of generating
undesirable content while preserving high visual quality for normal image
generation. Albeit designed for SGMs, MSGM is a general and flexible MU
framework that is compatible with diverse diffusion architectures (SGM and
DDPM) and training strategies (re-training and fine-tuning), and enables
zero-shot transfer of the pre-trained models to downstream tasks, e.g. image
inpainting and reconstruction. The code will be shared upon acceptance.
☆ Attention Head Purification: A New Perspective to Harness CLIP for Domain Generalization
Domain Generalization (DG) aims to learn a model from multiple source domains
to achieve satisfactory performance on unseen target domains. Recent works
introduce CLIP to DG tasks due to its superior image-text alignment and
zero-shot performance. Previous methods either utilize full fine-tuning or
prompt-learning paradigms to harness CLIP for DG tasks. Those works focus on
avoiding catastrophic forgetting of the original knowledge encoded in CLIP but
ignore that the knowledge encoded in CLIP may inherently contain domain-specific
cues that constrain its domain generalization performance. In this paper, we
propose a new perspective to harness CLIP for DG, i.e., attention head
purification. We observe that different attention heads may encode different
properties of an image and selecting heads appropriately may yield remarkable
performance improvement across domains. Based on such observations, we purify
the attention heads of CLIP from two levels, including task-level purification
and domain-level purification. For task-level purification, we design
head-aware LoRA to make each head more adapted to the task we considered. For
domain-level purification, we perform head selection via a simple gating
strategy. We utilize MMD loss to encourage masked head features to be more
domain-invariant to emphasize more generalizable properties/heads. During
training, we jointly perform task-level purification and domain-level
purification. We conduct experiments on various representative DG benchmarks.
Though simple, extensive experiments demonstrate that our method performs
favorably against previous state-of-the-arts.
☆ EchoIR: Advancing Image Restoration with Echo Upsampling and Bi-Level Optimization
Image restoration represents a fundamental challenge in low-level vision,
focusing on reconstructing high-quality images from their degraded
counterparts. With the rapid advancement of deep learning technologies,
transformer-based methods with pyramid structures have advanced the field by
capturing long-range cross-scale spatial interaction. Despite its popularity,
the degradation of essential features during the upsampling process notably
compromises restoration performance, resulting in suboptimal reconstruction
outcomes. We introduce EchoIR, a U-Net-like image restoration network with a
bilateral learnable upsampling mechanism to bridge this gap. Specifically, we
propose the Echo-Upsampler, which optimizes the upsampling process by learning
from the bilateral intermediate features of U-Net, the "Echo", aiming for a
more refined restoration by minimizing the degradation during upsampling. In
pursuit of a hierarchical model of the image restoration and upsampling
tasks, we propose the Approximated Sequential Bi-level Optimization (AS-BLO),
an advanced bi-level optimization model establishing a relationship between
upsampling learning and image restoration tasks. Extensive experiments against
the state-of-the-art (SOTA) methods demonstrate the proposed EchoIR surpasses
the existing methods, achieving SOTA performance in image restoration tasks.
☆ MPSI: Mamba enhancement model for pixel-wise sequential interaction Image Super-Resolution
Single image super-resolution (SR) has long posed a challenge in the field of
computer vision. While the advent of deep learning has led to the emergence of
numerous methods aimed at tackling this persistent issue, the current
methodologies still encounter challenges in modeling long sequence information,
leading to limitations in effectively capturing the global pixel interactions.
To tackle this challenge and achieve superior SR outcomes, we propose the Mamba
pixel-wise sequential interaction network (MPSI), aimed at enhancing the
establishment of long-range connections of information, particularly focusing
on pixel-wise sequential interaction. We propose the Channel-Mamba Block (CMB)
to capture comprehensive pixel interaction information by effectively modeling
long sequence information. Moreover, existing SR methodologies tend to
neglect the features extracted by preceding layers,
leading to the loss of valuable feature information. While certain existing
models strive to preserve these features, they frequently encounter difficulty
in establishing connections across all layers. To overcome this limitation,
MPSI introduces the Mamba channel recursion module (MCRM), which maximizes the
retention of valuable feature information from early layers, thereby
facilitating the acquisition of pixel sequence interaction information from
multiple-level layers. Through extensive experimentation, we demonstrate that
MPSI outperforms existing super-resolution methods in terms of image
reconstruction results, attaining state-of-the-art performance.
☆ Taylor Outlier Exposure
Out-of-distribution (OOD) detection is the task of identifying data sampled
from distributions that were not used during training. This task is essential
for reliable machine learning and a better understanding of models'
generalization capabilities. Among OOD detection methods, Outlier Exposure (OE)
significantly enhances OOD detection performance and generalization ability by
exposing auxiliary OOD data to the model. However, constructing clean auxiliary
OOD datasets, uncontaminated by in-distribution (ID) samples, is essential for
OE; generally, a noisy OOD dataset contaminated with ID samples negatively
impacts OE training dynamics and final detection performance. Furthermore, as
dataset scale increases, constructing clean OOD data becomes increasingly
challenging and costly. To address these challenges, we propose Taylor Outlier
Exposure (TaylorOE), an OE-based approach with regularization that allows
training on noisy OOD datasets contaminated with ID samples. Specifically, we
represent the OE regularization term as a polynomial function via a Taylor
expansion, allowing us to control the regularization strength for ID data in
the auxiliary OOD dataset by adjusting the order of Taylor expansion. In our
experiments on the OOD detection task with clean and noisy OOD datasets, we
demonstrate that the proposed method consistently outperforms conventional
methods and analyze our regularization term to show its effectiveness. Our
implementation code of TaylorOE is available at
\url{https://github.com/fukuchan41/TaylorOE}.
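One plausible reading of the Taylor-expanded regularizer, offered as a sketch only:
replace the -log p term of the usual uniformity loss on auxiliary data with its
truncated Taylor series -log p ≈ sum_{n=1..N} (1-p)^n / n, which is bounded and
therefore limits the influence of confidently classified (likely ID) samples inside a
noisy auxiliary set. The paper's exact objective may differ.

# Sketch of a polynomial (truncated-Taylor) surrogate for the OE uniformity loss.
import torch
import torch.nn.functional as F

def taylor_oe_loss(logits_ood, order=4):
    """Truncated Taylor expansion of -log p, averaged over classes and batch."""
    p = F.softmax(logits_ood, dim=-1)                          # (B, K)
    one_minus_p = 1.0 - p
    loss = sum((one_minus_p ** n) / n for n in range(1, order + 1))
    return loss.mean()

# total_loss = ce_loss_on_id + lambda_oe * taylor_oe_loss(model(x_aux), order=4)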
☆ Crack-EdgeSAM Self-Prompting Crack Segmentation System for Edge Devices
Structural health monitoring (SHM) is essential for the early detection of
infrastructure defects, such as cracks in concrete bridge piers, but often faces
challenges in efficiency and accuracy in complex environments. Although the
Segment Anything Model (SAM) achieves excellent segmentation performance, its
computational demands limit its suitability for real-time applications on edge
devices. To address these challenges, this paper proposes Crack-EdgeSAM, a
self-prompting crack segmentation system that integrates YOLOv8 for generating
prompt boxes and a fine-tuned EdgeSAM model for crack segmentation. To ensure
computational efficiency, the method employs ConvLoRA, a Parameter-Efficient
Fine-Tuning (PEFT) technique, along with DiceFocalLoss to fine-tune the EdgeSAM
model. Our experimental results on public datasets and the climbing robot
automatic inspections demonstrate that the system achieves high segmentation
accuracy and significantly enhanced inference speed compared to the most recent
methods. Notably, the system processes 1024 x 1024 pixel images at 46 FPS on
our PC and 8 FPS on Jetson Orin Nano.
☆ Learning Spatially Decoupled Color Representations for Facial Image Colorization
Image colorization methods have shown prominent performance on natural
images. However, since humans are more sensitive to faces, existing methods are
insufficient to meet the demands when applied to facial images, typically
showing unnatural and uneven colorization results. In this paper, we
investigate the facial image colorization task and find that the problems with
facial images can be attributed to an insufficient understanding of facial
components. As a remedy, by introducing facial component priors, we present a
novel facial image colorization framework dubbed FCNet. Specifically, we learn
a decoupled color representation for each face component (e.g., lips, skin,
eyes, and hair) under the guidance of face parsing maps. A chromatic and
spatial augmentation strategy is presented to facilitate the learning
procedure, which requires only grayscale and color facial image pairs. After
training, the presented FCNet can be naturally applied to facial image
colorization with single or multiple reference images. To expand the
application paradigms to scenarios with no reference images, we further train
two alternative modules, which predict the color representations from the
grayscale input or a random seed, respectively. Extensive experiments show that
our method can perform favorably against existing methods in various
application scenarios (i.e., no-, single-, and multi-reference facial image
colorization). The source code and pre-trained models will be publicly
available.
☆ A Parametric Approach to Adversarial Augmentation for Cross-Domain Iris Presentation Attack Detection WACV
Iris-based biometric systems are vulnerable to presentation attacks (PAs),
where adversaries present physical artifacts (e.g., printed iris images,
textured contact lenses) to defeat the system. This has led to the development
of various presentation attack detection (PAD) algorithms, which typically
perform well in intra-domain settings. However, they often struggle to
generalize effectively in cross-domain scenarios, where training and testing
employ different sensors, PA instruments, and datasets. In this work, we use
adversarial training samples of both bonafide irides and PAs to improve the
cross-domain performance of a PAD classifier. The novelty of our approach lies
in leveraging transformation parameters from classical data augmentation
schemes (e.g., translation, rotation) to generate adversarial samples. We
achieve this through a convolutional autoencoder, ADV-GEN, that inputs original
training samples along with a set of geometric and photometric transformations.
The transformation parameters act as regularization variables, guiding ADV-GEN
to generate adversarial samples in a constrained search space. Experiments
conducted on the LivDet-Iris 2017 database, comprising four datasets, and the
LivDet-Iris 2020 dataset, demonstrate the efficacy of our proposed method. The
code is available at https://github.com/iPRoBe-lab/ADV-GEN-IrisPAD.
comment: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV),
2025
☆ Fine-grained Text to Image Synthesis
Fine-grained text to image synthesis involves generating images from texts
that belong to different categories. In contrast to general text to image
synthesis, in fine-grained synthesis there is high similarity between images of
different subclasses, and there may be linguistic discrepancy among texts
describing the same image. Recent Generative Adversarial Networks (GAN), such
as the Recurrent Affine Transformation (RAT) GAN model, are able to synthesize
clear and realistic images from texts. However, these GAN models ignore
fine-grained information. In this paper, we propose an approach that
incorporates an
auxiliary classifier in the discriminator and a contrastive learning method to
improve the accuracy of fine-grained details in images synthesized by RAT GAN.
The auxiliary classifier helps the discriminator classify the class of images,
and helps the generator synthesize more accurate fine-grained images. The
contrastive learning method minimizes the similarity between images from
different subclasses and maximizes the similarity between images from the same
subclass. We evaluate against several state-of-the-art methods on the
commonly used CUB-200-2011 bird dataset and the Oxford-102 flower dataset, and
demonstrate superior performance.
☆ A Progressive Image Restoration Network for High-order Degradation Imaging in Remote Sensing
Recently, deep learning methods have gained remarkable achievements in the
field of image restoration for remote sensing (RS). However, most existing RS
image restoration methods focus mainly on conventional first-order degradation
models, which may not effectively capture the imaging mechanisms of remote
sensing images. Furthermore, many RS image restoration approaches that use deep
learning are often criticized for their lack of architectural transparency and
model interpretability. To address these problems, we propose a novel
progressive restoration network for high-order degradation imaging (HDI-PRNet),
to progressively restore different types of image degradation. HDI-PRNet is developed
based on the theoretical framework of degradation imaging, offering the benefit
of mathematical interpretability within the unfolding network. The framework is
composed of three main components: a module for image denoising that relies on
proximal mapping prior learning, a module for image deblurring that integrates
Neumann series expansion with dual-domain degradation learning, and a module
for super-resolution. Extensive experiments demonstrate that our method
achieves superior performance on both synthetic and real remote sensing images.
comment: 14 pages
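To make the Neumann-series component concrete: if the blur operator K is close to the
identity, its inverse can be approximated by the truncated series
K^{-1} y ≈ sum_{n=0..N} (I - K)^n y. The sketch below uses a Gaussian blur as a
stand-in for K; the paper instead learns the degradation with a dual-domain network,
which is not reproduced here.

# Sketch of truncated Neumann-series deblurring with a stand-in blur operator.
import numpy as np
from scipy.ndimage import gaussian_filter

def neumann_deblur(y, sigma=1.0, order=8):
    """Approximate deblurring of y via a truncated Neumann series of the blur operator."""
    K = lambda img: gaussian_filter(img, sigma=sigma)   # stand-in blur operator
    term = y.copy()
    x = y.copy()
    for _ in range(order):
        term = term - K(term)        # accumulates (I - K)^n y recursively
        x = x + term
    return x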
☆ A Step towards Automated and Generalizable Tactile Map Generation using Generative Adversarial Networks
Blindness and visual impairments affect many people worldwide. For help with
navigation, people with visual impairments often rely on tactile maps that
utilize raised surfaces and edges to convey information through touch. Although
these maps are helpful, they are often not widely available and current tools
to automate their production have similar limitations including only working at
certain scales, for particular world regions, or adhering to specific tactile
map standards. To address these shortcomings, we train a proof-of-concept model
as a first step towards applying computer vision techniques to help automate
the generation of tactile maps. We create a first-of-its-kind tactile maps
dataset of street-views from Google Maps spanning 6500 locations and including
different tactile line- and area-like features. Generative adversarial network
(GAN) models trained on a single zoom successfully identify key map elements,
remove extraneous ones, and perform inpainting with median F1 and
intersection-over-union (IoU) scores of better than 0.97 across all features.
Models trained on two zooms experience only minor drops in performance, and
generalize well both to unseen map scales and world regions. Finally, we
discuss future directions towards a full implementation of a tactile map
solution that builds on our results.
☆ Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly IEEE
Hang Du, Guoshun Nan, Jiawen Qian, Wangchenhui Wu, Wendi Deng, Hanqing Mu, Zhenyan Chen, Pengxuan Mao, Xiaofeng Tao, Jun Liu
Recent advancements in video anomaly understanding (VAU) have opened the door
to groundbreaking applications in various fields, such as traffic monitoring
and industrial automation. While current benchmarks in VAU predominantly
emphasize the detection and localization of anomalies, here we endeavor to
delve deeper into the practical aspects of VAU by addressing the essential
questions: "what anomaly occurred?", "why did it happen?", and "how severe is
this abnormal event?". In pursuit of these answers, we introduce a
comprehensive benchmark for Exploring the Causation of Video Anomalies (ECVA).
Our benchmark is meticulously designed, with each video accompanied by detailed
human annotations. Specifically, each instance of our ECVA involves three sets
of human annotations to indicate "what", "why" and "how" of an anomaly,
including 1) anomaly type, start and end times, and event descriptions, 2)
natural language explanations for the cause of an anomaly, and 3) free text
reflecting the effect of the abnormality. Building upon this foundation, we
propose a novel prompt-based methodology that serves as a baseline for tackling
the intricate challenges posed by ECVA. We utilize "hard prompt" to guide the
model to focus on the critical parts related to video anomaly segments, and
"soft prompt" to establish temporal and spatial relationships within these
anomaly segments. Furthermore, we propose AnomEval, a specialized evaluation
metric crafted to align closely with human judgment criteria for ECVA. This
metric leverages the unique features of the ECVA dataset to provide a more
comprehensive and reliable assessment of various video large language models.
We demonstrate the efficacy of our approach through rigorous experimental
analysis and delineate possible avenues for further investigation into the
comprehension of video anomaly causation.
comment: Submitted to IEEE Transactions on Pattern Analysis and Machine
Intelligence. arXiv admin note: substantial text overlap with
arXiv:2405.00181
☆ An Enhancement of CNN Algorithm for Rice Leaf Disease Image Classification in Mobile Applications
Kayne Uriel K. Rodrigo, Jerriane Hillary Heart S. Marcial, Samuel C. Brillo, Khatalyn E. Mata, Jonathan C. Morano
This study focuses on enhancing rice leaf disease image classification
algorithms, which have traditionally relied on Convolutional Neural Network
(CNN) models. We employed transfer learning with MobileViTV2_050 using
ImageNet-1k weights, a lightweight model that integrates CNN's local feature
extraction with Vision Transformers' global context learning through a
separable self-attention mechanism. Our approach resulted in a significant
15.66% improvement in classification accuracy for MobileViTV2_050-A, our first
enhanced model trained on the baseline dataset, achieving 93.14%. Furthermore,
MobileViTV2_050-B, our second enhanced model trained on a broader rice leaf
dataset, demonstrated a 22.12% improvement, reaching 99.6% test accuracy.
Additionally, MobileViTV2-A attained an F1-score of 93% across four rice labels
and a Receiver Operating Characteristic (ROC) curve ranging from 87% to 97%. In
terms of resource consumption, our enhanced models reduced the total parameters
of the baseline CNN model by up to 92.50%, from 14 million to 1.1 million.
These results indicate that MobileViTV2_050 not only improves computational
efficiency through its separable self-attention mechanism but also enhances
global context learning. Consequently, it offers a lightweight and robust
solution suitable for mobile deployment, advancing the interpretability and
practicality of models in precision agriculture.
comment: Presented at 46th World Conference on Applied Science, Engineering &
Technology (WCASET) from Institute for Educational Research and Publication
(IFERP)
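A minimal sketch of the transfer-learning setup, assuming the timm implementation of
MobileViTV2-0.5 with ImageNet-1k weights and four rice-disease classes; the dataset,
augmentations, and hyperparameters used in the study are not reproduced.

# Sketch: fine-tune a pretrained MobileViTV2-0.5 for four rice-leaf classes.
import timm
import torch

model = timm.create_model("mobilevitv2_050", pretrained=True, num_classes=4)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative training step on a batch of leaf images:
# images, labels = next(iter(train_loader))
# loss = criterion(model(images), labels)
# loss.backward(); optimizer.step(); optimizer.zero_grad()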
☆ Robust Feature Engineering Techniques for Designing Efficient Motor Imagery-Based BCI-Systems
A multitude of individuals across the globe grapple with motor disabilities.
Neural prosthetics utilizing Brain-Computer Interface (BCI) technology exhibit
promise for improving motor rehabilitation outcomes. The intricate nature of
EEG data poses a significant hurdle for current BCI systems. Recently, a
qualitative repository of EEG signals tied to both upper and lower limb
execution of motor and motor imagery tasks has been unveiled. Despite this, the
productivity of the Machine Learning (ML) Models that were trained on this
dataset was alarmingly deficient, and the evaluation framework seemed
insufficient. To enhance outcomes, robust feature engineering (signal
processing) methodologies are implemented. A collection of time domain,
frequency domain, and wavelet-derived features was obtained from 16-channel EEG
signals, and the Maximum Relevance Minimum Redundancy (MRMR) approach was
employed to identify the four most significant features. For classification K
Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree (DT), and
Na\"ive Bayes (NB) models were implemented with these selected features,
evaluating their effectiveness through metrics such as testing accuracy,
precision, recall, and F1 Score. By leveraging SVM with a Gaussian Kernel, a
remarkable maximum testing accuracy of 92.50% for motor activities and 95.48%
for imagery activities is achieved. These results are notably more dependable
and gratifying compared to the previous study, where the peak accuracy was
recorded at 74.36%. This research work provides an in-depth analysis of the MI
Limb EEG dataset and it will help in designing and developing simple,
cost-effective and reliable BCI systems for neuro-rehabilitation.
comment: 26 pages
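A minimal sketch of the kind of pipeline described above: simple time- and
frequency-domain features per EEG channel followed by an RBF-kernel SVM; the wavelet
features, MRMR selection, and the actual MI Limb dataset are omitted, and the feature
choices below are assumptions.

# Sketch of hand-crafted EEG features plus an RBF-kernel SVM classifier.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def extract_features(epoch, fs=250):
    """epoch: (n_channels, n_samples) -> concatenated per-channel features."""
    feats = []
    for ch in epoch:
        spectrum = np.abs(np.fft.rfft(ch)) ** 2
        freqs = np.fft.rfftfreq(ch.size, d=1.0 / fs)
        mu_power = spectrum[(freqs >= 8) & (freqs <= 13)].sum()     # mu band
        beta_power = spectrum[(freqs >= 13) & (freqs <= 30)].sum()  # beta band
        feats += [ch.mean(), ch.std(), np.ptp(ch), mu_power, beta_power]
    return np.array(feats)

# X = np.stack([extract_features(e) for e in epochs]); y = labels
# clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")).fit(X, y)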
★ Rate-In: Information-Driven Adaptive Dropout Rates for Improved Inference-Time Uncertainty Estimation
Accurate uncertainty estimation is crucial for deploying neural networks in
risk-sensitive applications such as medical diagnosis. Monte Carlo Dropout is a
widely used technique for approximating predictive uncertainty by performing
stochastic forward passes with dropout during inference. However, using static
dropout rates across all layers and inputs can lead to suboptimal uncertainty
estimates, as it fails to adapt to the varying characteristics of individual
inputs and network layers. Existing approaches optimize dropout rates during
training using labeled data, resulting in fixed inference-time parameters that
cannot adjust to new data distributions, compromising uncertainty estimates in
Monte Carlo simulations.
In this paper, we propose Rate-In, an algorithm that dynamically adjusts
dropout rates during inference by quantifying the information loss induced by
dropout in each layer's feature maps. By treating dropout as controlled noise
injection and leveraging information-theoretic principles, Rate-In adapts
dropout rates per layer and per input instance without requiring ground truth
labels. By quantifying the functional information loss in feature maps, we
adaptively tune dropout rates to maintain perceptual quality across diverse
medical imaging tasks and architectural configurations. Our extensive empirical
study on synthetic data and real-world medical imaging tasks demonstrates that
Rate-In improves calibration and sharpens uncertainty estimates compared to
fixed or heuristic dropout rates without compromising predictive performance.
Rate-In offers a practical, unsupervised, inference-time approach to optimizing
dropout for more reliable predictive uncertainty estimation in critical
applications.
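As a hedged sketch of the inference-time idea (not Rate-In's actual criterion), one
could pick, per layer and per input, the largest dropout rate whose induced change in
that layer's feature map stays under a budget; the relative-MSE proxy below is a crude
stand-in for the paper's information-theoretic measure.

# Sketch of per-layer, per-input dropout-rate tuning at inference time.
import torch
import torch.nn.functional as F

@torch.no_grad()
def tune_dropout_rate(features, budget=0.05, candidates=(0.05, 0.1, 0.2, 0.3, 0.5)):
    """features: a layer's activation tensor for one input; returns the chosen rate."""
    chosen = 0.0
    ref_energy = features.pow(2).mean()
    for p in candidates:
        dropped = F.dropout(features, p=p, training=True)
        loss = (dropped - features).pow(2).mean() / (ref_energy + 1e-8)
        if loss <= budget:
            chosen = p                 # keep the largest candidate rate within the budget
    return chosen

# During MC-Dropout inference, each layer's rate would be re-tuned per input before the
# stochastic forward passes.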
☆ 3A-YOLO: New Real-Time Object Detectors with Triple Discriminative Awareness and Coordinated Representations
Recent research on real-time object detectors (e.g., YOLO series) has
demonstrated the effectiveness of attention mechanisms for elevating model
performance. Nevertheless, existing methods neglect to deploy hierarchical
attention mechanisms in a unified manner to construct a more discriminative
YOLO head enriched with more useful intermediate features. To tackle this gap,
this work aims to leverage multiple attention mechanisms to hierarchically
enhance the triple discriminative awareness of the YOLO detection head and
complementarily learn the coordinated intermediate representations, resulting
in a new series of detectors denoted 3A-YOLO. Specifically, we first propose a
new head, the TDA-YOLO Module, which unifies the enhancement of representation
learning for scale-awareness, spatial-awareness, and task-awareness. Secondly,
we steer the intermediate features to coordinately learn the inter-channel
relationships and precise positional information. Finally, we perform neck
network improvements followed by introducing various tricks to boost the
adaptability of 3A-YOLO. Extensive experiments across COCO and VOC benchmarks
indicate the effectiveness of our detectors.
☆ Fast Occupancy Network
Occupancy Network has recently attracted much attention in autonomous
driving. Instead of monocular 3D detection and recent bird's eye view (BEV)
models that predict 3D bounding boxes of obstacles, an Occupancy Network
predicts the category of each voxel in a specified 3D space around the ego
vehicle by transforming the 3D detection task into a 3D voxel segmentation
task, which offers clear advantages in handling out-of-category obstacles and
providing a fine-grained 3D representation. However, existing methods usually
require far more computational resources than previous methods, which hinders
the deployment of Occupancy Network solutions in intelligent driving systems.
To address this problem, we make an
analysis of the bottleneck of Occupancy Network inference cost, and present a
simple and fast Occupancy Network model, which adopts a deformable 2D
convolutional layer to lift BEV feature to 3D voxel feature and presents an
efficient voxel feature pyramid network (FPN) module to improve performance
with little computational cost. Further, we add a 2D segmentation branch in
the perspective view after the feature extractor, which improves accuracy
while remaining cost-free during the inference phase. Experimental results
demonstrate
that our method consistently outperforms existing methods in both accuracy and
inference speed, which surpasses recent state-of-the-art (SOTA) OCCNet by 1.7%
with ResNet50 backbone with about 3X inference speedup. Furthermore, our method
can be easily applied to existing BEV models to transform them into Occupancy
Network models.
comment: 10 pages, 5 figures,
☆ Compositional Zero-Shot Learning with Contextualized Cues and Adaptive Contrastive Training
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations
of seen attributes and objects. Current CLIP-based methods in CZSL, despite
their advancements, often fail to effectively understand and link the
attributes and objects due to inherent limitations in CLIP's pretraining
mechanisms. To address these shortcomings, this paper introduces a novel
framework, Understanding and Linking Attributes and Objects (ULAO) in CZSL,
which comprises two innovative modules. The Understanding Attributes and
Objects (UAO) module improves primitive understanding by sequential primitive
prediction and leveraging recognized objects as contextual hints for attribute
classification. Concurrently, the Linking Attributes and Objects (LAO) module
improves the attribute-object linkage understanding through a new contrastive
learning strategy that incorporates tailored hard negative generation and
adaptive loss adjustments. We demonstrate our model's superiority by showcasing
its state-of-the-art performance across three benchmark datasets in both
Closed-World (CW) and Open-World (OW) scenarios.
☆ Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation AAAI 2025
To equip artificial intelligence with a comprehensive understanding towards a
temporal world, video and 4D panoptic scene graph generation abstracts visual
data into nodes to represent entities and edges to capture temporal relations.
Existing methods encode entity masks tracked across temporal dimensions (mask
tubes), then predict their relations with a temporal pooling operation, which
does not fully utilize the motion indicative of the entities' relations. To
overcome this limitation, we introduce a contrastive representation learning
framework that focuses on motion pattern for temporal scene graph generation.
Firstly, our framework encourages the model to learn close representations for
mask tubes of similar subject-relation-object triplets. Secondly, we seek to
push apart mask tubes from their temporally shuffled versions. Moreover, we
also learn distant representations for mask tubes belonging to the same video
but different triplets. Extensive experiments show that our motion-aware
contrastive framework significantly improves state-of-the-art methods on both
video and 4D datasets.
comment: Accepted at AAAI 2025
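A minimal sketch of the contrastive objective at a high level: pull a mask-tube
embedding toward a tube of the same subject-relation-object triplet and push it away
from its temporally shuffled version and from other triplets in the same video; the
temperature, weighting, and how tube embeddings are produced are assumptions.

# Sketch of a motion-aware contrastive loss with a shuffled-tube negative.
import torch
import torch.nn.functional as F

def motion_contrastive_loss(anchor, positive, shuffled, others, tau=0.1):
    """anchor/positive/shuffled: (D,) embeddings; others: (N, D) embeddings of other triplets."""
    anchor, positive, shuffled = (F.normalize(t, dim=-1) for t in (anchor, positive, shuffled))
    others = F.normalize(others, dim=-1)
    pos = torch.exp(anchor @ positive / tau)
    negs = torch.exp(anchor @ shuffled / tau) + torch.exp(anchor @ others.t() / tau).sum()
    return -torch.log(pos / (pos + negs))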
☆ Multi-Scale Contrastive Learning for Video Temporal Grounding AAAI 2025
Temporal grounding, which localizes video moments related to a natural
language query, is a core problem of vision-language learning and video
understanding. To encode video moments of varying lengths, recent methods
employ a multi-level structure known as a feature pyramid. In this structure,
lower levels concentrate on short-range video moments, while higher levels
address long-range moments. Because higher levels experience downsampling to
accommodate increasing moment length, their capacity to capture information is
reduced, which consequently degrades the information in moment
representations. To resolve this problem, we propose a contrastive learning
framework to capture salient semantics among video moments. Our key methodology
is to leverage samples from the feature space emanating from multiple stages of
the video encoder itself, requiring neither data augmentation nor online memory
banks to obtain positive and negative samples. To enable such an extension, we
introduce a sampling process to draw multiple video moments corresponding to a
common query. Subsequently, by utilizing these moments' representations across
video encoder layers, we instantiate a novel form of multi-scale and
cross-scale contrastive learning that links local short-range video moments
with global long-range video moments. Extensive experiments demonstrate the
effectiveness of our framework for not only long-form but also short-form video
grounding.
comment: Accepted at AAAI 2025
☆ QCResUNet: Joint Subject-level and Voxel-level Segmentation Quality Prediction
Deep learning has made significant strides in automated brain tumor
segmentation from magnetic resonance imaging (MRI) scans in recent years.
However, the reliability of these tools is hampered by the presence of
poor-quality segmentation outliers, particularly in out-of-distribution
samples, making their implementation in clinical practice difficult. Therefore,
there is a need for quality control (QC) to screen the quality of the
segmentation results. Although numerous automatic QC methods have been
developed for segmentation quality screening, most were designed for cardiac
MRI segmentation, which involves a single modality and a single tissue type.
Furthermore, most prior works only provided subject-level predictions of
segmentation quality and did not identify erroneous parts of the segmentation
that may require refinement. To address these limitations, we propose a novel
multi-task deep learning architecture, termed QCResUNet, which produces
subject-level segmentation-quality measures as well as voxel-level segmentation
error maps for each available tissue class. To validate the effectiveness of
the proposed method, we conducted experiments assessing its performance on
evaluating the quality of two distinct segmentation tasks. First, we aimed to
assess the quality of brain tumor segmentation results. For this task, we
performed experiments on one internal and two external datasets. Second, we
aimed to evaluate the segmentation quality of cardiac Magnetic Resonance
Imaging (MRI) data from the Automated Cardiac Diagnosis Challenge. The proposed
method achieved high performance in predicting subject-level
segmentation-quality metrics and accurately identifying segmentation errors on
a voxel basis. This has the potential to be used to guide human-in-the-loop
feedback to improve segmentations in clinical settings.
☆ Annotation Techniques for Judo Combat Phase Classification from Tournament Footage
This paper presents a semi-supervised approach to extracting and analyzing
combat phases in judo tournaments using live-streamed footage. The objective is
to automate the annotation and summarization of live streamed judo matches. We
train models that extract relevant entities and classify combat phases from
fixed-perspective judo recordings. We employ semi-supervised methods to address
limited labeled data in the domain. We build a model of combat phases via
transfer learning from a fine-tuned object detector to classify the presence,
activity, and standing state of the match. We evaluate our approach on a
dataset of 19 thirty-second judo clips, achieving F1 scores of 0.66, 0.78,
and 0.87 for the three classes, respectively, on a $20\%$ test hold-out. Our
results show initial promise for automating more complex information retrieval
tasks using rigorous methods with limited labeled data.
☆ Hero-SR: One-Step Diffusion for Super-Resolution with Human Perception Priors
Owing to the robust priors of diffusion models, recent approaches have shown
promise in addressing real-world super-resolution (Real-SR). However, achieving
semantic consistency and perceptual naturalness to meet human perception
demands remains difficult, especially under conditions of heavy degradation and
varied input complexities. To tackle this, we propose Hero-SR, a one-step
diffusion-based SR framework explicitly designed with human perception priors.
Hero-SR consists of two novel modules: the Dynamic Time-Step Module (DTSM),
which adaptively selects optimal diffusion steps for flexibly meeting human
perceptual standards, and the Open-World Multi-modality Supervision (OWMS),
which integrates guidance from both image and text domains through CLIP to
improve semantic consistency and perceptual naturalness. Through these modules,
Hero-SR generates high-resolution images that not only preserve intricate
details but also reflect human perceptual preferences. Extensive experiments
validate that Hero-SR achieves state-of-the-art performance in Real-SR. The
code will be publicly available upon paper acceptance.
comment: 16 pages, 9 figures
☆ RAP-SR: RestorAtion Prior Enhancement in Diffusion Models for Realistic Image Super-Resolution
Benefiting from their powerful generative capabilities, pretrained diffusion
models have garnered significant attention for real-world image
super-resolution (Real-SR). Existing diffusion-based SR approaches typically
utilize semantic information from degraded images and restoration prompts to
activate the prior for producing realistic high-resolution images. However,
general-purpose pretrained diffusion models, which are not designed for
restoration tasks, often have suboptimal priors, and manually defined prompts
may fail to fully exploit the generative potential. To address these
limitations, we
introduce RAP-SR, a novel restoration prior enhancement approach in pretrained
diffusion models for Real-SR. First, we develop the High-Fidelity Aesthetic
Image Dataset (HFAID), curated through a Quality-Driven Aesthetic Image
Selection Pipeline (QDAISP). Our dataset not only surpasses existing ones in
fidelity but also excels in aesthetic quality. Second, we propose the
Restoration Priors Enhancement Framework, which includes Restoration Priors
Refinement (RPR) and Restoration-Oriented Prompt Optimization (ROPO) modules.
RPR refines the restoration prior using the HFAID, while ROPO optimizes the
unique restoration identifier, improving the quality of the resulting images.
RAP-SR effectively bridges the gap between general-purpose models and the
demands of Real-SR by enhancing the restoration prior. Leveraging the plug-and-play
nature of RAP-SR, our approach can be seamlessly integrated into existing
diffusion-based SR methods, boosting their performance. Extensive experiments
demonstrate its broad applicability and state-of-the-art results. Codes and
datasets will be available upon acceptance.
comment: 15 pages, 12 figures
☆ MM-PoE: Multiple Choice Reasoning via Process of Elimination using Multi-Modal Models
This paper introduces Multiple Choice Reasoning via Process of Elimination
using Multi-Modal models, herein referred to as Multi-Modal Process of
Elimination (MM-PoE). This novel methodology is engineered to augment the
efficacy of Vision-Language Models (VLMs) in multiple-choice visual reasoning
tasks. Diverging from conventional approaches that evaluate each option
independently, MM-PoE employs a dual-step scoring paradigm that initially
identifies and excludes implausible choices, subsequently concentrating on the
most probable remaining options. This method emulates human test-taking
strategies, where individuals typically eliminate clearly incorrect answers
prior to selecting the optimal response. Our empirical evaluations, conducted
across three benchmark datasets, reveal that MM-PoE significantly improves both
zero-shot and few-shot performance of contemporary state-of-the-art VLMs.
Critically, this approach not only broadens the application of the elimination
process to multi-modal contexts but also allows few-shot experiments, thereby
addressing two principal limitations concerning usage of PoE only in zero-shot
settings and only with a language-only framework. As a result, MM-PoE not only
refines the reasoning capabilities of VLMs but also broadens their
applicability to complex visual question-answering scenarios. All code and
documentation supporting our work are available at
https://pypi.org/project/mm-poe/, enabling researchers and practitioners to
easily integrate and further develop these techniques.
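A minimal sketch of the dual-step scoring, assuming a per-option score is available
(e.g., the VLM's log-likelihood of each option given the image and question); the
elimination threshold and the masking prompt used for the second pass are assumptions.

# Sketch of process-of-elimination selection over per-option scores.
import numpy as np

def poe_select(option_scores, keep_ratio=0.5):
    """Step 1: eliminate implausible options; Step 2: choose among the survivors."""
    scores = np.asarray(option_scores, dtype=float)
    k = max(1, int(np.ceil(keep_ratio * len(scores))))
    survivors = np.argsort(scores)[-k:]            # keep the top-k options
    # Step 2 would re-score the survivors with a prompt that masks eliminated options;
    # here we simply pick the best survivor as a stand-in for that second pass.
    return survivors[np.argmax(scores[survivors])]

# Usage sketch: idx = poe_select([-12.3, -8.1, -15.0, -9.4]); prediction = options[idx]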
☆ MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation COLING 2025
Image Translation (IT) holds immense potential across diverse domains,
enabling the translation of textual content within images into various
languages. However, existing datasets often suffer from limitations in scale,
diversity, and quality, hindering the development and evaluation of IT models.
To address this issue, we introduce MIT-10M, a large-scale parallel corpus of
multilingual image translation with over 10M image-text pairs derived from
real-world data, which has undergone extensive data cleaning and multilingual
translation validation. It contains 840K images in three sizes, spanning 28
categories, three levels of task difficulty, and image-text pairs in 14
languages, which
is a considerable improvement on existing datasets. We conduct extensive
experiments to evaluate and train models on MIT-10M. The experimental results
clearly indicate that our dataset has higher adaptability when it comes to
evaluating the performance of the models in tackling challenging and complex
image translation tasks in the real world. Moreover, the performance of the
model fine-tuned with MIT-10M has tripled compared to the baseline model,
further confirming its superiority.
comment: Accepted in COLING 2025
☆ Integrating MedCLIP and Cross-Modal Fusion for Automatic Radiology Report Generation IEEE
Automating radiology report generation can significantly reduce the workload
of radiologists and enhance the accuracy, consistency, and efficiency of
clinical documentation. We propose a novel cross-modal framework that uses
MedCLIP as both a vision extractor and a retrieval mechanism to improve the
process of medical report generation. By extracting retrieved report features
and image features through an attention-based extract module, and integrating
them with a fusion module, our method improves the coherence and clinical
relevance of generated reports. Experimental results on the widely used
IU-Xray dataset demonstrate the effectiveness of our approach, showing
improvements over commonly used methods in both report quality and relevance.
Additionally,
ablation studies provide further validation of the framework, highlighting the
importance of accurate report retrieval and feature integration in generating
comprehensive medical reports.
comment: Accepted in IEEE Big Data 2024
☆ FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error
The rapid advancement of diffusion models has significantly improved
high-quality image generation, making generated content increasingly
challenging to distinguish from real images and raising concerns about
potential misuse. In this paper, we observe that diffusion models struggle to
accurately reconstruct mid-band frequency information in real images,
suggesting the limitation could serve as a cue for detecting diffusion model
generated images. Motivated by this observation, we propose a novel method
called Frequency-guided Reconstruction Error (FIRE), which, to the best of our
knowledge, is the first to investigate the influence of frequency decomposition
on reconstruction error. FIRE assesses the variation in reconstruction error
before and after the frequency decomposition, offering a robust method for
identifying diffusion model generated images. Extensive experiments show that
FIRE generalizes effectively to unseen diffusion models and maintains
robustness against diverse perturbations.
comment: 14 pages, 14 figures
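A minimal NumPy sketch of the frequency-split reconstruction-error idea described in the FIRE abstract above: a mid-frequency band is removed with an FFT mask and the change in reconstruction error is used as a detection statistic. The band cutoffs, the `reconstruct` callable (standing in for a diffusion inversion/denoising round-trip), and the sign convention of the score are illustrative assumptions, not FIRE's exact procedure.

```python
import numpy as np

def band_reject(img, low=0.1, high=0.5):
    """Suppress a mid-frequency band of a grayscale image via FFT masking.
    Cutoffs are fractions of the Nyquist radius and purely illustrative."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    r = np.sqrt((yy / (h / 2.0)) ** 2 + (xx / (w / 2.0)) ** 2)
    keep = ~((r >= low) & (r <= high))          # drop only the mid band
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * keep)))

def fire_like_score(img, reconstruct):
    """reconstruct: callable standing in for a diffusion round-trip
    (e.g., inversion followed by denoising); it is an assumption here."""
    err_full = np.mean((reconstruct(img) - img) ** 2)
    img_cut = band_reject(img)
    err_cut = np.mean((reconstruct(img_cut) - img_cut) ** 2)
    # Real images carry mid-band detail that diffusion models reconstruct
    # poorly; the change in error after removing that band serves as the cue.
    return err_full - err_cut
```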
☆ A multimodal ensemble approach for clear cell renal cell carcinoma treatment outcome prediction
Purpose: A reliable cancer prognosis model for clear cell renal cell
carcinoma (ccRCC) can enhance personalized treatment. We developed a
multi-modal ensemble model (MMEM) that integrates pretreatment clinical data,
multi-omics data, and histopathology whole slide image (WSI) data to predict
overall survival (OS) and disease-free survival (DFS) for ccRCC patients.
Methods: We analyzed 226 patients from The Cancer Genome Atlas Kidney Renal
Clear Cell Carcinoma (TCGA-KIRC) dataset, which includes OS, DFS follow-up
data, and five data modalities: clinical data, WSIs, and three multi-omics
datasets (mRNA, miRNA, and DNA methylation). Separate survival models were
built for OS and DFS. A Cox proportional hazards (CPH) model with forward
feature selection was used for clinical and multi-omics data. Features from WSIs were
extracted using ResNet and three general-purpose foundation models. A deep
learning-based CPH model predicted survival using encoded WSI features. Risk
scores from all models were combined based on training performance. Results:
Performance was assessed using concordance index (C-index) and AUROC. The
clinical feature-based CPH model received the highest weight for both OS and
DFS tasks. Among WSI-based models, the general-purpose foundation model (UNI)
achieved the best performance. The final MMEM model surpassed single-modality
models, achieving C-indices of 0.820 (OS) and 0.833 (DFS), and AUROC values of
0.831 (3-year patient death) and 0.862 (cancer recurrence). Using predicted
risk medians to stratify high- and low-risk groups, log-rank tests showed
improved performance in both OS and DFS compared to single-modality models.
Conclusion: MMEM is the first multi-modal model for ccRCC patients, integrating
five data modalities. It outperformed single-modality models in prognostic
ability and has the potential to assist in ccRCC patient management if
independently validated.
comment: 10 pages, 3 figures, 4 tables
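A small sketch of the late-fusion step described in the MMEM abstract above, in which per-modality risk scores are combined with weights derived from training performance. The weighting rule (C-index above chance, renormalized) and the input format are assumptions for illustration, not the exact combination scheme used in the paper.

```python
import numpy as np

def combine_risk_scores(train_cindex, risk_scores):
    """Combine per-modality risk scores into one ensemble score, weighting each
    modality by its training concordance index.
    train_cindex: dict modality -> C-index on training folds.
    risk_scores:  dict modality -> (n_patients,) standardized risk scores."""
    # Only reward discrimination above chance (C-index 0.5).
    weights = {m: max(c - 0.5, 0.0) for m, c in train_cindex.items()}
    total = sum(weights.values()) or 1.0
    ensemble = np.zeros_like(next(iter(risk_scores.values())), dtype=float)
    for m, w in weights.items():
        ensemble += (w / total) * risk_scores[m]
    return ensemble
```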
☆ Revisiting Lesion Tracking in 3D Total Body Photography
Wei-Lun Huang, Minghao Xue, Zhiyou Liu, Davood Tashayyod, Jun Kang, Amir Gandjbakhche, Misha Kazhdan, Mehran Armand
Melanoma is the most deadly form of skin cancer. Tracking the evolution of
nevi and detecting new lesions across the body is essential for the early
detection of melanoma. Despite prior work on longitudinal tracking of skin
lesions in 3D total body photography, there are still several challenges,
including 1) low accuracy for finding correct lesion pairs across scans, 2)
sensitivity to noisy lesion detection, and 3) lack of large-scale datasets with
numerous annotated lesion pairs. We propose a framework that takes in a pair of
3D textured meshes, matches lesions in the context of total body photography,
and identifies unmatchable lesions. We start by computing correspondence maps
bringing the source and target meshes to a template mesh. Using these maps to
define source/target signals over the template domain, we construct a flow
field aligning the mapped signals. The initial correspondence maps are then
refined by advecting forward/backward along the vector field. Finally, lesion
assignment is performed using the refined correspondence maps. We propose the
first large-scale dataset for skin lesion tracking with 25K lesion pairs across
198 subjects. The proposed method achieves a success rate of 89.9% (at 10 mm
criterion) for all pairs of annotated lesions and a matching accuracy of 98.2%
for subjects with more than 200 lesions.
☆ StyleMark: A Robust Watermarking Method for Art Style Images Against Black-Box Arbitrary Style Transfer
Arbitrary Style Transfer (AST) achieves the rendering of real natural images
into the painting styles of arbitrary art style images, promoting art
communication. However, misuse of unauthorized art style images for AST may
infringe on artists' copyrights. One countermeasure is robust watermarking,
which tracks image propagation by embedding copyright watermarks into carriers.
Unfortunately, AST-generated images lose the structural and semantic
information of the original style image, hindering end-to-end robust tracking
by watermarks. To fill this gap, we propose StyleMark, the first robust
watermarking method for black-box AST, which can be seamlessly applied to art
style images achieving precise attribution of artistic styles after AST.
Specifically, we propose a new style watermark network that adjusts the mean
activations of style features through multi-scale watermark embedding, thereby
planting watermark traces into the shared style feature space of style images.
Furthermore, we design a distribution squeeze loss, which constrains the
statistical distortion of content features, forcing the reconstruction network to focus on
integrating style features with watermarks, thus optimizing the intrinsic
watermark distribution. Finally, based on solid end-to-end training, StyleMark
mitigates the optimization conflict between robustness and watermark
invisibility through decoder fine-tuning under random noise. Experimental
results demonstrate that StyleMark exhibits significant robustness against
black-box AST and common pixel-level distortions, while also securely defending
against malicious adaptive attacks.
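As a rough illustration of a "distribution squeeze"-style constraint on content feature statistics, the PyTorch snippet below penalizes shifts in channel-wise mean and standard deviation between watermarked and original content features. The specific statistics and equal weighting are assumptions, not the paper's exact loss.

```python
import torch

def distribution_squeeze_loss(feat_wm, feat_orig):
    """Penalize shifts in channel-wise mean and standard deviation caused by
    watermark embedding. feat_wm, feat_orig: (B, C, H, W) content feature maps."""
    mu_w, mu_o = feat_wm.mean(dim=(2, 3)), feat_orig.mean(dim=(2, 3))
    sd_w, sd_o = feat_wm.std(dim=(2, 3)), feat_orig.std(dim=(2, 3))
    return ((mu_w - mu_o) ** 2 + (sd_w - sd_o) ** 2).mean()
```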
☆ DiffCLIP: Few-shot Language-driven Multimodal Classifier
Visual language models like Contrastive Language-Image Pretraining (CLIP)
have shown impressive performance in analyzing natural images with language
information. However, these models often encounter challenges when applied to
specialized domains such as remote sensing due to the limited availability of
image-text pairs for training. To tackle this issue, we introduce DiffCLIP, a
novel framework that extends CLIP to effectively convey comprehensive
language-driven semantic information for accurate classification of
high-dimensional multimodal remote sensing images. DiffCLIP is a few-shot
learning method that leverages unlabeled images for pretraining. It employs
unsupervised mask diffusion learning to capture the distribution of diverse
modalities without requiring labels. The modality-shared image encoder maps
multimodal data into a unified subspace, extracting shared features with
consistent parameters across modalities. A well-trained image encoder further
enhances learning by aligning visual representations with class-label text
information from CLIP. By integrating these approaches, DiffCLIP significantly
boosts CLIP performance using a minimal number of image-text pairs. We evaluate
DiffCLIP on widely used high-dimensional multimodal datasets, demonstrating its
effectiveness in addressing few-shot annotated classification tasks. DiffCLIP
achieves an overall accuracy improvement of 10.65% across three remote sensing
datasets compared with CLIP, while utilizing only 2-shot image-text pairs. The
code has been released at https://github.com/icey-zhang/DiffCLIP.
☆ TT-MPD: Test Time Model Pruning and Distillation
Pruning can be an effective method of compressing large pre-trained models
for inference speed acceleration. Previous pruning approaches rely on access to
the original training dataset for both pruning and subsequent fine-tuning.
However, access to the training data can be limited due to concerns such as
data privacy and commercial confidentiality. Furthermore, with covariate shift
(disparities between test and training data distributions), pruning and
finetuning with training datasets can hinder the generalization of the pruned
model to test data. To address these issues, pruning and finetuning the model
with test time samples becomes essential. However, test-time model pruning and
fine-tuning incur additional computation costs and slow down the model's
prediction speed, thus posing efficiency issues. Existing pruning methods are
not efficient enough for the test-time model pruning setting, since finetuning the
pruned model is needed to evaluate the importance of removable components. To
address this, we propose two variables to approximate the fine-tuned accuracy.
We then introduce an efficient pruning method that considers the approximated
finetuned accuracy and potential inference latency saving. To enhance
fine-tuning efficiency, we propose an efficient knowledge distillation method
that only needs to generate pseudo labels for a small set of finetuning samples
one time, thereby reducing the expensive pseudo-label generation cost.
Experimental results demonstrate that our method achieves a comparable or
superior tradeoff between test accuracy and inference latency, with a 32%
relative reduction in pruning and finetuning time compared to the best existing
method.
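A minimal sketch of the one-time pseudo-label idea from the abstract above: the unpruned model is run once over a small set of test-time samples, its outputs are cached, and the pruned student is then distilled against the cache so the expensive teacher is never re-invoked. The function names, the pure-KD objective, and the temperature are illustrative assumptions; the paper's full method also approximates fine-tuned accuracy for pruning decisions, which is not shown here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cache_teacher_logits(teacher, loader, device="cpu"):
    """Run the unpruned model once over test-time samples and cache its logits,
    so distillation never re-invokes the expensive teacher."""
    teacher.eval()
    return [(x.cpu(), teacher(x.to(device)).cpu()) for x in loader]

def distill_pruned(student, cached, epochs=1, lr=1e-4, T=2.0):
    """Fine-tune the pruned student against the cached soft labels (pure KD loss)."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    student.train()
    for _ in range(epochs):
        for x, t_logits in cached:
            loss = F.kl_div(F.log_softmax(student(x) / T, dim=-1),
                            F.softmax(t_logits / T, dim=-1),
                            reduction="batchmean") * T * T
            opt.zero_grad()
            loss.backward()
            opt.step()
```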
☆ Maya: An Instruction Finetuned Multilingual Multimodal Model
Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze-Yin Chan, S M Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth. S, Snehanshu Mukherjee, Alham Fikri Aji
The rapid development of large Vision-Language Models (VLMs) has led to
impressive results on academic benchmarks, primarily in widely spoken
languages. However, significant gaps remain in the ability of current VLMs to
handle low-resource languages and varied cultural contexts, largely due to a
lack of high-quality, diverse, and safety-vetted data. Consequently, these
models often struggle to understand low-resource languages and cultural nuances
in a manner free from toxicity. To address these limitations, we introduce
Maya, an open-source Multimodal Multilingual model. Our contributions are
threefold: 1) a multilingual image-text pretraining dataset in eight languages,
based on the LLaVA pretraining dataset; 2) a thorough analysis of toxicity
within the LLaVA dataset, followed by the creation of a novel toxicity-free
version across eight languages; and 3) a multilingual image-text model
supporting these languages, enhancing cultural and linguistic comprehension in
vision-language tasks. Code available at https://github.com/nahidalam/maya.
☆ A Powered Prosthetic Hand with Vision System for Enhancing the Anthropopathic Grasp
The anthropomorphism of the grasping process significantly benefits the
experience and grasping efficiency of prosthetic hand wearers. Currently,
prosthetic hands controlled by signals such as brain-computer interfaces (BCI)
and electromyography (EMG) face difficulties in precisely recognizing the
amputees' grasping gestures and executing anthropomorphic grasp processes.
Although prosthetic hands equipped with vision systems enable object feature
recognition, they lack perception of human grasping intention.
Therefore, this paper explores the estimation of grasping gestures solely
through visual data to accomplish anthropopathic grasping control and the
determination of grasping intention within a multi-object environment. To
address this, we propose the Spatial Geometry-based Gesture Mapping (SG-GM)
method, which constructs gesture functions based on the geometric features of
the human hand grasping processes. It's subsequently implemented on the
prosthetic hand. Furthermore, we propose the Motion Trajectory Regression-based
Grasping Intent Estimation (MTR-GIE) algorithm, which predicts the
pre-grasping object using regression prediction and prior spatial
segmentation estimation derived from the prosthetic hand's position and
trajectory. Experiments were conducted on grasping 8 common daily objects,
including a cup, fork, etc. The experimental results showed a similarity
coefficient $R^{2}$ of the grasping process of 0.911, a Root Mean Squared
Error ($RMSE$) of 2.47\degree, a grasping success rate of 95.43$\%$, and an
average grasping duration of 3.07$\pm$0.41 s. Furthermore, grasping
experiments in a multi-object environment were conducted. The average accuracy
of intent estimation reached 94.35$\%$. Our methodologies offer a
groundbreaking approach to enhancing prosthetic hand functionality and
provide valuable insights for future research.
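A toy sketch of trajectory-regression-based intent estimation in the spirit of the MTR-GIE description above: a linear motion model is fit to recent hand positions, extrapolated forward, and the nearest candidate object is selected. The linear model, the extrapolation horizon, and the nearest-object rule are assumptions; the actual algorithm also uses prior spatial segmentation.

```python
import numpy as np

def estimate_grasp_intent(trajectory, objects, horizon=0.5):
    """Toy intent estimator: fit a linear motion model to recent hand positions,
    extrapolate it forward, and pick the closest candidate object.
    trajectory: (T, 3) hand positions (oldest first); objects: (K, 3) centers."""
    t = np.arange(len(trajectory), dtype=float)
    A = np.stack([t, np.ones_like(t)], axis=1)                # p(t) ~ a * t + b
    coeffs, *_ = np.linalg.lstsq(A, trajectory, rcond=None)   # rows: slope a, offset b
    t_future = t[-1] + horizon * len(trajectory)               # extrapolation step
    predicted = coeffs[0] * t_future + coeffs[1]
    return int(np.argmin(np.linalg.norm(objects - predicted, axis=1)))
```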
☆ Primary visual cortex contributes to color constancy by predicting rather than discounting the illuminant: evidence from a computational study
Color constancy (CC) is an important ability of the human visual system to
stably perceive the colors of objects despite considerable changes in the color
of the light illuminating them. While increasing evidence from the field of
neuroscience supports that multiple levels of the visual system contribute to
the realization of CC, how the primary visual cortex (V1) plays a role in CC is
not fully resolved. Specifically, double-opponent (DO) neurons in V1 have been
thought to contribute to realizing a degree of CC, but the computational
mechanism is not clear. We build an electrophysiologically based V1 neural
model to learn the color of the light source from a natural image dataset with
the ground truth illuminants as the labels. Based on the qualitative and
quantitative analysis of the responsive properties of the learned model
neurons, we found that both the spatial structures and color weights of the
receptive fields of the learned model neurons are quite similar to those of the
simple and DO neurons recorded in V1. Computationally, DO cells perform more
robustly than the simple cells in V1 for illuminant prediction. Therefore, this
work provides computational evidence supporting that V1 DO neurons serve to
realize color constancy by encoding the illuminant, which is contradictory to
the common hypothesis that V1 contributes to CC by discounting the illuminant
using its DO cells. This evidence is expected to not only help resolve the
visual mechanisms of CC, but also provide inspiration to develop more effective
computer vision models.
comment: 26 pages, 11 figures
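A minimal PyTorch sketch of the core setup described above: a small convolutional model is trained to predict (encode) the illuminant from images, with ground-truth illuminants as labels. The architecture and the angular-error loss are generic placeholders, not the electrophysiologically based V1 model used in the study.

```python
import torch
import torch.nn as nn

class IlluminantNet(nn.Module):
    """Toy V1-like model: a bank of learned chromatic filters followed by
    global pooling and a linear readout of the illuminant chromaticity."""
    def __init__(self, n_units=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, n_units, kernel_size=7, padding=3),  # learned receptive fields
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                          # global response
        )
        self.readout = nn.Linear(n_units, 3)                  # RGB illuminant

    def forward(self, x):
        h = self.features(x).flatten(1)
        rgb = self.readout(h)
        return rgb / (rgb.norm(dim=1, keepdim=True) + 1e-8)   # chromaticity only

def angular_loss(pred, target):
    """Angular-error style loss against ground-truth illuminants."""
    cos = (pred * target).sum(1) / (pred.norm(dim=1) * target.norm(dim=1) + 1e-8)
    return torch.acos(cos.clamp(-1 + 1e-6, 1 - 1e-6)).mean()
```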
☆ Creative Portraiture: Exploring Creative Adversarial Networks and Conditional Creative Adversarial Networks
Convolutional neural networks (CNNs) have been combined with generative
adversarial networks (GANs) to create deep convolutional generative adversarial
networks (DCGANs) with great success. DCGANs have been used for generating
images and videos from creative domains such as fashion design and painting. A
common critique of the use of DCGANs in creative applications is that they are
limited in their ability to generate creative products because the generator
simply learns to copy the training distribution. We explore an extension of
DCGANs, creative adversarial networks (CANs). Using CANs, we generate novel,
creative portraits, using the WikiArt dataset to train the network. Moreover,
we introduce our extension of CANs, conditional creative adversarial networks
(CCANs), and demonstrate their potential to generate creative portraits
conditioned on a style label. We argue that generating products that are
conditioned, or inspired, on a style label closely emulates real creative
processes in which humans produce imaginative work that is still rooted in
previous styles.
☆ EvRepSL: Event-Stream Representation via Self-Supervised Learning for Event-Based Vision IEEE
Event-stream representation is the first step for many computer vision tasks
using event cameras. It converts the asynchronous event-streams into a
formatted structure so that conventional machine learning models can be applied
easily. However, most of the state-of-the-art event-stream representations are
manually designed and the quality of these representations cannot be guaranteed
due to the noisy nature of event-streams. In this paper, we introduce a
data-driven approach aiming at enhancing the quality of event-stream
representations. Our approach commences with the introduction of a new
event-stream representation based on spatial-temporal statistics, denoted as
EvRep. Subsequently, we theoretically derive the intrinsic relationship between
asynchronous event-streams and synchronous video frames. Building upon this
theoretical relationship, we train a representation generator, RepGen, in a
self-supervised learning manner accepting EvRep as input. Finally, the
event-streams are converted to high-quality representations, termed EvRepSL,
by passing them through the learned RepGen (without the need for fine-tuning or
retraining). Our methodology is rigorously validated through extensive
evaluations on a variety of mainstream event-based classification and optical
flow datasets (captured with various types of event cameras). The experimental
results highlight not only our approach's superior performance over existing
event-stream representations but also its versatility, being agnostic to
different event cameras and tasks.
comment: Published on IEEE Transactions on Image Processing
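A small NumPy sketch of an EvRep-like representation built from per-pixel spatio-temporal statistics of an event stream (polarity counts, mean and spread of normalized timestamps). The choice of channels and the event format are assumptions for illustration, not the paper's exact EvRep definition.

```python
import numpy as np

def evrep_like(events, H, W):
    """Per-pixel statistics tensor from an event stream.
    events: structured array with fields x, y, t, p (polarity in {-1, +1}).
    Channels: positive count, negative count, mean timestamp, timestamp std."""
    rep = np.zeros((4, H, W), dtype=np.float32)
    t = (events["t"] - events["t"].min()) / max(events["t"].ptp(), 1e-9)
    for x, y, ti, p in zip(events["x"], events["y"], t, events["p"]):
        rep[0 if p > 0 else 1, y, x] += 1.0
        rep[2, y, x] += ti
        rep[3, y, x] += ti * ti
    count = rep[0] + rep[1]
    mask = count > 0
    rep[2][mask] /= count[mask]                                     # mean timestamp
    rep[3][mask] = np.sqrt(np.maximum(rep[3][mask] / count[mask]
                                      - rep[2][mask] ** 2, 0.0))    # timestamp std
    return rep
```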
☆ Light Field Image Quality Assessment With Auxiliary Learning Based on Depthwise and Anglewise Separable Convolutions
In multimedia broadcasting, no-reference image quality assessment (NR-IQA) is
used to indicate the user-perceived quality of experience (QoE) and to support
intelligent data transmission while optimizing user experience. This paper
proposes an improved no-reference light field image quality assessment
(NR-LFIQA) metric for future immersive media broadcasting services. First, we
extend the concept of depthwise separable convolution (DSC) to the spatial
domain of light field image (LFI) and introduce "light field depthwise
separable convolution (LF-DSC)", which can extract the LFI's spatial features
efficiently. Second, we further theoretically extend the LF-DSC to the angular
space of LFI and introduce the novel concept of "light field anglewise
separable convolution (LF-ASC)", which is capable of extracting both the
spatial and angular features for comprehensive quality assessment with low
complexity. Third, we define the spatial and angular feature estimations as
auxiliary tasks in aiding the primary NR-LFIQA task by providing spatial and
angular quality features as hints. To the best of our knowledge, this work is
the first exploration of deep auxiliary learning with spatial-angular hints on
NR-LFIQA. Experiments were conducted in mainstream LFI datasets such as
Win5-LID and SMART with comparisons to the mainstream full reference IQA
metrics as well as the state-of-the-art NR-LFIQA methods. The experimental
results show that the proposed metric yields overall 42.86% and 45.95% smaller
prediction errors than the second-best benchmarking metric in Win5-LID and
SMART, respectively. In some challenging cases with particular distortion
types, the proposed metric can reduce the errors significantly by more than
60%.
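To make the separable-convolution idea in the abstract above concrete, the sketch below shows a standard depthwise-separable 2D convolution and a helper that applies such a convolution over the angular dimensions of a light field tensor by folding the spatial dimensions into the batch. This is one plausible reading of extending DSC to the angular space, not the paper's exact LF-DSC/LF-ASC operators.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Standard depthwise-separable 2D convolution (depthwise + pointwise)."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def anglewise_apply(lf, conv):
    """Apply a 2D conv over the angular dims of a light field tensor.
    lf: (B, C, U, V, H, W) with (U, V) angular and (H, W) spatial axes."""
    B, C, U, V, H, W = lf.shape
    x = lf.permute(0, 4, 5, 1, 2, 3).reshape(B * H * W, C, U, V)  # fold spatial into batch
    x = conv(x)                                                    # size-preserving conv
    C2 = x.shape[1]
    return x.reshape(B, H, W, C2, U, V).permute(0, 3, 4, 5, 1, 2)
```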
☆ Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling WACV
The advancement of vision-language models, particularly the Contrastive
Language-Image Pre-training (CLIP) model, has revolutionized the field of
machine learning by enabling robust zero-shot learning capabilities. These
capabilities allow models to understand and respond to previously unseen data
without task-specific training. However, adapting CLIP to integrate specialized
knowledge from various domains while retaining its zero-shot capabilities
remains a significant challenge. To address this, we introduce a novel prompt
ensemble learning approach called Group-wise Prompt Ensemble (GPE). This method
aims to enhance CLIP's zero-shot capabilities by incorporating new domain
knowledge while improving its adaptability and robustness against data
distribution shifts. Our approach hinges on three main strategies: prompt
grouping with masked attention to optimize CLIP's adaptability while
safeguarding its zero-shot capabilities; the incorporation of auxiliary prompts
for the seamless integration of new domain insights without disrupting the
original model's representation; and an ensemble learning strategy that
effectively merges original and new knowledge. Through rigorous
experimentation, including more challenging cross-dataset transfer evaluations,
our GPE method redefines the benchmarks for the adaptability and efficiency of
vision-language models, surpassing existing models across various scenarios.
comment: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
2025
☆ Stable Mean Teacher for Semi-supervised Video Action Detection AAAI
In this work, we focus on semi-supervised learning for video action
detection. Video action detection requires spatiotemporal localization in
addition to classification, and a limited amount of labels makes the model
prone to unreliable predictions. We present Stable Mean Teacher, a simple
end-to-end teacher-based framework that benefits from improved and temporally
consistent pseudo labels. It relies on a novel Error Recovery (EoR) module,
which learns from students' mistakes on labeled samples and transfers this
knowledge to the teacher to improve pseudo labels for unlabeled samples.
Moreover, existing spatiotemporal losses do not take temporal coherency into
account and are prone to temporal inconsistencies. To address this, we present
Difference of Pixels (DoP), a simple and novel constraint focused on temporal
consistency, leading to coherent temporal detections. We evaluate our approach
on four different spatiotemporal detection benchmarks: UCF101-24, JHMDB21, AVA,
and YouTube-VOS. Our approach outperforms the supervised baselines for action
detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3%
on AVA. Using merely 10% and 20% of data, it provides competitive performance
compared to the supervised baseline trained on 100% annotations on UCF101-24
and JHMDB21, respectively. We further evaluate its effectiveness on AVA for
scaling to large-scale datasets and YouTube-VOS for video object segmentation,
demonstrating its generalization capability to other tasks in the video domain.
Code and models are publicly available.
comment: AAAI Conference on Artificial Intelligence, Main Technical Track
(AAAI), 2025, Code: https://github.com/AKASH2907/stable_mean_teacher
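A minimal sketch of a temporal-consistency constraint in the spirit of the Difference of Pixels (DoP) idea above: frame-to-frame changes of the student's predictions are matched to those of the teacher's pseudo labels. The exact tensors, masking, and weighting used in the paper are not reproduced here.

```python
import torch

def difference_of_pixels_loss(pred, pseudo):
    """Match frame-to-frame changes of student predictions to those of the
    (teacher) pseudo labels. pred, pseudo: (B, T, H, W) detection maps."""
    d_pred = pred[:, 1:] - pred[:, :-1]      # temporal differences of student output
    d_pseudo = pseudo[:, 1:] - pseudo[:, :-1]
    return torch.mean((d_pred - d_pseudo) ** 2)
```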
♻ ☆ [MASK] is All You Need
In generative models, two paradigms have gained traction in various
applications: next-set prediction-based Masked Generative Models and next-noise
prediction-based Non-Autoregressive Models, e.g., Diffusion Models. In this
work, we propose using discrete-state models to connect them and explore their
scalability in the vision domain. First, we conduct a step-by-step analysis in
a unified design space across the two types of models, including
timestep independence, noise schedule, temperature, guidance strength, etc., in a
scalable manner. Second, we re-cast typical discriminative tasks, e.g., image
segmentation, as an unmasking process from [MASK] tokens on a discrete-state
model. This enables us to perform various sampling processes, including
flexible conditional sampling by only training once to model the joint
distribution. All aforementioned explorations lead to our framework named
Discrete Interpolants, which enables us to achieve state-of-the-art or
competitive performance compared to previous discrete-state based methods in
various benchmarks, like ImageNet256, MS COCO, and video dataset FaceForensics.
In summary, by leveraging [MASK] in discrete-state models, we can bridge Masked
Generative and Non-autoregressive Diffusion models, as well as generative and
discriminative tasks.
comment: Technical Report (WIP), Project Page(code, model, dataset):
https://compvis.github.io/mask/
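A generic sketch of the next-set unmasking loop that underlies masked generative models of this kind: sampling starts from all [MASK] tokens and, at each step, the most confident predictions are committed while the rest stay masked. The predictor interface, the linear unmasking schedule, and greedy argmax commitment are assumptions, not the specific Discrete Interpolants design.

```python
import torch

@torch.no_grad()
def iterative_unmask(predictor, length, mask_id, steps=8, device="cpu"):
    """Generic next-set unmasking loop. predictor(tokens) -> (1, length, vocab)
    logits; predictor, mask_id and the schedule are placeholders."""
    tokens = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    for step in range(steps):
        probs, preds = predictor(tokens).softmax(-1).max(-1)   # confidence, argmax
        still_masked = tokens == mask_id
        probs = probs.masked_fill(~still_masked, -1.0)         # rank masked slots only
        # Linear schedule: commit progressively more tokens at each step.
        n_commit = int(length * (step + 1) / steps) - int((~still_masked).sum())
        if n_commit > 0:
            idx = probs.topk(n_commit, dim=-1).indices
            tokens.scatter_(1, idx, preds.gather(1, idx))
    return tokens
```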
♻ ☆ ContRail: A Framework for Realistic Railway Image Synthesis using ControlNet
Deep learning has become a ubiquitous paradigm due to its extraordinary
effectiveness and applicability in numerous domains. However, the approach
suffers from the large amount of data required to realize the potential of
this type of model. An ever-growing sub-field of Artificial Intelligence,
Image Synthesis, aims to address this limitation through the design of
intelligent models capable of creating original and realistic images, an
endeavour which could drastically reduce the need for real data. The Stable Diffusion generation
paradigm recently propelled state-of-the-art approaches to exceed all previous
benchmarks. In this work, we propose the ContRail framework based on the novel
Stable Diffusion model ControlNet, which we empower through a multi-modal
conditioning method. We experiment with the task of synthetic railway image
generation, where we improve the performance in rail-specific tasks, such as
rail semantic segmentation by enriching the dataset with realistic synthetic
images.
comment: 9 pages, 5 figures, 2 tables
♻ ☆ VP-MEL: Visual Prompts Guided Multimodal Entity Linking
Multimodal Entity Linking (MEL) is extensively utilized in the domain of
information retrieval. However, existing MEL methods typically use mention
words as mentions for retrieval. This results in a significant dependence of
MEL on mention words, thereby constraining its capacity to effectively leverage
information from both images and text. In situations where mention words are
absent, MEL methods struggle to leverage image-text pairs for entity linking.
To solve these issues, we introduce a Visual Prompts guided Multimodal Entity
Linking (VP-MEL) task. VP-MEL directly marks specific regions within the image.
These markers are referred to as visual prompts in VP-MEL. Without mention
words, VP-MEL aims to utilize marked image-text pairs to align visual prompts
with specific entities in the knowledge bases. A new dataset for the VP-MEL
task, VPWiki, is proposed in this paper. Moreover, we propose a framework named
FBMEL, which enhances the significance of visual prompts and fully leverages
the information in image-text pairs. Experimental results on the VPWiki dataset
demonstrate that FBMEL outperforms baseline methods across multiple benchmarks
for the VP-MEL task.
♻ ☆ A No-Reference Medical Image Quality Assessment Method Based on Automated Distortion Recognition Technology: Application to Preprocessing in MRI-guided Radiotherapy
Zilin Wang, Shengqi Chen, Jianrong Dai, Shirui Qin, Ying Cao, Ruiao Zhao, Guohua Wu, Yuan Tang, Jiayun Chen
Objective: To develop a no-reference image quality assessment method using
automated distortion recognition to boost MRI-guided radiotherapy precision.
Methods: We analyzed 106,000 MR images from 10 patients with liver metastasis,
captured with the Elekta Unity MR-LINAC. Our No-Reference Quality Assessment
Model includes: 1) image preprocessing to enhance visibility of key diagnostic
features; 2) feature extraction and directional analysis using MSCN
coefficients across four directions to capture textural attributes and
gradients, vital for identifying image features and potential distortions;
3) integrative Quality Index (QI) calculation, which integrates features via
AGGD parameter estimation and K-means clustering. The QI, based on a weighted
MAD computation of directional scores, provides a comprehensive image quality
measure, robust against outliers. LOO-CV assessed model generalizability and
performance. Tumor tracking algorithm performance was compared with and
without preprocessing to verify tracking accuracy enhancements. Results:
Preprocessing significantly improved image quality, with the QI showing
substantial positive changes and surpassing other metrics. After
normalization, the QI's average value was 79.6 times higher than CNR,
indicating improved image definition and contrast. It also showed higher
sensitivity in detail recognition, with average values 6.5 times and 1.7
times higher than the Tenengrad gradient and entropy. The tumor tracking
algorithm confirmed significant tracking accuracy improvements with
preprocessed images, validating preprocessing effectiveness. Conclusions:
This study introduces a novel no-reference image quality evaluation method
based on automated distortion recognition, offering a new quality control
tool for MRIgRT tumor tracking. It enhances clinical application accuracy and
facilitates medical image quality assessment standardization, with
significant clinical and research value.
♻ ☆ AnomalyControl: Learning Cross-modal Semantic Features for Controllable Anomaly Synthesis
Anomaly synthesis is a crucial approach to augment abnormal data for
advancing anomaly inspection. Based on the knowledge from the large-scale
pre-training, existing text-to-image anomaly synthesis methods predominantly
focus on textual information or coarse-aligned visual features to guide the
entire generation process. However, these methods often lack sufficient
descriptors to capture the complicated characteristics of realistic anomalies
(e.g., the fine-grained visual pattern of anomalies), limiting the realism and
generalization of the generation process. To this end, we propose a novel
anomaly synthesis framework called AnomalyControl to learn cross-modal semantic
features as guidance signals, which could encode the generalized anomaly cues
from text-image reference prompts and improve the realism of synthesized
abnormal samples. Specifically, AnomalyControl adopts a flexible and
non-matching prompt pair (i.e., a text-image reference prompt and a targeted
text prompt), where a Cross-modal Semantic Modeling (CSM) module is designed to
extract cross-modal semantic features from the textual and visual descriptors.
Then, an Anomaly-Semantic Enhanced Attention (ASEA) mechanism is formulated to
allow CSM to focus on the specific visual patterns of the anomaly, thus
enhancing the realism and contextual relevance of the generated anomaly
features. Treating cross-modal semantic features as the prior, a Semantic
Guided Adapter (SGA) is designed to encode effective guidance signals for the
adequate and controllable synthesis process. Extensive experiments indicate
that AnomalyControl can achieve state-of-the-art results in anomaly synthesis
compared with existing methods while exhibiting superior performance for
downstream tasks.
♻ ☆ Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation CVPR 2025
Navigating unseen environments based on natural language instructions remains
difficult for egocentric agents in Vision-and-Language Navigation (VLN). While
recent advancements have yielded promising outcomes, they primarily rely on RGB
images for environmental representation, often overlooking the underlying
semantic knowledge and spatial cues. Intuitively, humans inherently ground
textual semantics within the spatial layout during indoor navigation. Inspired
by this, we propose a versatile Semantic Understanding and Spatial Awareness
(SUSA) architecture to facilitate navigation. SUSA includes a Textual Semantic
Understanding (TSU) module, which narrows the modality gap between instructions
and environments by generating and associating the descriptions of
environmental landmarks in the agent's immediate surroundings. Additionally, a
Depth-based Spatial Perception (DSP) module incrementally constructs a depth
exploration map, enabling a more nuanced comprehension of environmental
layouts. Experimental results demonstrate that SUSA's hybrid semantic-spatial
representations effectively enhance navigation performance, setting new
state-of-the-art performance across three VLN benchmarks (REVERIE, R2R, and
SOON). The source code will be publicly available.
comment: under review at CVPR 2025
♻ ☆ Normalizing Flows are Capable Generative Models
Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Navdeep Jaitly, Josh Susskind
Normalizing Flows (NFs) are likelihood-based models for continuous inputs.
They have demonstrated promising results on both density estimation and
generative modeling tasks, but have received relatively little attention in
recent years. In this work, we demonstrate that NFs are more powerful than
previously believed. We present TarFlow: a simple and scalable architecture
that enables highly performant NF models. TarFlow can be thought of as a
Transformer-based variant of Masked Autoregressive Flows (MAFs): it consists of
a stack of autoregressive Transformer blocks on image patches, alternating the
autoregression direction between layers. TarFlow is straightforward to train
end-to-end, and capable of directly modeling and generating pixels. We also
propose three key techniques to improve sample quality: Gaussian noise
augmentation during training, a post training denoising procedure, and an
effective guidance method for both class-conditional and unconditional
settings. Putting these together, TarFlow sets new state-of-the-art results on
likelihood estimation for images, beating the previous best methods by a large
margin, and generates samples with quality and diversity comparable to
diffusion models, for the first time with a stand-alone NF model. We make our
code available at https://github.com/apple/ml-tarflow.
♻ ☆ Hard-normal Example-aware Template Mutual Matching for Industrial Anomaly Detection
Anomaly detectors are widely used in industrial manufacturing to detect and
localize unknown defects in query images. These detectors are trained on
anomaly-free samples and have successfully distinguished anomalies from most
normal samples. However, hard-normal examples are scattered and far apart from
most normal samples, and thus they are often mistaken for anomalies by existing
methods. To address this issue, we propose Hard-normal Example-aware Template
Mutual Matching (HETMM), an efficient framework to build a robust
prototype-based decision boundary. Specifically, HETMM employs the proposed
Affine-invariant Template Mutual Matching (ATMM) to mitigate the effects
brought by affine transformations and easy-normal examples. By mutually
matching the pixel-level prototypes within the patch-level search spaces
between query and template set, ATMM can accurately distinguish between
hard-normal examples and anomalies, achieving low false-positive and
missed-detection rates. In addition, we also propose PTS to compress the
original template set for speed-up. PTS selects cluster centres and hard-normal
examples to preserve the original decision boundary, allowing this tiny set to
achieve comparable performance to the original one. Extensive experiments
demonstrate that HETMM outperforms state-of-the-art methods, while a tiny
60-sheet template set achieves competitive performance and real-time inference
speed (around 26.1 FPS) on a Quadro 8000 RTX GPU. HETMM is training-free and
can be hot-updated by directly inserting novel samples into the template set,
which can promptly address some incremental learning issues in industrial
manufacturing.
comment: This paper is recently accepted in the International Journal of
Computer Vision (IJCV). Please see our code at
https://github.com/NarcissusEx/HETMM
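A toy sketch of matching pixel-level features against patch-level search windows in a template set, scoring each query pixel by its best cosine similarity. This one-directional, non-affine version is only meant to convey the flavor of prototype-based matching; HETMM's ATMM is mutual (query-to-template and template-to-query) and affine-invariant.

```python
import torch
import torch.nn.functional as F

def template_match_score(query, templates, radius=2):
    """Pixel-to-patch template matching.
    query:     (C, H, W) features of the test image.
    templates: (N, C, H, W) features of the anomaly-free template set.
    Returns an (H, W) anomaly map: 1 - best cosine similarity per pixel."""
    C, H, W = query.shape
    q = F.normalize(query, dim=0).view(C, H * W)                  # (C, P)
    t = F.normalize(templates, dim=1)
    k = 2 * radius + 1
    t_unf = F.unfold(t, kernel_size=k, padding=radius)            # (N, C*k*k, P)
    t_unf = t_unf.view(t.shape[0], C, k * k, H * W)               # (N, C, K, P)
    sim = torch.einsum("cp,nckp->nkp", q, t_unf)                  # cosine similarities
    best = sim.amax(dim=(0, 1))                                   # best match per pixel
    return (1.0 - best).view(H, W)
```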
♻ ☆ Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
While recent foundational video generators produce visually rich output, they
still struggle with appearance drift, where objects gradually degrade or change
inconsistently across frames, breaking visual coherence. We hypothesize that
this is because there is no explicit supervision in terms of spatial tracking
at the feature level. We propose Track4Gen, a spatially aware video generator
that combines video diffusion loss with point tracking across frames, providing
enhanced spatial supervision on the diffusion features. Track4Gen merges the
video generation and point tracking tasks into a single network by making
minimal changes to existing video generation architectures. Using Stable Video
Diffusion as a backbone, Track4Gen demonstrates that it is possible to unify
video generation and point tracking, which are typically handled as separate
tasks. Our extensive evaluations show that Track4Gen effectively reduces
appearance drift, resulting in temporally stable and visually coherent video
generation. Project page: hyeonho99.github.io/track4gen
comment: Project page: hyeonho99.github.io/track4gen
♻ ☆ MID: A Comprehensive Shore-Based Dataset for Multi-Scale Dense Ship Occlusion and Interaction Scenarios
This paper introduces the Maritime Ship Navigation Behavior Dataset (MID),
designed to address challenges in ship detection within complex maritime
environments using Oriented Bounding Boxes (OBB). MID contains 5,673 images
with 135,884 finely annotated target instances, supporting both supervised and
semi-supervised learning. It features diverse maritime scenarios such as ship
encounters under varying weather, docking maneuvers, small target clustering,
and partial occlusions, filling critical gaps in datasets like HRSID, SSDD, and
NWPU-10. MID's images are sourced from high-definition video clips of
real-world navigation across 43 water areas, with varied weather and lighting
conditions (e.g., rain, fog). Manually curated annotations enhance the
dataset's variety, ensuring its applicability to real-world demands in busy
ports and dense maritime regions. This diversity equips models trained on MID
to better handle complex, dynamic environments, supporting advancements in
maritime situational awareness. To validate MID's utility, we evaluated 10
detection algorithms, providing an in-depth analysis of the dataset, detection
results from various models, and a comparative study of baseline algorithms,
with a focus on handling occlusions and dense target clusters. The results
highlight MID's potential to drive innovation in intelligent maritime traffic
monitoring and autonomous navigation systems. The dataset will be made publicly
available at https://github.com/VirtualNew/MID_DataSet.
♻ ☆ BudgetFusion: Perceptually-Guided Adaptive Diffusion Models
Diffusion models have shown unprecedented success in the task of
text-to-image generation. While these models are capable of generating
high-quality and realistic images, the complexity of sequential denoising has
raised societal concerns regarding high computational demands and energy
consumption. In response, various efforts have been made to improve inference
efficiency. However, most of the existing efforts have taken a fixed approach
with neural network simplification or text prompt optimization. Are the quality
improvements from all denoising computations equally perceivable to humans? We
observed that images from different text prompts may require different
computational efforts given the desired content. The observation motivates us
to present BudgetFusion, a novel model that suggests the most perceptually
efficient number of diffusion steps before a diffusion model starts to generate
an image. This is achieved by predicting multi-level perceptual metrics
relative to diffusion steps. With the popular Stable Diffusion as an example,
we conduct both numerical analyses and user studies. Our experiments show that
BudgetFusion saves up to five seconds per prompt without compromising
perceptual similarity. We hope this work can initiate efforts toward answering
a core question: how much do humans perceptually gain from images created by a
generative model, per watt of energy?
♻ ☆ M3TR: Generalist HD Map Construction with Variable Map Priors
Autonomous vehicles require road information for their operation, usually in
the form of HD maps. Since offline maps eventually become outdated or may only be
partially available, online HD map construction methods have been proposed to
infer map information from live sensor data. A key issue remains how to exploit
such partial or outdated map information as a prior. We introduce M3TR
(Multi-Masking Map Transformer), a generalist approach for HD map construction
both with and without map priors. We address shortcomings in ground truth
generation for Argoverse 2 and nuScenes and propose the first realistic
scenarios with semantically diverse map priors. Examining various query
designs, we use an improved method for integrating prior map elements into an HD
map construction model, increasing performance by +4.3 mAP. Finally, we show
that training across all prior scenarios yields a single Generalist model,
whose performance is on par with previous Expert models that can handle only
one specific type of map prior. M3TR thus is the first model capable of
leveraging variable map priors, making it suitable for real-world deployment.
Code is available at https://github.com/immel-f/m3tr
♻ ☆ MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion
We introduce MoRAG, a novel multi-part fusion based retrieval-augmented
generation strategy for text-based human motion generation. The method enhances
motion diffusion models by leveraging additional knowledge obtained through an
improved motion retrieval process. By effectively prompting large language
models (LLMs), we address spelling errors and rephrasing issues in motion
retrieval. Our approach utilizes a multi-part retrieval strategy to improve the
generalizability of motion retrieval across the language space. We create
diverse samples through the spatial composition of the retrieved motions.
Furthermore, by utilizing low-level, part-specific motion information, we can
construct motion samples for unseen text descriptions. Our experiments
demonstrate that our framework can serve as a plug-and-play module, improving
the performance of motion diffusion models. Code, pretrained models and sample
videos are available at: https://motion-rag.github.io/
♻ ☆ Why Fine-grained Labels in Pretraining Benefit Generalization?
Recent studies show that pretraining a deep neural network with fine-grained
labeled data, followed by fine-tuning on coarse-labeled data for downstream
tasks, often yields better generalization than pretraining with coarse-labeled
data. While there is ample empirical evidence supporting this, the theoretical
justification remains an open problem. This paper addresses this gap by
introducing a "hierarchical multi-view" structure to confine the input data
distribution. Under this framework, we prove that: 1) coarse-grained
pretraining only allows a neural network to learn the common features well,
while 2) fine-grained pretraining helps the network learn the rare features in
addition to the common ones, leading to improved accuracy on hard downstream
test samples.
comment: arXiv admin note: substantial text overlap with arXiv:2303.16887
♻ ☆ Neural Localizer Fields for Continuous 3D Human Pose and Shape Estimation NeurIPS 2024
With the explosive growth of available training data, single-image 3D human
modeling is ahead of a transition to a data-centric paradigm. A key to
successfully exploiting data scale is to design flexible models that can be
supervised from various heterogeneous data sources produced by different
researchers or vendors. To this end, we propose a simple yet powerful paradigm
for seamlessly unifying different human pose and shape-related tasks and
datasets. Our formulation is centered on the ability -- both at training and
test time -- to query any arbitrary point of the human volume, and obtain its
estimated location in 3D. We achieve this by learning a continuous neural field
of body point localizer functions, each of which is a differently parameterized
3D heatmap-based convolutional point localizer (detector). For generating
parametric output, we propose an efficient post-processing step for fitting
SMPL-family body models to nonparametric joint and vertex predictions. With
this approach, we can naturally exploit differently annotated data sources
including mesh, 2D/3D skeleton and dense pose, without having to convert
between them, and thereby train large-scale 3D human mesh and skeleton
estimation models that considerably outperform the state-of-the-art on several
public benchmarks including 3DPW, EMDB, EHF, SSP-3D and AGORA.
comment: Accepted at NeurIPS 2024
♻ ☆ Toon3D: Seeing Cartoons from New Perspectives
We recover the underlying 3D structure from images of cartoons and anime
depicting the same scene. This is an interesting problem domain because images
in creative media are often depicted without explicit geometric consistency for
storytelling and creative expression; they are only 3D in a qualitative sense.
While humans can easily perceive the underlying 3D scene from these images,
existing Structure-from-Motion (SfM) methods that assume 3D consistency fail
catastrophically. We present Toon3D for reconstructing geometrically
inconsistent images. Our key insight is to deform the input images while
recovering camera poses and scene geometry, effectively explaining away
geometrical inconsistencies to achieve consistency. This process is guided by
the structure inferred from monocular depth predictions. We curate a dataset
with multi-view imagery from cartoons and anime that we annotate with reliable
sparse correspondences using our user-friendly annotation tool. Our recovered
point clouds can be plugged into novel-view synthesis methods to experience
cartoons from viewpoints never drawn before. We evaluate against classical and
recent learning-based SfM methods, where Toon3D is able to obtain more reliable
camera poses and scene geometry.
comment: Please see our project page: https://toon3d.studio
♻ ☆ Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder
models for accelerating high-resolution diffusion models. Existing autoencoder
models have demonstrated impressive results at a moderate spatial compression
ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for
high spatial compression ratios (e.g., 64x). We address this challenge by
introducing two key techniques: (1) Residual Autoencoding, where we design our
models to learn residuals based on the space-to-channel transformed features to
alleviate the optimization difficulty of high spatial-compression autoencoders;
(2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phases
training strategy for mitigating the generalization penalty of high
spatial-compression autoencoders. With these designs, we improve the
autoencoder's spatial compression ratio up to 128 while maintaining the
reconstruction quality. Applying our DC-AE to latent diffusion models, we
achieve significant speedup without accuracy drop. For example, on ImageNet
512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup
on H100 GPU for UViT-H while achieving a better FID, compared with the widely
used SD-VAE-f8 autoencoder. Our code is available at
https://github.com/mit-han-lab/efficientvit.
comment: Preprint. First two authors contributed equally to this work. Update:
add diffusion model scaling results
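A rough sketch of the residual-autoencoding idea above, learning on top of a space-to-channel shortcut: a pixel-unshuffle rearrangement provides a lossless downsampling path, and a learned convolution only models the residual. Channel sizes, kernel choices, and the single-block form are assumptions, not DC-AE's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceToChannelResidualDown(nn.Module):
    """Downsampling block that learns a residual on top of a space-to-channel
    (pixel-unshuffle) shortcut, so the learned path only models what the
    lossless rearrangement cannot."""
    def __init__(self, c_in, c_out, factor=2):
        super().__init__()
        self.factor = factor
        self.proj = nn.Conv2d(c_in * factor * factor, c_out, kernel_size=1)
        self.learned = nn.Conv2d(c_in, c_out, kernel_size=3, stride=factor, padding=1)

    def forward(self, x):
        shortcut = self.proj(F.pixel_unshuffle(x, self.factor))  # space -> channel
        return shortcut + self.learned(x)                        # residual branch
```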
♻ ☆ AFD: Mitigating Feature Gap for Adversarial Robustness by Feature Disentanglement
Adversarial fine-tuning methods enhance adversarial robustness via
fine-tuning the pre-trained model in an adversarial training manner. However,
we identify that some specific latent features of adversarial samples are
confused by adversarial perturbation and lead to an unexpectedly increasing gap
between features in the last hidden layer of natural and adversarial samples.
To address this issue, we propose a disentanglement-based approach to
explicitly model and further remove the specific latent features. We introduce
a feature disentangler to separate out the specific latent features from the
features of the adversarial samples, thereby boosting robustness by eliminating
the specific latent features. Besides, we align clean features in the
pre-trained model with features of adversarial samples in the fine-tuned model,
to benefit from the intrinsic features of natural samples. Empirical
evaluations on three benchmark datasets demonstrate that our approach surpasses
existing adversarial fine-tuning methods and adversarial training baselines.
comment: 7 pages, 5 figures
♻ ☆ AFFSegNet: Adaptive Feature Fusion Segmentation Network for Microtumors and Multi-Organ Segmentation
Fuchen Zheng, Xinyi Chen, Xuhang Chen, Haolun Li, Xiaojiao Guo, Weihuang Liu, Chi-Man Pun, Shoujun Zhou
Medical image segmentation, a crucial task in computer vision, facilitates
the automated delineation of anatomical structures and pathologies, supporting
clinicians in diagnosis, treatment planning, and disease monitoring. Notably,
transformers employing shifted window-based self-attention have demonstrated
exceptional performance. However, their reliance on local window attention
limits the fusion of local and global contextual information, crucial for
segmenting microtumors and miniature organs. To address this limitation, we
propose the Adaptive Semantic Segmentation Network (ASSNet), a transformer
architecture that effectively integrates local and global features for precise
medical image segmentation. ASSNet comprises a transformer-based U-shaped
encoder-decoder network. The encoder utilizes shifted window self-attention
across five resolutions to extract multi-scale features, which are then
propagated to the decoder through skip connections. We introduce an augmented
multi-layer perceptron within the encoder to explicitly model long-range
dependencies during feature extraction. Recognizing the constraints of
conventional symmetrical encoder-decoder designs, we propose an Adaptive
Feature Fusion (AFF) decoder to complement our encoder. This decoder
incorporates three key components: the Long Range Dependencies (LRD) block, the
Multi-Scale Feature Fusion (MFF) block, and the Adaptive Semantic Center (ASC)
block. These components synergistically facilitate the effective fusion of
multi-scale features extracted by the decoder while capturing long-range
dependencies and refining object boundaries. Comprehensive experiments on
diverse medical image segmentation tasks, including multi-organ, liver tumor,
and bladder tumor segmentation, demonstrate that ASSNet achieves
state-of-the-art results. Code and models are available at:
\url{https://github.com/lzeeorno/ASSNet}.
comment: 8 pages, 4 figures, 3 tables
♻ ☆ Unlocking Feature Visualization for Deeper Networks with MAgnitude Constrained Optimization
Thomas Fel, Thibaut Boissin, Victor Boutin, Agustin Picard, Paul Novello, Julien Colin, Drew Linsley, Tom Rousseau, Rémi Cadène, Lore Goetschalckx, Laurent Gardes, Thomas Serre
Feature visualization has gained substantial popularity, particularly after
the influential work by Olah et al. in 2017, which established it as a crucial
tool for explainability. However, its widespread adoption has been limited due
to a reliance on tricks to generate interpretable images, and corresponding
challenges in scaling it to deeper neural networks. Here, we describe MACO, a
simple approach to address these shortcomings. The main idea is to generate
images by optimizing the phase spectrum while keeping the magnitude constant to
ensure that generated explanations lie in the space of natural images. Our
approach yields significantly better results (both qualitatively and
quantitatively) and unlocks efficient and interpretable feature visualizations
for large state-of-the-art neural networks. We also show that our approach
exhibits an attribution mechanism allowing us to augment feature visualizations
with spatial importance. We validate our method on a novel benchmark for
comparing feature visualization methods, and release its visualizations for all
classes of the ImageNet dataset on https://serre-lab.github.io/Lens/.
Overall, our approach unlocks, for the first time, feature visualizations for
large, state-of-the-art deep neural networks without resorting to any
parametric prior image model.
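A compact sketch of magnitude-constrained feature visualization as described above: only the Fourier phase is optimized while a fixed magnitude spectrum (e.g., averaged from natural images) keeps samples close to natural-image statistics. The `model` and `magnitude` inputs, optimizer settings, and the absence of the attribution-based augmentation are all simplifying assumptions.

```python
import torch

def phase_only_visualization(model, unit, magnitude, steps=256, lr=0.05):
    """Optimize only the Fourier phase of an image under a fixed magnitude spectrum.
    magnitude: (C, H, W//2 + 1) fixed real-FFT magnitude (placeholder input).
    model(img) is assumed to return per-unit activations of shape (1, units)."""
    phase = torch.randn(magnitude.shape, requires_grad=True)
    opt = torch.optim.Adam([phase], lr=lr)
    for _ in range(steps):
        spectrum = magnitude * torch.exp(1j * phase)   # fixed |F|, free angle
        img = torch.fft.irfft2(spectrum)               # back to pixel space
        loss = -model(img.unsqueeze(0))[0, unit]       # maximize the chosen unit
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.fft.irfft2(magnitude * torch.exp(1j * phase)).detach()
```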
♻ ☆ Stable-Hair: Real-World Hair Transfer via Diffusion Model
Current hair transfer methods struggle to handle diverse and intricate
hairstyles, limiting their applicability in real-world scenarios. In this
paper, we propose a novel diffusion-based hair transfer framework, named
\textit{Stable-Hair}, which robustly transfers a wide range of real-world
hairstyles to user-provided faces for virtual hair try-on. To achieve this
goal, our Stable-Hair framework is designed as a two-stage pipeline. In the
first stage, we train a Bald Converter alongside stable diffusion to remove
hair from the user-provided face images, resulting in bald images. In the
second stage, we specifically design a Hair Extractor and a Latent
IdentityNet to transfer the target hairstyle to the bald image with high
detail and fidelity. The Hair Extractor is trained to encode
reference images with the desired hairstyles, while the Latent IdentityNet
ensures consistency in identity and background. To minimize color deviations
between source images and transfer results, we introduce a novel Latent
ControlNet architecture, which functions as both the Bald Converter and Latent
IdentityNet. After training on our curated triplet dataset, our method
accurately transfers highly detailed and high-fidelity hairstyles to the source
images. Extensive experiments demonstrate that our approach achieves
state-of-the-art performance compared to existing hair transfer methods.
Project page: \url{https://xiaojiu-z.github.io/Stable-Hair.github.io/}
♻ ☆ Unsupervised Learning of Unbiased Visual Representations IEEE
Deep neural networks often struggle to learn robust representations in the
presence of dataset biases, leading to suboptimal generalization on unbiased
datasets. This limitation arises because the models heavily depend on
peripheral and confounding factors, inadvertently acquired during training.
Existing approaches to address this problem typically involve explicit
supervision of bias attributes or reliance on prior knowledge about the biases.
In this study, we address the challenging scenario where no explicit
annotations of bias are available, and there is no prior knowledge about its
nature. We present a fully unsupervised debiasing framework with three key
steps: firstly, leveraging the inherent tendency to learn malignant biases to
acquire a bias-capturing model; next, employing a pseudo-labeling process to
obtain bias labels; and finally, applying cutting-edge supervised debiasing
techniques to achieve an unbiased model. Additionally, we introduce a
theoretical framework for evaluating model biasedness and conduct a detailed
analysis of how biases impact neural network training. Experimental results on
both synthetic and real-world datasets demonstrate the effectiveness of our
method, showcasing state-of-the-art performance in various settings,
occasionally surpassing fully supervised debiasing approaches.
comment: Accepted at IEEE Transactions on Artificial Intelligence (TAI)
♻ ☆ DeCLIP: Decoding CLIP representations for deepfake localization WACV
Generative models can create entirely new images, but they can also partially
modify real images in ways that are undetectable to the human eye. In this
paper, we address the challenge of automatically detecting such local
manipulations. One of the most pressing problems in deepfake detection remains
the ability of models to generalize to different classes of generators. In the
case of fully manipulated images, representations extracted from large
self-supervised models (such as CLIP) provide a promising direction towards
more robust detectors. Here, we introduce DeCLIP, a first attempt to leverage
such large pretrained features for detecting local manipulations. We show that,
when combined with a reasonably large convolutional decoder, pretrained
self-supervised representations are able to perform localization and improve
generalization capabilities over existing methods. Unlike previous work, our
approach is able to perform localization on the challenging case of latent
diffusion models, where the entire image is affected by the fingerprint of the
generator. Moreover, we observe that this type of data, which combines local
semantic information with a global fingerprint, provides more stable
generalization than other categories of generative methods.
comment: Accepted at Winter Conference on Applications of Computer Vision
(WACV) 2025
♻ ☆ Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
We study the scaling properties of latent diffusion models (LDMs) with an
emphasis on their sampling efficiency. While improved network architecture and
inference algorithms have shown to effectively boost sampling efficiency of
diffusion models, the role of model size -- a critical determinant of sampling
efficiency -- has not been thoroughly examined. Through empirical analysis of
established text-to-image diffusion models, we conduct an in-depth
investigation into how model size influences sampling efficiency across varying
sampling steps. Our findings unveil a surprising trend: when operating under a
given inference budget, smaller models frequently outperform their larger
equivalents in generating high-quality results. Moreover, we extend our study
to demonstrate the generalizability of these findings by applying various
diffusion samplers, exploring diverse downstream tasks, evaluating
post-distilled models, as well as comparing performance relative to training
compute. These findings open up new pathways for the development of LDM scaling
strategies which can be employed to enhance generative capabilities within
limited inference budgets.
comment: Accepted to TMLR. Camera-ready version
♻ ☆ Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models
In image editing employing diffusion models, it is crucial to preserve the
reconstruction fidelity to the original image while changing its style.
Although existing methods ensure reconstruction fidelity through optimization,
a drawback of these is the significant amount of time required for
optimization. In this paper, we propose negative-prompt inversion, a method
capable of achieving equivalent reconstruction solely through forward
propagation without optimization, thereby enabling ultrafast editing processes.
We experimentally demonstrate that the reconstruction fidelity of our method is
comparable to that of existing methods, allowing for inversion at a resolution
of 512 pixels and with 50 sampling steps within approximately 5 seconds, which
is more than 30 times faster than null-text inversion. Reduction of the
computation time by the proposed method further allows us to use a larger
number of sampling steps in diffusion models to improve the reconstruction
fidelity with a moderate increase in computation time.
comment: 20 pages, 14 figures
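One ingredient of optimization-free inversion is a deterministic DDIM inversion pass that only runs the noise predictor forward. The sketch below assumes an abstract `eps_model(x, t, cond)` callable and standard DDIM notation; it illustrates the general idea of inversion by forward propagation only, not the paper's exact procedure.

```python
# Hedged sketch of optimization-free DDIM inversion: a single forward pass that
# maps a clean latent to a noised latent by reusing the model's noise
# predictions. `eps_model(x, t, cond)` stands in for any pretrained
# epsilon-prediction network; its signature is an assumption for illustration.
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, cond, alphas_cumprod, timesteps):
    """Deterministically push x0 toward x_T (no per-image optimization)."""
    x = x0
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t, cond)                      # predicted noise at step t
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # inverted latent; editing then samples from it, with the source
              # prompt reused as the negative prompt (the paper's key idea).
```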
♻ ☆ RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance NeurIPS 2024
Zhicheng Sun, Zhenhao Yang, Yang Jin, Haozhe Chi, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Yang Song, Kun Gai, Yadong Mu
Customizing diffusion models to generate identity-preserving images from
user-provided reference images is an intriguing new problem. The prevalent
approaches typically require training on extensive domain-specific images to
achieve identity preservation, which lacks flexibility across different use
cases. To address this issue, we exploit classifier guidance, a training-free
technique that steers diffusion models using an existing classifier, for
personalized image generation. Our study shows that based on a recent rectified
flow framework, the major limitation of vanilla classifier guidance in
requiring a special classifier can be resolved with a simple fixed-point
solution, allowing flexible personalization with off-the-shelf image
discriminators. Moreover, its solving procedure proves to be stable when
anchored to a reference flow trajectory, with a convergence guarantee. The
derived method is implemented on rectified flow with different off-the-shelf
image discriminators, delivering advantageous personalization results for human
faces, live subjects, and certain objects. Code is available at
https://github.com/feifeiobama/RectifID.
comment: NeurIPS 2024
♻ ☆ Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Visual Question-Answering (VQA) has become key to user experience,
particularly after improved generalization capabilities of Vision-Language
Models (VLMs). However, evaluating VLMs for a given application requirement
using a standardized framework in practical settings remains challenging. This
paper addresses that gap with an end-to-end framework. We present VQA360 - a novel
dataset derived from established VQA benchmarks, annotated with task types,
application domains, and knowledge types, for a comprehensive evaluation. We
also introduce GoEval, a multimodal evaluation metric developed using GPT-4o,
achieving a correlation factor of 56.71% with human judgments. Our experiments
with state-of-the-art VLMs reveal that no single model excels universally,
thus making the right choice a key design decision. Proprietary models such as
Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source
models like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive
strengths, while providing additional advantages. Our framework can also be
extended to other tasks.
comment: 8 pages + references + 6 pages of Appendix
♻ ☆ Beyond Skip Connection: Pooling and Unpooling Design for Elimination Singularities
Training deep Convolutional Neural Networks (CNNs) presents unique
challenges, including the pervasive issue of elimination singularities: the
consistent deactivation of nodes, leading to degenerate manifolds within the
loss landscape. These singularities impede efficient learning by disrupting
feature propagation. To mitigate this, we introduce Pool Skip, an architectural
enhancement that strategically combines a Max Pooling, a Max Unpooling, a 3x3
convolution, and a skip connection. This configuration helps stabilize
the training process and maintain feature integrity across layers. We also
propose the Weight Inertia hypothesis, which underpins the development of Pool
Skip, providing theoretical insights into mitigating degradation caused by
elimination singularities through dimensional and affine compensation. We
evaluate our method on a variety of benchmarks, focusing on both 2D natural and
3D medical imaging applications, including tasks such as classification and
segmentation. Our findings highlight Pool Skip's effectiveness in facilitating
more robust CNN training and improving model performance.
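A minimal PyTorch sketch of a Pool Skip-style block follows, assuming one plausible ordering of the four listed operations; the paper may arrange them differently.

```python
# Hedged sketch of a Pool Skip-style block: max pooling, a 3x3 convolution,
# max unpooling, and a residual connection back to the input. The ordering of
# the operations is an assumption for illustration.
import torch
import torch.nn as nn

class PoolSkip(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

    def forward(self, x):
        pooled, indices = self.pool(x)
        refined = self.conv(pooled)
        restored = self.unpool(refined, indices, output_size=x.shape)
        return x + restored  # skip connection keeps feature propagation alive

y = PoolSkip(64)(torch.randn(1, 64, 32, 32))  # same shape as the input
```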
♻ ☆ A consensus-constrained parsimonious Gaussian mixture model for clustering hyperspectral images
The use of hyperspectral imaging to investigate food samples has grown due to
the improved performance and lower cost of instrumentation. Food engineers use
hyperspectral images to classify the type and quality of a food sample,
typically using classification methods. In order to train these methods, every
pixel in each training image needs to be labelled. Typically, computationally
cheap threshold-based approaches are used to label the pixels, and
classification methods are trained based on those labels. However,
threshold-based approaches are subjective and cannot be generalized across
hyperspectral images taken in different conditions and of different foods. Here
a consensus-constrained parsimonious Gaussian mixture model (ccPGMM) is
proposed to label pixels in hyperspectral images using a model-based clustering
approach. The ccPGMM utilizes information that is available on some pixels and
specifies constraints on those pixels belonging to the same or different
clusters while clustering the rest of the pixels in the image. A latent
variable model is used to represent the high-dimensional data in terms of a
small number of underlying latent factors. To ensure computational feasibility,
a consensus clustering approach is employed, where the data are divided into
multiple randomly selected subsets of variables and constrained clustering is
applied to each data subset; the clustering results are then consolidated
across all data subsets to provide a consensus clustering solution. The ccPGMM
approach is applied to simulated datasets and real hyperspectral images of
three types of puffed cereal (corn, rice, and wheat). Improved clustering
performance and computational efficiency are demonstrated when compared to
other current state-of-the-art approaches.
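For intuition, the consensus step alone might look like the sketch below: cluster random subsets of spectral bands with a Gaussian mixture, accumulate a co-association matrix, and cluster that matrix for the final labels. The parsimonious latent-factor model and the must-link/cannot-link constraints from the paper are omitted, and scikit-learn 1.2 or newer is assumed.

```python
# Hedged sketch of the consensus step only (no constraints, no latent factors).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import AgglomerativeClustering

def consensus_cluster(X, n_clusters=3, n_subsets=10, subset_size=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    coassoc = np.zeros((n, n))
    for _ in range(n_subsets):
        cols = rng.choice(d, size=min(subset_size, d), replace=False)
        labels = GaussianMixture(n_components=n_clusters,
                                 random_state=seed).fit_predict(X[:, cols])
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= n_subsets
    # Final consensus partition from the averaged co-association matrix
    # (practical only for modest numbers of pixels n).
    return AgglomerativeClustering(n_clusters=n_clusters, metric="precomputed",
                                   linkage="average").fit_predict(1.0 - coassoc)
```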
♻ ☆ Scalable Autoregressive Image Generation with Mamba
We introduce AiM, an autoregressive (AR) image generative model based on
Mamba architecture. AiM employs Mamba, a novel state-space model characterized
by its exceptional performance for long-sequence modeling with linear time
complexity, to supplant the commonly utilized Transformers in AR image
generation models, aiming to achieve both superior generation quality and
enhanced inference speed. Unlike existing methods that adapt Mamba to handle
two-dimensional signals via multi-directional scan, AiM directly utilizes the
next-token prediction paradigm for autoregressive image generation. This
approach circumvents the need for extensive modifications to enable Mamba to
learn 2D spatial representations. By implementing straightforward yet
strategically targeted modifications for visual generative tasks, we preserve
Mamba's core structure, fully exploiting its efficient long-sequence modeling
capabilities and scalability. We provide AiM models in various scales, with
parameter counts ranging from 148M to 1.3B. On the ImageNet-1K 256x256
benchmark, our best AiM model achieves an FID of 2.21, surpassing all existing
AR models of comparable parameter counts and demonstrating significant
competitiveness against diffusion models, with 2 to 10 times faster inference
speed. Code is available at https://github.com/hp-l33/AiM
comment: 9 pages, 8 figures
♻ ☆ Score-Based Multimodal Autoencoder
Multimodal Variational Autoencoders (VAEs) represent a promising group of
generative models that facilitate the construction of a tractable posterior
within the latent space given multiple modalities. Previous studies have shown
that as the number of modalities increases, the generative quality of each
modality declines. In this study, we explore an alternative approach to enhance
the generative performance of multimodal VAEs by jointly modeling the latent
space of independently trained unimodal VAEs using score-based models (SBMs).
The role of the SBM is to enforce multimodal coherence by learning the
correlation among the latent variables. Consequently, our model combines a
better generative quality of unimodal VAEs with coherent integration across
different modalities using the latent score-based model. In addition, our
approach provides the best unconditional coherence among the compared
multimodal VAE approaches.
♻ ☆ CMRNext: Camera to LiDAR Matching in the Wild for Localization and Extrinsic Calibration
LiDARs are widely used for mapping and localization in dynamic environments.
However, their high cost limits their widespread adoption. On the other hand,
monocular localization in LiDAR maps using inexpensive cameras is a
cost-effective alternative for large-scale deployment. Nevertheless, most
existing approaches struggle to generalize to new sensor setups and
environments, requiring retraining or fine-tuning. In this paper, we present
CMRNext, a novel approach for camera-LiDAR matching that is independent of
sensor-specific parameters, generalizable, and can be used in the wild for
monocular localization in LiDAR maps and camera-LiDAR extrinsic calibration.
CMRNext exploits recent advances in deep neural networks for matching
cross-modal data and standard geometric techniques for robust pose estimation.
We reformulate the point-pixel matching problem as an optical flow estimation
problem and solve the Perspective-n-Point problem based on the resulting
correspondences to find the relative pose between the camera and the LiDAR
point cloud. We extensively evaluate CMRNext on six different robotic
platforms, including three publicly available datasets and three in-house
robots. Our experimental evaluations demonstrate that CMRNext outperforms
existing approaches on both tasks and effectively generalizes to previously
unseen environments and sensor setups in a zero-shot manner. We make the code
and pre-trained models publicly available at http://cmrnext.cs.uni-freiburg.de .
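The geometric stage described above (correspondences in, pose out) can be illustrated with OpenCV's RANSAC PnP solver; the flow network that produces the matched pixels is abstracted away, and the reprojection threshold is an assumption.

```python
# Hedged sketch of the geometric stage only: given pixel locations predicted
# for a set of 3D LiDAR points (e.g., initial projections displaced by an
# estimated optical flow), recover the camera pose with RANSAC PnP.
import numpy as np
import cv2

def pose_from_correspondences(lidar_points, matched_pixels, K, dist_coeffs=None):
    """lidar_points: (N, 3) float32, matched_pixels: (N, 2) float32, K: (3, 3)."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        objectPoints=lidar_points.astype(np.float32),
        imagePoints=matched_pixels.astype(np.float32),
        cameraMatrix=K.astype(np.float32),
        distCoeffs=dist_coeffs,
        reprojectionError=2.0,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    if not ok:
        raise RuntimeError("PnP failed to find a pose")
    R, _ = cv2.Rodrigues(rvec)   # camera-from-LiDAR rotation
    return R, tvec.reshape(3), inliers
```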
♻ ☆ Addressing Attribute Leakages in Diffusion-based Image Editing without Training
Diffusion models have become a cornerstone in image editing, offering
flexibility with language prompts and source images. However, a key challenge
is attribute leakage, where unintended modifications occur in non-target
regions or within target regions due to attribute interference. Existing
methods often suffer from leakage due to naive text embeddings and inadequate
handling of End-of-Sequence (EOS) token embeddings. To address this, we propose
ALE-Edit (Attribute-leakage-free editing), a novel framework to minimize
attribute leakage with three components: (1) Object-Restricted Embeddings (ORE)
to localize object-specific attributes in text embeddings, (2) Region-Guided
Blending for Cross-Attention Masking (RGB-CAM) to align attention with target
regions, and (3) Background Blending (BB) to preserve non-edited regions.
Additionally, we introduce ALE-Bench, a benchmark for evaluating attribute
leakage with new metrics for target-external and target-internal leakage.
Experiments demonstrate that our framework significantly reduces attribute
leakage while maintaining high editing quality, providing an efficient and
tuning-free solution for multi-object image editing.
♻ ☆ Enhancing Vision-Language Model Pre-training with Image-text Pair Pruning Based on Word Frequency
We propose Word-Frequency-based Image-Text Pair Pruning (WFPP), a novel data
pruning method that improves the efficiency of VLMs. Unlike MetaCLIP, our
method does not need metadata for pruning, but selects text-image pairs to
prune based on the content of the text. Specifically, WFPP prunes text-image
pairs containing high-frequency words across the entire training dataset. The
effect of WFPP is to reduce the dominance of frequent words. The result is a
better-balanced word-frequency distribution in the dataset, which is known to
improve the training of word embedding models. After pre-training on the pruned
subset, we fine-tuned the model on the entire dataset for one additional epoch
to achieve better performance. Our experiments demonstrate that applying WFPP
when training a CLIP model improves performance on a wide range of downstream
tasks. WFPP also provides the advantage of speeding up pre-training by using
fewer samples. Additionally, we analyze the training data before and after
pruning to visualize how WFPP changes the balance of word frequencies. We hope
our work encourages researchers to consider the distribution of words in the
training data when pre-training VLMs, not limited to CLIP.
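A minimal sketch of word-frequency-based pruning follows, assuming a simple mean-word-frequency score per caption; the paper's exact scoring rule may differ.

```python
# Hedged sketch of word-frequency-based pair pruning: score each caption by how
# frequent its words are across the whole corpus and drop the highest-scoring
# pairs. The scoring rule (mean word frequency) is an illustrative assumption.
from collections import Counter

def wfpp_prune(captions, keep_ratio=0.7):
    tokenized = [c.lower().split() for c in captions]
    freq = Counter(w for toks in tokenized for w in toks)
    total = sum(freq.values())

    def score(toks):  # higher = dominated by frequent words
        return sum(freq[w] / total for w in toks) / max(len(toks), 1)

    order = sorted(range(len(captions)), key=lambda i: score(tokenized[i]))
    keep = order[: int(len(captions) * keep_ratio)]  # keep the rarer-word pairs
    return sorted(keep)  # indices of image-text pairs to retain

kept = wfpp_prune(["a dog", "a photo of a dog", "rare axolotl in a terrarium"], 0.67)
```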
♻ ☆ Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Long video understanding poses a significant challenge for current
Multi-modal Large Language Models (MLLMs). Notably, the MLLMs are constrained
by their limited context lengths and the substantial costs while processing
long videos. Although several existing methods attempt to reduce visual tokens,
their strategies encounter a severe bottleneck, restricting MLLMs' ability to
perceive fine-grained visual details. In this work, we propose Video-XL, a
novel approach that leverages MLLMs' inherent key-value (KV) sparsification
capacity to condense the visual input. Specifically, we introduce a new special
token, the Visual Summarization Token (VST), for each interval of the video,
which summarizes the visual information within the interval as its associated
KV. The VST module is trained by instruction fine-tuning, with two optimization
strategies: (1) curriculum learning, where the VST progressively learns small
(easy) and then large (hard) compression; and (2) composite data curation,
which integrates single-image, multi-image, and synthetic data to overcome the
scarcity of long-video instruction data. The compression quality is further
improved by dynamic compression, which customizes compression granularity based
on the information density of different video intervals. Video-XL's
effectiveness is verified from three aspects. First, it achieves a superior
long-video understanding capability, outperforming state-of-the-art models of
comparable sizes across multiple popular benchmarks. Second, it effectively
preserves video information, with minimal compression loss even at 16x
compression ratio. Third, it realizes outstanding cost-effectiveness, enabling
high-quality processing of thousands of frames on a single A100 GPU.
♻ ☆ 3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning
Constructing compact and informative 3D scene representations is essential
for effective embodied exploration and reasoning, especially in complex
environments over extended periods. Existing representations, such as
object-centric 3D scene graphs, oversimplify spatial relationships by modeling
scenes as isolated objects with restrictive textual relationships, making it
difficult to address queries requiring nuanced spatial understanding. Moreover,
these representations lack natural mechanisms for active exploration and memory
management, hindering their application to lifelong autonomy. In this work, we
propose 3D-Mem, a novel 3D scene memory framework for embodied agents. 3D-Mem
employs informative multi-view images, termed Memory Snapshots, to represent
the scene and capture rich visual information of explored regions. It further
integrates frontier-based exploration by introducing Frontier Snapshots,
glimpses of unexplored areas, enabling agents to make informed
decisions by considering both known and potential new information. To support
lifelong memory in active exploration settings, we present an incremental
construction pipeline for 3D-Mem, as well as a memory retrieval technique for
memory management. Experimental results on three benchmarks demonstrate that
3D-Mem significantly enhances agents' exploration and reasoning capabilities in
3D environments, highlighting its potential for advancing applications in
embodied AI.
♻ ☆ XAMI -- A Benchmark Dataset for Artefact Detection in XMM-Newton Optical Images SP
Reflected or scattered light produces artefacts in astronomical observations
that can negatively impact scientific analysis. Hence, automated detection of
these artefacts is highly beneficial, especially with the increasing amounts of
data gathered. Machine learning methods are well-suited to this problem, but
currently there is a lack of annotated data to train such approaches to detect
artefacts in astronomical observations. In this work, we present a dataset of
images from the XMM-Newton space telescope Optical Monitoring camera showing
different types of artefacts. We hand-annotated a sample of 1000 images with
artefacts which we use to train automated ML methods. We further demonstrate
techniques tailored for accurate detection and masking of artefacts using
instance segmentation. We adopt a hybrid approach, combining knowledge from
both convolutional neural networks (CNNs) and transformer-based models and use
their advantages in segmentation. The presented method and dataset will advance
artefact detection in astronomical observations by providing a reproducible
baseline. All code and data are made available
(https://github.com/ESA-Datalabs/XAMI-model and
https://github.com/ESA-Datalabs/XAMI-dataset).
comment: Accepted for oral presentation at SPAICE 2024
♻ ☆ CrossSDF: 3D Reconstruction of Thin Structures From Cross-Sections
Thomas Walker, Salvatore Esposito, Daniel Rebain, Amir Vaxman, Arno Onken, Changjian Li, Oisin Mac Aodha
Reconstructing complex structures from planar cross-sections is a challenging
problem, with wide-reaching applications in medical imaging, manufacturing, and
topography. Out-of-the-box point cloud reconstruction methods can often fail
due to the data sparsity between slicing planes, while current bespoke methods
struggle to reconstruct thin geometric structures and preserve topological
continuity. This is important for medical applications where thin vessel
structures are present in CT and MRI scans. This paper introduces CrossSDF, a
novel approach for extracting a 3D signed distance field from 2D signed
distances generated from planar contours. Our approach makes the training of
neural SDFs contour-aware by using losses designed for the case where geometry
is known within 2D slices. Our results demonstrate a significant improvement
over existing methods, effectively reconstructing thin structures and producing
accurate 3D models without the interpolation artifacts or over-smoothing of
prior approaches.
♻ ☆ Maia: A Real-time Non-Verbal Chat for Human-AI Interaction
Modeling face-to-face communication in computer vision, which focuses on
recognizing and analyzing nonverbal cues and behaviors during interactions,
serves as the foundation for our proposed alternative to text-based Human-AI
interaction. By leveraging nonverbal visual communication, through facial
expressions, head and body movements, we aim to enhance engagement and capture
the user's attention through a novel improvisational element that goes beyond
mirroring gestures. Our goal is to track and analyze facial expressions, and
other nonverbal cues in real-time, and use this information to build models
that can predict and understand human behavior. Operating in real-time and
requiring minimal computational resources, our approach signifies a major leap
forward in making AI interactions more natural and accessible. We offer three
different complementary approaches, based on retrieval, statistical, and deep
learning techniques. A key novelty of our work is the integration of an
artistic component atop an efficient human-computer interaction system, using
art as a medium to transmit emotions. Our approach is not art-specific and can
be adapted to various paintings, animations, and avatars. In our experiments,
we compare state-of-the-art diffusion models as mediums for emotion translation
in 2D with our 3D avatar, Maia, introduced in this work, which features not
just facial movements but also body motions for a more natural and engaging
experience. We demonstrate the effectiveness of our approach in translating
AI-generated emotions into human-relatable expressions, through both human and
automatic evaluation procedures, highlighting its potential to significantly
enhance the naturalness and engagement of Human-AI interactions across various
applications.
comment: 11 pages, 7 figures
♻ ☆ pfl-research: simulation framework for accelerating research in Private Federated Learning
Filip Granqvist, Congzheng Song, Áine Cahill, Rogier van Dalen, Martin Pelikan, Yi Sheng Chan, Xiaojun Feng, Natarajan Krishnaswami, Vojta Jina, Mona Chitnis
Federated learning (FL) is an emerging machine learning (ML) training
paradigm where clients own their data and collaborate to train a global model,
without revealing any data to the server and other participants. Researchers
commonly perform experiments in a simulation environment to quickly iterate on
ideas. However, existing open-source tools do not offer the efficiency required
to simulate FL on larger and more realistic FL datasets. We introduce
pfl-research, a fast, modular, and easy-to-use Python framework for simulating
FL. It supports TensorFlow, PyTorch, and non-neural network models, and is
tightly integrated with state-of-the-art privacy algorithms. We study the speed
of open-source FL frameworks and show that pfl-research is 7-72$\times$ faster
than alternative open-source frameworks on common cross-device setups. Such
speedup will significantly boost the productivity of the FL research community
and enable testing hypotheses on realistic FL datasets that were previously too
resource intensive. We release a suite of benchmarks that evaluates an
algorithm's overall performance on a diverse set of realistic scenarios. The
code is available on GitHub at https://github.com/apple/pfl-research.
♻ ☆ HiFiVFS: High Fidelity Video Face Swapping
Face swapping aims to generate results that combine the identity from the
source with attributes from the target. Existing methods primarily focus on
image-based face swapping. When processing videos, each frame is handled
independently, making it difficult to ensure temporal stability. From a model
perspective, face swapping is gradually shifting from generative adversarial
networks (GANs) to diffusion models (DMs), as DMs have been shown to possess
stronger generative capabilities. Current diffusion-based approaches often
employ inpainting techniques, which struggle to preserve fine-grained
attributes like lighting and makeup. To address these challenges, we propose a
high fidelity video face swapping (HiFiVFS) framework, which leverages the
strong generative capability and temporal prior of Stable Video Diffusion
(SVD). We build a fine-grained attribute module to extract
identity-disentangled and fine-grained attribute features through identity
desensitization and adversarial learning. Additionally, we introduce detailed
identity injection to further enhance identity similarity. Extensive
experiments demonstrate that our method achieves state-of-the-art (SOTA) in
video face swapping, both qualitatively and quantitatively.
♻ ☆ RoboMatrix: A Skill-centric Hierarchical Framework for Scalable Robot Task Planning and Execution in Open-World
Weixin Mao, Weiheng Zhong, Zhou Jiang, Dong Fang, Zhongyue Zhang, Zihan Lan, Fan Jia, Tiancai Wang, Haoqiang Fan, Osamu Yoshie
Existing policy learning methods predominantly adopt the task-centric
paradigm, necessitating the collection of task data in an end-to-end manner.
Consequently, the learned policy tends to fail to tackle novel tasks. Moreover,
it is hard to localize the errors for a complex task with multiple stages due
to end-to-end learning. To address these challenges, we propose RoboMatrix, a
skill-centric and hierarchical framework for scalable task planning and
execution. We first introduce a novel skill-centric paradigm that extracts the
common meta-skills from different complex tasks. This allows for the capture of
embodied demonstrations through a skill-centric approach, enabling the
completion of open-world tasks by combining learned meta-skills. To fully
leverage meta-skills, we further develop a hierarchical framework that
decouples complex robot tasks into three interconnected layers: (1) a
high-level modular scheduling layer; (2) a middle-level skill layer; and (3) a
low-level hardware layer. Experimental results illustrate that our
skill-centric and hierarchical framework achieves remarkable generalization
performance across novel objects, scenes, tasks, and embodiments. This
framework offers a novel solution for robot task planning and execution in
open-world scenarios. Our software and hardware are available at
https://github.com/WayneMao/RoboMatrix.
comment: 17 pages, 16 figures
♻ ☆ MOANA: Multi-Radar Dataset for Maritime Odometry and Autonomous Navigation Application
Hyesu Jang, Wooseong Yang, Hanguen Kim, Dongje Lee, Yongjin Kim, Jinbum Park, Minsoo Jeon, Jaeseong Koh, Yejin Kang, Minwoo Jung, Sangwoo Jung, Ayoung Kim
Maritime environmental sensing requires overcoming challenges from complex
conditions such as harsh weather, platform perturbations, large dynamic
objects, and the requirement for long detection ranges. While cameras and LiDAR
are commonly used in ground vehicle navigation, their applicability in maritime
settings is limited by range constraints and hardware maintenance issues. Radar
sensors, however, offer robust long-range detection capabilities and resilience
to physical contamination from weather and saline conditions, making them a
powerful sensing modality for maritime navigation. Among various radar types, X-band
radar (e.g., marine radar) is widely employed for maritime vessel navigation,
providing effective long-range detection essential for situational awareness
and collision avoidance. Nevertheless, it exhibits limitations during berthing
operations where close-range object detection is critical. To address this
shortcoming, we incorporate W-band radar (e.g., Navtech imaging radar), which
excels in detecting nearby objects with a higher update rate. We present a
comprehensive maritime sensor dataset featuring multi-range detection
capabilities. This dataset integrates short-range LiDAR data, medium-range
W-band radar data, and long-range X-band radar data into a unified framework.
Additionally, it includes object labels for oceanic object detection usage,
derived from radar and stereo camera images. The dataset comprises seven
sequences collected from diverse regions with varying levels of estimation
difficulty, ranging from easy to challenging, and includes common locations
suitable for global localization tasks. This dataset serves as a valuable
resource for advancing research in place recognition, odometry estimation,
SLAM, object detection, and dynamic object elimination within maritime
environments. Dataset can be found in following link:
https://sites.google.com/view/rpmmoana
comment: We encountered a regulation issue with the Singapore government and
paused the release for a while. We are working on resolving the issue. Thank
you for your understanding
♻ ☆ Local-to-Global Self-Supervised Representation Learning for Diabetic Retinopathy Grading
Artificial intelligence algorithms have demonstrated their image
classification and segmentation ability in the past decade. However, artificial
intelligence algorithms perform worse on actual clinical data than on the data
used in simulations. This research aims to present a novel hybrid learning model
using self-supervised learning and knowledge distillation, which can achieve
sufficient generalization and robustness. The self-attention mechanism and
tokens employed in ViT, together with the local-to-global learning approach used in
the hybrid model, enable the proposed algorithm to extract a high-dimensional
and high-quality feature space from images. To demonstrate the proposed neural
network's capability in classifying and extracting feature spaces from medical
images, we use it on a dataset of Diabetic Retinopathy images, specifically the
EyePACS dataset. This dataset is more complex structurally and challenging
regarding damaged areas than other medical images. This study is the first to
use self-supervised learning and knowledge distillation to classify this
dataset. Moreover, unlike prior self-supervised learning and knowledge
distillation models, our algorithm uses a test dataset 50% larger than
the training dataset. Unlike many studies, we have not removed any images from
the dataset. Finally, our algorithm achieved an accuracy of 79.1% in the linear
classifier and 74.36% in the k-NN algorithm for multiclass classification.
Compared to a similar state-of-the-art model, our results achieved higher
accuracy and more effective representation spaces.
♻ ☆ Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment
The visual quality of an image is confounded by a number of intertwined
factors including its semantic content, distortion characteristics and
appearance properties such as brightness, contrast, sharpness, and
colourfulness. Distilling high level knowledge about all these quality bearing
attributes is crucial for developing objective Image Quality Assessment
(IQA). While existing solutions have modeled some of these aspects, a
comprehensive solution that involves all these important quality related
attributes has not yet been developed. In this paper, we present a new blind
IQA (BIQA) model termed Self-supervision and Vision-Language supervision Image
QUality Evaluator (SLIQUE) that features a joint vision-language and visual
contrastive representation learning framework for acquiring high level
knowledge about the images' semantic contents, distortion characteristics and
appearance properties for IQA. For training SLIQUE, we have developed a
systematic approach to constructing a first of its kind large image database
annotated with all three categories of quality relevant texts. The Text
Annotated Distortion, Appearance and Content (TADAC) database has over 1.6
million images annotated with textual descriptions of their semantic contents,
distortion characteristics and appearance properties. The method for
constructing TADAC and the database itself will be particularly useful for
exploiting vision-language modeling for advanced IQA applications. Extensive
experimental results show that SLIQUE has superior performances over state of
the art, demonstrating the soundness of its design principle and the
effectiveness of its implementation.
comment: This paper has been accepted for publication in Journal of Selected
Topics in Signal Processing, © 2024 IEEE. Personal use is
permitted. For other uses, permission must be obtained from IEEE
♻ ☆ Detecting and Corrupting Convolution-based Unlearnable Examples AAAI 2025
Convolution-based unlearnable examples (UEs) apply class-wise multiplicative
convolutional noise to training samples, severely compromising model
performance. This new type of UE has successfully countered all defense
mechanisms against UEs. The failure of such defenses can be attributed to the
absence of norm constraints on convolutional noise, leading to severe blurring
of image features. To address this, we first design an Edge Pixel-based
Detector (EPD) to identify convolution-based UEs. Upon detection of them, we
propose the first defense scheme against convolution-based UEs, COrrupting
these samples via random matrix multiplication by employing bilinear
INterpolation (COIN), thereby disrupting the distribution of class-wise
multiplicative noise. To evaluate the generalization of our proposed COIN, we
newly design two convolution-based UEs called VUDA and HUDA to expand the scope
of convolution-based UEs. Extensive experiments demonstrate the effectiveness
of detection scheme EPD and that our defense COIN outperforms 11
state-of-the-art (SOTA) defenses, achieving a significant improvement on the
CIFAR and ImageNet datasets.
comment: AAAI 2025
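One plausible reading of "random matrix multiplication via bilinear interpolation" is random bilinear resampling, which applies a random linear transform to the pixels; the sketch below is that interpretation only, not necessarily the paper's exact procedure.

```python
# Hedged sketch of one plausible reading of the COIN-style corruption: resample
# each image to a random size with bilinear interpolation and back, disturbing
# class-wise multiplicative noise. This is an interpretation for illustration.
import torch
import torch.nn.functional as F

def random_bilinear_corrupt(images, scale_range=(0.5, 1.5)):
    """images: [B, C, H, W] float tensor."""
    b, c, h, w = images.shape
    s = torch.empty(1).uniform_(*scale_range).item()
    new_hw = (max(int(h * s), 8), max(int(w * s), 8))
    down = F.interpolate(images, size=new_hw, mode="bilinear", align_corners=False)
    return F.interpolate(down, size=(h, w), mode="bilinear", align_corners=False)
```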
♻ ☆ IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation
We introduce a novel approach for high-resolution talking head generation
from a single image and audio input. Prior methods using explicit face models,
like 3D morphable models (3DMM) and facial landmarks, often fall short in
generating high-fidelity videos due to their lack of appearance-aware motion
representation. While generative approaches such as video diffusion models
achieve high video quality, their slow processing speeds limit practical
application. Our proposed model, Implicit Face Motion Diffusion Model (IF-MDM),
employs implicit motion to encode human faces into appearance-aware compressed
facial latents, enhancing video generation. Although implicit motion lacks the
spatial disentanglement of explicit models, which complicates alignment with
subtle lip movements, we introduce motion statistics to help capture
fine-grained motion information. Additionally, our model provides motion
controllability to optimize the trade-off between motion intensity and visual
quality during inference. IF-MDM supports real-time generation of 512x512
resolution videos at up to 45 frames per second (fps). Extensive evaluations
demonstrate its superior performance over existing diffusion and explicit face
models. The code will be released publicly, available alongside supplementary
materials. The video results can be found on
https://bit.ly/ifmdm_supplementary.
comment: Under review
♻ ☆ PVG: Progressive Vision Graph for Vision Recognition ACM MM 2023
Convolution-based and Transformer-based vision backbone networks process
images into the grid or sequence structures, respectively, which are inflexible
for capturing irregular objects. Though Vision GNN (ViG) adopts graph-level
features for complex images, it has some issues, such as inaccurate neighbor
node selection, expensive node information aggregation calculation, and
over-smoothing in the deep layers. To address the above problems, we propose a
Progressive Vision Graph (PVG) architecture for vision recognition task.
Compared with previous works, PVG contains three main components: 1)
Progressively Separated Graph Construction (PSGC) to introduce second-order
similarity by gradually increasing the channel of the global graph branch and
decreasing the channel of the local branch as the layer deepens; 2) a neighbor
node information aggregation and update module using Max pooling and mathematical
Expectation (MaxE) to aggregate rich neighbor information; 3) Graph error
Linear Unit (GraphLU) to enhance low-value information in a relaxed form to
reduce the compression of image detail information for alleviating the
over-smoothing. Extensive experiments on mainstream benchmarks demonstrate the
superiority of PVG over state-of-the-art methods, e.g., our PVG-S obtains 83.0%
Top-1 accuracy on ImageNet-1K, surpassing GNN-based ViG-S by +0.9 with 18.5%
fewer parameters, while the largest PVG-B obtains 84.2%, a +0.5 improvement
over ViG-B. Furthermore, our PVG-S obtains +1.3 box AP and +0.4 mask AP gains
over ViG-S on the COCO dataset.
comment: Accepted by ACM MM 2023
♻ ☆ TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action
Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, Silvio Savarese
While open-source multi-modal language models perform well on simple question
answering tasks, they often fail on complex questions that require multiple
capabilities, such as fine-grained recognition, visual grounding, and
reasoning, and that demand multi-step solutions. We present TACO, a family of
multi-modal large action models designed to improve performance on such
complex, multi-step, and multi-modal tasks. During inference, TACO produces
chains-of-thought-and-action (CoTA), executes intermediate steps by invoking
external tools such as OCR, depth estimation and calculator, then integrates
both the thoughts and action outputs to produce coherent responses. To train
TACO, we create a large dataset of over 1M synthetic CoTA traces generated with
GPT-4o and Python programs. We then experiment with various data filtering and
mixing techniques and obtain a final subset of 293K high-quality CoTA examples.
This dataset enables TACO to learn complex reasoning and action paths,
surpassing existing models trained on instruction tuning data with only direct
answers. Our model TACO outperforms the instruction-tuned baseline across 8
benchmarks, achieving a 3.6% improvement on average, with gains of up to 15% in
MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning.
Training on high-quality CoTA traces sets a new standard for complex
multi-modal reasoning, highlighting the need for structured, multi-step
instruction tuning in advancing open-source multi-modal models' capabilities.
♻ ☆ NewMove: Customizing text-to-video models with novel motions
We introduce an approach for augmenting text-to-video generation models with
customized motions, extending their capabilities beyond the motions depicted in
the original training data. By leveraging a few video samples demonstrating
specific movements as input, our method learns and generalizes the input motion
patterns for diverse, text-specified scenarios. Our contributions are
threefold. First, to achieve our results, we finetune an existing text-to-video
model to learn a novel mapping between the depicted motion in the input
examples to a new unique token. To avoid overfitting to the new custom motion,
we introduce an approach for regularization over videos. Second, by leveraging
the motion priors in a pretrained model, our method can produce novel videos
featuring multiple people doing the custom motion, and can invoke the motion in
combination with other motions. Furthermore, our approach extends to the
multimodal customization of motion and appearance of individualized subjects,
enabling the generation of videos featuring unique characters and distinct
motions. Third, to validate our method, we introduce an approach for
quantitatively evaluating the learned custom motion and perform a systematic
ablation study. We show that our method significantly outperforms prior
appearance-based customization approaches when extended to the motion
customization task.
comment: Project page: https://joaanna.github.io/customizing_motion/
♻ ☆ HAAT: Hybrid Attention Aggregation Transformer for Image Super-Resolution
In the research area of image super-resolution, Swin-transformer-based models
are favored for their global spatial modeling and shifting window attention
mechanism. However, existing methods often limit self-attention to
non-overlapping windows to cut costs and ignore the useful information that exists
across channels. To address this issue, this paper introduces a novel model,
the Hybrid Attention Aggregation Transformer (HAAT), designed to better
leverage feature information. HAAT is constructed by integrating
Swin-Dense-Residual-Connected Blocks (SDRCB) with Hybrid Grid Attention Blocks
(HGAB). SDRCB expands the receptive field while maintaining a streamlined
architecture, resulting in enhanced performance. HGAB incorporates channel
attention, sparse attention, and window attention to improve nonlocal feature
fusion and achieve more visually compelling results. Experimental evaluations
demonstrate that HAAT surpasses state-of-the-art methods on benchmark datasets.
Keywords: Image super-resolution, Computer vision, Attention mechanism,
Transformer
comment: 6 pages, 2 figures, 1 table
♻ ☆ Distribution-Level Feature Distancing for Machine Unlearning: Towards a Better Trade-off Between Model Utility and Forgetting AAAI 2025
With the explosive growth of deep learning applications and increasing
privacy concerns, the right to be forgotten has become a critical requirement
in various AI industries. For example, given a facial recognition system, some
individuals may wish to remove their personal data that might have been used in
the training phase. Unfortunately, deep neural networks sometimes unexpectedly
leak personal identities, making this removal challenging. While recent machine
unlearning algorithms aim to enable models to forget such data, we observe an
unintended utility drop, termed correlation collapse, where these algorithms
inadvertently weaken the essential correlations between image features and true
labels during the forgetting process. To address this challenge, we propose
Distribution-Level Feature Distancing (DLFD), a novel method that efficiently
forgets instances while preserving task-relevant feature correlations. Our
method synthesizes data samples by optimizing the feature distribution to be
distinctly different from that of forget samples, achieving effective results
within a single training epoch. Through extensive experiments on facial
recognition datasets, we demonstrate that our approach significantly
outperforms state-of-the-art machine unlearning methods in both forgetting
performance and model utility preservation.
comment: 10 pages, 6 figures, AAAI 2025 camera ready version
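To illustrate the distribution-level idea, a heavily simplified sketch follows: perturb retained images so their batch feature statistics move away from the forget set's, while a task loss preserves utility. The `model.features`/`model.classifier` split, the mean-feature distance, and the FGSM-style update are all assumptions, not the paper's actual objective.

```python
# Hedged sketch of the distribution-level idea only; the paper's optimization
# details differ.
import torch
import torch.nn.functional as F

def dlfd_step(model, retain_x, retain_y, forget_feats_mean, lr=0.01, lam=1.0):
    """One synthesis step: returns perturbed retain images."""
    x = retain_x.clone().requires_grad_(True)
    feats = model.features(x)          # hypothetical feature-extractor head
    logits = model.classifier(feats)   # hypothetical classifier head
    task_loss = F.cross_entropy(logits, retain_y)
    # Push the batch's mean feature away from the forget-set mean feature.
    dist = F.mse_loss(feats.mean(dim=0), forget_feats_mean)
    loss = task_loss - lam * dist
    loss.backward()
    with torch.no_grad():
        x -= lr * x.grad.sign()        # FGSM-style image update (assumption)
    return x.detach()
```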
♻ ☆ Trusted Unified Feature-Neighborhood Dynamics for Multi-View Classification AAAI 2025
Haojian Huang, Chuanyu Qin, Zhe Liu, Kaijing Ma, Jin Chen, Han Fang, Chao Ban, Hao Sun, Zhongjiang He
Multi-view classification (MVC) faces inherent challenges due to domain gaps
and inconsistencies across different views, often resulting in uncertainties
during the fusion process. While Evidential Deep Learning (EDL) has been
effective in addressing view uncertainty, existing methods predominantly rely
on the Dempster-Shafer combination rule, which is sensitive to conflicting
evidence and often neglects the critical role of neighborhood structures within
multi-view data. To address these limitations, we propose a Trusted Unified
Feature-NEighborhood Dynamics (TUNED) model for robust MVC. This method
effectively integrates local and global feature-neighborhood (F-N) structures
for robust decision-making. Specifically, we begin by extracting local F-N
structures within each view. To further mitigate potential uncertainties and
conflicts in multi-view fusion, we employ a selective Markov random field that
adaptively manages cross-view neighborhood dependencies. Additionally, we
employ a shared parameterized evidence extractor that learns global consensus
conditioned on local F-N structures, thereby enhancing the global integration
of multi-view features. Experiments on benchmark datasets show that our method
improves accuracy and robustness over existing approaches, particularly in
scenarios with high uncertainty and conflicting views. The code will be made
available at https://github.com/JethroJames/TUNED.
comment: Accepted to AAAI 2025
♻ ☆ Color-Oriented Redundancy Reduction in Dataset Distillation NeurIPS 2024
Dataset Distillation (DD) is designed to generate condensed representations
of extensive image datasets, enhancing training efficiency. Despite recent
advances, there remains considerable potential for improvement, particularly in
addressing the notable redundancy within the color space of distilled images.
In this paper, we propose AutoPalette, a framework that minimizes color
redundancy at the individual image and overall dataset levels, respectively. At
the image level, we employ a palette network, a specialized neural network, to
dynamically allocate colors from a reduced color space to each pixel. The
palette network identifies essential areas in synthetic images for model
training and consequently assigns more unique colors to them. At the dataset
level, we develop a color-guided initialization strategy to minimize redundancy
among images. Representative images with the least replicated color patterns
are selected based on the information gain. A comprehensive performance study
involving various datasets and evaluation scenarios is conducted, demonstrating
the superior performance of our proposed color-aware DD compared to existing DD
methods. The code is available at
\url{https://github.com/KeViNYuAn0314/AutoPalette}.
comment: 38th Conference on Neural Information Processing Systems (NeurIPS
2024)
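As a rough stand-in for the dataset-level initialization, one could rank candidate images by the entropy of their quantized color histograms and keep the most color-diverse ones; the paper's information-gain criterion and the palette network are not reproduced here.

```python
# Hedged sketch: score candidate images by quantized-color-histogram entropy
# and keep the least color-redundant ones for initialization.
import numpy as np

def color_entropy(image, bins_per_channel=8):
    """image: HxWx3 uint8 array; entropy of its quantized color histogram."""
    q = (image // (256 // bins_per_channel)).reshape(-1, 3)
    codes = q[:, 0] * bins_per_channel**2 + q[:, 1] * bins_per_channel + q[:, 2]
    counts = np.bincount(codes, minlength=bins_per_channel**3).astype(np.float64)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def select_representatives(images, k):
    scores = [color_entropy(img) for img in images]
    return list(np.argsort(scores)[::-1][:k])  # k most color-diverse images
```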
♻ ☆ DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming AAAI 2025
Current multimodal large language models (MLLMs) face significant challenges
in visual document understanding (VDU) tasks due to the high resolution, dense
text, and complex layouts typical of document images. These characteristics
demand a high level of detail perception ability from MLLMs. While increasing
input resolution improves detail perception capability, it also leads to longer
sequences of visual tokens, increasing computational costs and straining the
models' ability to handle long contexts. To address these challenges, we
introduce DocKylin, a document-centric MLLM that performs visual content
slimming at both the pixel and token levels, thereby reducing token sequence
length in VDU scenarios. We introduce an Adaptive Pixel Slimming (APS)
preprocessing module to perform pixel-level slimming, increasing the proportion
of informative pixels. Moreover, we propose a novel Dynamic Token Slimming
(DTS) module to conduct token-level slimming, filtering essential tokens and
removing others to adaptively create a more compact visual sequence.
Experiments demonstrate DocKylin's promising performance across various VDU
benchmarks and the effectiveness of each component.
comment: Accepted by AAAI 2025
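In the spirit of pixel-level slimming, a simple sketch is to drop rows and columns with low gradient energy (mostly blank margins and whitespace), which raises the proportion of informative pixels before tokenization; the gradient proxy and threshold below are illustrative assumptions.

```python
# Hedged sketch of a pixel-slimming pass: remove near-blank rows and columns.
import numpy as np

def pixel_slim(gray, keep_thresh=0.02):
    """gray: HxW float array in [0, 1]; returns the slimmed image."""
    gy = np.abs(np.diff(gray, axis=0, prepend=gray[:1]))
    gx = np.abs(np.diff(gray, axis=1, prepend=gray[:, :1]))
    energy = gx + gy
    row_keep = energy.mean(axis=1) > keep_thresh
    col_keep = energy.mean(axis=0) > keep_thresh
    # Always keep at least one row/column to avoid empty outputs.
    if not row_keep.any():
        row_keep[:] = True
    if not col_keep.any():
        col_keep[:] = True
    return gray[row_keep][:, col_keep]
```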
♻ ☆ LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition WACV 2025
Group Activity Recognition (GAR) remains challenging in computer vision due
to the complex nature of multi-agent interactions. This paper introduces LiGAR,
a LIDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity
Recognition. LiGAR leverages LiDAR data as a structural backbone to guide the
processing of visual and textual information, enabling robust handling of
occlusions and complex spatial arrangements. Our framework incorporates a
Multi-Scale LIDAR Transformer, Cross-Modal Guided Attention, and an Adaptive
Fusion Module to integrate multi-modal data at different semantic levels
effectively. LiGAR's hierarchical architecture captures group activities at
various granularities, from individual actions to scene-level dynamics.
Extensive experiments on the JRDB-PAR, Volleyball, and NBA datasets demonstrate
LiGAR's superior performance, achieving state-of-the-art results with
improvements of up to 10.6% in F1-score on JRDB-PAR and 5.9% in Mean Per Class
Accuracy on the NBA dataset. Notably, LiGAR maintains high performance even
when LiDAR data is unavailable during inference, showcasing its adaptability.
Our ablation studies highlight the significant contributions of each component
and the effectiveness of our multi-modal, multi-scale approach in advancing the
field of group activity recognition.
comment: Accepted at WACV 2025; 14 pages, 4 figures, 10 tables
♻ ☆ CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, LianWen Jin, Junyang Lin
Large Multimodal Models (LMMs) have demonstrated impressive performance in
recognizing document images with natural language instructions. However, it
remains unclear to what extent their capabilities extend to literacy tasks with
rich structure and fine-grained visual challenges. The current landscape lacks a comprehensive
benchmark to effectively measure the literate capabilities of LMMs. Existing
benchmarks are often limited by narrow scenarios and specified tasks. To this
end, we introduce CC-OCR, a comprehensive benchmark that possesses a diverse
range of scenarios, tasks, and challenges. CC-OCR comprises four OCR-centric
tracks: multi-scene text reading, multilingual text reading, document parsing,
and key information extraction. It includes 39 subsets with 7,058 fully
annotated images, of which 41% are sourced from real applications and released
for the first time. We evaluate nine prominent LMMs and reveal both the
strengths and weaknesses of these models, particularly in text grounding,
multi-orientation, and hallucination of repetition. CC-OCR aims to
comprehensively evaluate the capabilities of LMMs on OCR-centered tasks,
facilitating continued progress in this crucial area.
comment: 23 pages, 14 figures; The code will be released soon
♻ ☆ HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting
Generating high-quality novel view renderings of 3D Gaussian Splatting (3DGS)
in scenes featuring transient objects is challenging. We propose a novel hybrid
representation, termed as HybridGS, using 2D Gaussians for transient objects
per image and maintaining traditional 3D Gaussians for the whole static scenes.
Note that 3DGS itself is better suited for modeling static scenes that
assume multi-view consistency, whereas transient objects appear only
occasionally and do not adhere to this assumption; we therefore model them as
planar objects from a single view, represented with 2D Gaussians. Our novel representation
decomposes the scene from the perspective of fundamental viewpoint consistency,
making it more reasonable. Additionally, we present a novel multi-view
regulated supervision method for 3DGS that leverages information from
co-visible regions, further enhancing the distinctions between the transients
and statics. Then, we propose a straightforward yet effective multi-stage
training strategy to ensure robust training and high-quality view synthesis
across various settings. Experiments on benchmark datasets show our
state-of-the-art performance of novel view synthesis in both indoor and outdoor
scenes, even in the presence of distracting elements.
comment: Project page: https://gujiaqivadin.github.io/hybridgs/
♻ ☆ Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning
Zero-shot learning (ZSL) endeavors to transfer knowledge from seen categories
to recognize unseen categories, which mostly relies on the semantic-visual
interactions between image and attribute tokens. Recently, prompt learning has
emerged in ZSL and demonstrated significant potential as it allows the
zero-shot transfer of diverse visual concepts to downstream tasks. However,
current methods explore the fixed adaption of learnable prompt on seen domains,
which makes them over-emphasize the primary visual features observed during
training, limiting their generalization capabilities to unseen domains. In this
work, we propose AENet, which endows semantic information into the visual
prompt to distill semantic-enhanced prompt for visual representation
enrichment, enabling effective knowledge transfer for ZSL. AENet comprises two
key steps: 1) exploring the concept-harmonized tokens for the visual and
attribute modalities, grounded on the modal-sharing token that represents
consistent visual-semantic concepts; and 2) yielding semantic-enhanced prompt
via the visual residual refinement unit with attribute consistency supervision.
These are further integrated with primary visual features to attend to
semantic-related information for visual enhancement, thus strengthening
transferable ability. Experimental results on three benchmarks show that our
AENet outperforms existing state-of-the-art ZSL methods. The code is provided
in the zip file of supplementary materials.
♻ ☆ On Representation Learning with Feedback
This note complements the author's recent paper "Robust representation
learning with feedback for single image deraining" by providing heuristically
theoretical explanations on the mechanism of representation learning with
feedback, namely an essential merit of the works presented in this recent
article. This note facilitates understanding of key points in the mechanism of
representation learning with feedback.
♻ ☆ SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers
Image classification is a computer vision task where a model analyzes an
image to categorize it into a specific label. Vision Transformers (ViT) improve
this task by leveraging self-attention to capture complex patterns and
long-range relationships between image patches. However, a key challenge for ViTs is
efficiently incorporating multiscale feature representations, which is inherent
in CNNs through their hierarchical structure. In this paper, we introduce the
Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework
that addresses this challenge by integrating multi-scale features. Using
EfficientNet as a backbone, the model extracts multi-scale feature maps, which
are divided into patches to preserve semantic information. These patches are
organized into a graph based on spatial and feature similarities, with a Graph
Attention Network (GAT) refining the node embeddings. Finally, a Transformer
encoder captures long-range dependencies and complex interactions. The SAG-ViT
is evaluated on benchmark datasets, demonstrating its effectiveness in
enhancing image classification performance. Our code and weights are publicly
available at https://github.com/shravan-18/SAG-ViT
comment: 10 pages, 4 figures, 3 tables
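The graph-construction step can be sketched as a k-nearest-neighbor graph over patch embeddings by cosine similarity; the spatial-similarity term, the GAT refinement, and the Transformer encoder are omitted, and the edge-index format merely follows common graph-library conventions.

```python
# Hedged sketch of the graph-construction step only.
import torch

def build_patch_graph(patch_embeddings, k=8):
    """patch_embeddings: [N, D]; returns edge_index of shape [2, N*k]."""
    z = torch.nn.functional.normalize(patch_embeddings, dim=1)
    sim = z @ z.t()                             # cosine similarity
    sim.fill_diagonal_(-float("inf"))           # no self-loops
    neighbors = sim.topk(k, dim=1).indices      # [N, k]
    src = torch.arange(z.size(0)).repeat_interleave(k)
    dst = neighbors.reshape(-1)
    return torch.stack([src, dst], dim=0)       # usable with e.g. a GAT layer

edge_index = build_patch_graph(torch.randn(196, 256), k=8)
```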
♻ ☆ REBEL: Reinforcement Learning via Regressing Relative Rewards
Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun
While originally developed for continuous control problems, Proximal Policy
Optimization (PPO) has emerged as the work-horse of a variety of reinforcement
learning (RL) applications, including the fine-tuning of generative models.
Unfortunately, PPO requires multiple heuristics to enable stable convergence
(e.g. value networks, clipping), and is notorious for its sensitivity to the
precise implementation of these components. In response, we take a step back
and ask what a minimalist RL algorithm for the era of generative models would
look like. We propose REBEL, an algorithm that cleanly reduces the problem of
policy optimization to regressing the relative reward between two completions
to a prompt in terms of the policy, enabling strikingly lightweight
implementation. In theory, we prove that fundamental RL algorithms like Natural
Policy Gradient can be seen as variants of REBEL, which allows us to match the
strongest known theoretical guarantees in terms of convergence and sample
complexity in the RL literature. REBEL can also cleanly incorporate offline
data and be extended to handle the intransitive preferences we frequently see
in practice. Empirically, we find that REBEL provides a unified approach to
language modeling and image generation with stronger or similar performance as
PPO and DPO, all while being simpler to implement and more computationally
efficient than PPO. When fine-tuning Llama-3-8B-Instruct, REBEL achieves strong
performance in AlpacaEval 2.0, MT-Bench, and Open LLM Leaderboard.
comment: New experimental results on general chat
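The core regression objective, as commonly stated for REBEL, can be sketched as a least-squares fit of the scaled policy log-ratio difference to the reward difference for a pair of completions; variable names, the eta scaling, and batching are illustrative assumptions.

```python
# Hedged sketch of a REBEL-style regression loss: for a pair of completions
# (y1, y2) to the same prompt, regress the scaled difference of policy
# log-ratios onto the reward difference.
import torch

def rebel_loss(logp_new_1, logp_old_1, logp_new_2, logp_old_2,
               reward_1, reward_2, eta=1.0):
    """All arguments are 1D tensors over a batch of prompt/completion pairs."""
    ratio_diff = (logp_new_1 - logp_old_1) - (logp_new_2 - logp_old_2)
    target = reward_1 - reward_2
    return ((ratio_diff / eta - target) ** 2).mean()

loss = rebel_loss(torch.randn(4), torch.randn(4), torch.randn(4),
                  torch.randn(4), torch.randn(4), torch.randn(4))
```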
♻ ☆ Extrapolated Urban View Synthesis Benchmark
Xiangyu Han, Zhen Jia, Boyi Li, Yan Wang, Boris Ivanovic, Yurong You, Lingjie Liu, Yue Wang, Marco Pavone, Chen Feng, Yiming Li
Photorealistic simulators are essential for the training and evaluation of
vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis
(NVS), a crucial capability that generates diverse unseen viewpoints to
accommodate the broad and continuous pose distribution of AVs. Recent advances
in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic
rendering at real-time speeds and have been widely used in modeling large-scale
driving scenes. However, their performance is commonly evaluated using an
interpolated setup with highly correlated training and test views. In contrast,
extrapolation, where test views largely deviate from training views, remains
underexplored, limiting progress in generalizable simulation technology. To
address this gap, we leverage publicly available AV datasets with multiple
traversals, multiple vehicles, and multiple cameras to build the first
Extrapolated Urban View Synthesis (EUVS) benchmark. We then conduct
quantitative and qualitative evaluations of state-of-the-art Gaussian Splatting
methods across different difficulty levels. Our results show that Gaussian
Splatting is prone to overfitting to the training views. Moreover, incorporating
diffusion priors and improving geometry cannot fundamentally improve NVS under
large view changes, highlighting the need for more robust approaches and
large-scale training. We have released our data to help advance self-driving
and urban robotics simulation technology.
comment: Project page: https://ai4ce.github.io/EUVS-Benchmark/
♻ ☆ Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance AAAI2025
Duc-Hai Pham, Duc Dung Nguyen, Hoang-Anh Pham, Ho Lai Tuan, Phong Ha Nguyen, Khoi Nguyen, Rang Nguyen
Accurate prediction of 3D semantic occupancy from 2D visual images is vital
in enabling autonomous agents to comprehend their surroundings for planning and
navigation. State-of-the-art methods typically employ fully supervised
approaches, necessitating a huge labeled dataset acquired through expensive
LiDAR sensors and meticulous voxel-wise labeling by human annotators. The
resource-intensive nature of this annotation process significantly hampers the
applicability and scalability of these methods. We introduce a novel
semi-supervised framework to alleviate the dependency on densely annotated
data. Our approach leverages 2D foundation models to generate essential 3D
scene geometric and semantic cues, facilitating a more efficient training
process. Our framework exhibits notable properties: (1) Generalizability,
applicable to various 3D semantic scene completion approaches, including 2D-3D
lifting and 3D-2D transformer methods. (2) Effectiveness, as demonstrated
through experiments on SemanticKITTI and NYUv2, wherein our method achieves up
to 85% of the fully-supervised performance using only 10% labeled data. This
approach not only reduces the cost and labor associated with data annotation
but also demonstrates the potential for broader adoption in camera-based
systems for 3D semantic occupancy prediction.
comment: Accepted at AAAI2025
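The sketch below illustrates one generic way such 2D cues could supervise a 3D occupancy head: project voxel centers into the camera, sample the 2D foundation-model semantic map, and apply a cross-entropy pseudo-label loss. The projection convention, function names, and toy shapes are illustrative assumptions rather than the paper's actual pipeline.

```python
import torch
import torch.nn.functional as F

def project_voxels(voxel_centers, K, T):
    """Project 3D voxel centers (N, 3) into pixel coordinates using camera
    intrinsics K (3, 3) and world-to-camera extrinsics T (4, 4)."""
    homo = torch.cat([voxel_centers, torch.ones(len(voxel_centers), 1)], dim=1)  # (N, 4)
    cam = (T @ homo.t()).t()[:, :3]                     # (N, 3) camera-frame coordinates
    valid = cam[:, 2] > 0.1                             # keep points in front of the camera
    pix = (K @ cam.t()).t()
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)      # (N, 2) pixel coordinates
    return pix, valid

def pseudo_label_loss(voxel_logits, voxel_centers, sem_map, K, T):
    """Cross-entropy between predicted voxel semantics and 2D foundation-model
    labels sampled at the projected pixel locations (hypothetical pipeline)."""
    H, W = sem_map.shape
    pix, valid = project_voxels(voxel_centers, K, T)
    in_img = (pix[:, 0] >= 0) & (pix[:, 0] < W) & (pix[:, 1] >= 0) & (pix[:, 1] < H)
    keep = valid & in_img
    u = pix[keep, 0].long().clamp(0, W - 1)
    v = pix[keep, 1].long().clamp(0, H - 1)
    pseudo = sem_map[v, u]                              # class ids from the 2D model
    return F.cross_entropy(voxel_logits[keep], pseudo)

# Toy shapes: 1000 voxels, 20 classes, one 2D semantic map from a foundation model.
loss = pseudo_label_loss(torch.randn(1000, 20), torch.rand(1000, 3) * 10,
                         torch.randint(0, 20, (480, 640)),
                         K=torch.tensor([[500., 0, 320], [0, 500., 240], [0, 0, 1]]),
                         T=torch.eye(4))
```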
♻ ☆ Breaking The Ice: Video Segmentation for Close-Range Ice-Covered Waters
Rapid ice recession in the Arctic Ocean, with predictions of ice-free summers
by 2060, opens new maritime routes but requires reliable navigation solutions.
Current approaches rely heavily on subjective expert judgment, underscoring the
need for automated, data-driven solutions. This study leverages machine
learning to assess ice conditions using ship-borne optical data, introducing a
finely annotated dataset of 946 images and a semi-manual, region-based
annotation technique. The proposed video segmentation model, UPerFlow, advances
the SegFlow architecture by incorporating a six-channel ResNet encoder, two
UPerNet-based segmentation decoders for each image, PWCNet as the optical flow
encoder, and cross-connections that integrate bi-directional flow features
without loss of latent information. The proposed architecture outperforms
baseline image segmentation networks by an average of 38% in occluded regions,
demonstrating the robustness of video segmentation in addressing challenging
Arctic conditions.
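The toy PyTorch sketch below mirrors the overall layout described above: a six-channel encoder over two stacked frames, a flow branch, feature concatenation as a stand-in for the cross-connections, and one decoder per frame. The tiny convolutional modules replace ResNet, UPerNet, and PWCNet purely for illustration; all names and shapes are assumptions, not the UPerFlow implementation.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the six-channel ResNet encoder (two RGB frames stacked)."""
    def __init__(self, in_ch=6, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class TinyFlowEncoder(nn.Module):
    """Stand-in for the optical-flow branch (PWCNet in the paper)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
    def forward(self, pair):
        return self.net(pair)

class DualFrameSegSketch(nn.Module):
    """Two decoders (one per frame) over concatenated appearance + flow features."""
    def __init__(self, dim=64, num_classes=4):
        super().__init__()
        self.enc, self.flow = TinyEncoder(dim=dim), TinyFlowEncoder(dim=dim)
        self.dec1 = nn.Conv2d(2 * dim, num_classes, 1)
        self.dec2 = nn.Conv2d(2 * dim, num_classes, 1)
        self.up = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)

    def forward(self, frame1, frame2):
        pair = torch.cat([frame1, frame2], dim=1)         # (B, 6, H, W)
        app, flo = self.enc(pair), self.flow(pair)        # appearance + flow cues
        fused = torch.cat([app, flo], dim=1)              # concat stands in for cross-connections
        return self.up(self.dec1(fused)), self.up(self.dec2(fused))

f1, f2 = torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128)
seg1, seg2 = DualFrameSegSketch()(f1, f2)                 # (1, 4, 128, 128) each
```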
♻ ☆ RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts
Remote Sensing Vision-Language Models (RS VLMs) have made substantial progress
in remote sensing (RS) image comprehension. While performing well in
multi-modal reasoning and multi-turn conversations, existing models lack
pixel-level understanding and struggle with multi-image inputs. In this work,
we propose RSUniVLM, a unified, end-to-end RS VLM designed for comprehensive
vision understanding across multiple granularities, covering image-level,
region-level, and pixel-level tasks. RSUniVLM also performs effectively in
multi-image analysis, such as change detection and change captioning.
To enhance the model's ability to capture visual information at different
levels without increasing model size, we design a novel architecture called
Granularity-oriented Mixture of Experts that constrains the model to about 1
billion parameters. We also construct a large-scale RS instruction-following
dataset based on a variety of existing datasets in both the RS and general
domains, encompassing tasks such as object localization, visual question
answering, and semantic segmentation. Extensive experiments validate that the
proposed RSUniVLM achieves state-of-the-art performance across various RS
tasks. Code and model will be available at
\href{https://github.com/xuliu-cyber/RSUniVLM}{here}.
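The sketch below shows one plausible reading of a granularity-oriented MoE layer: every token in a sequence is routed to the expert assigned to that task's granularity tag (image-, region-, or pixel-level). The routing rule, dimensions, and class names are illustrative assumptions, not the released RSUniVLM architecture.

```python
import torch
import torch.nn as nn

class GranularityMoE(nn.Module):
    """Route each sequence to the expert assigned to its task granularity
    (0 = image-level, 1 = region-level, 2 = pixel-level)."""
    def __init__(self, dim=512, hidden=1024, num_granularities=3):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_granularities)])

    def forward(self, tokens, granularity_id):
        # tokens: (B, L, D); granularity_id: (B,) one task tag per sequence.
        out = torch.zeros_like(tokens)
        for g, expert in enumerate(self.experts):
            mask = granularity_id == g
            if mask.any():
                out[mask] = expert(tokens[mask])
        return out

tokens = torch.randn(4, 16, 512)
gran = torch.tensor([0, 2, 1, 0])        # e.g. captioning, segmentation, grounding, VQA
y = GranularityMoE()(tokens, gran)       # (4, 16, 512)
```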
♻ ☆ Neural Modulation Alteration to Positive and Negative Emotions in Depressed Patients: Insights from fMRI Using Positive/Negative Emotion Atlas
Yu Feng, Weiming Zeng, Yifan Xie, Hongyu Chen, Lei Wang, Yingying Wang, Hongjie Yan, Kaile Zhang, Ran Tao, Wai Ting Siok, Nizhuan Wang
Background: Although it has been observed that depressed patients process
emotions differently, the precise neural modulation mechanisms of positive and
negative emotions remain elusive. fMRI is a cutting-edge medical imaging
technology renowned for its high spatial resolution and dynamic temporal
information, making it particularly suitable for studying the neural dynamics
of depression. Methods: To address this gap, our study first leveraged fMRI to
delineate activated regions associated with positive and
negative emotions in healthy individuals, resulting in the creation of positive
emotion atlas (PEA) and negative emotion atlas (NEA). Subsequently, we examined
neuroimaging changes in depression patients using these atlases and evaluated
their diagnostic performance based on machine learning. Results: Our findings
demonstrate that the classification accuracy of depressed patients based on PEA
and NEA exceeded 0.70, a notable improvement compared to the whole-brain
atlases. Furthermore, ALFF analysis unveiled significant differences between
depressed patients and healthy controls in eight functional clusters under the
NEA, centered on the left cuneus, cingulate gyrus, and superior parietal
lobule. In contrast, the PEA revealed more pronounced differences across
fifteen clusters, involving the right fusiform gyrus, parahippocampal gyrus,
and inferior parietal lobule. Limitations: Due to the limited sample size and
the limited range of depression subtypes, the efficacy may need further
validation in the future. Conclusions: These findings emphasize the complex interplay between
emotion modulation and depression, showcasing significant alterations in both
PEA and NEA among depression patients. This research enhances our understanding
of emotion modulation in depression, with implications for diagnosis and
treatment evaluation.
comment: 24 pages
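As a schematic of the classification recipe implied by the Methods, the sketch below averages a voxel-wise feature map (standing in for ALFF) within atlas ROIs and cross-validates a linear SVM on the resulting ROI vectors. The data are synthetic and the preprocessing is deliberately simplified; this is not the study's analysis code.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def roi_features(voxel_map, atlas, n_rois):
    """Average a voxel-wise feature map (e.g., ALFF) within each atlas ROI."""
    return np.array([voxel_map[atlas == r].mean() for r in range(1, n_rois + 1)])

rng = np.random.default_rng(0)
n_subjects, n_voxels, n_rois = 60, 5000, 15
atlas = rng.integers(1, n_rois + 1, size=n_voxels)            # synthetic emotion atlas
X = np.stack([roi_features(rng.normal(size=n_voxels), atlas, n_rois)
              for _ in range(n_subjects)])                     # (subjects, ROIs)
y = rng.integers(0, 2, size=n_subjects)                        # 0 = control, 1 = patient

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(clf, X, y, cv=5)                      # chance-level on synthetic data
print(scores.mean())
```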
♻ ☆ M2OST: Many-to-one Regression for Predicting Spatial Transcriptomics from Digital Pathology Images AAAI 2025
The advancement of Spatial Transcriptomics (ST) has facilitated the
spatially-aware profiling of gene expressions based on histopathology images.
Although ST data offers valuable insights into the micro-environment of tumors,
its acquisition remains costly. Therefore, directly predicting ST
expressions from digital pathology images is desirable. Current methods usually
adopt existing regression backbones along with patch-sampling for this task,
which ignores the inherent multi-scale information embedded in the pyramidal
data structure of digital pathology images, and wastes the inter-spot visual
information crucial for accurate gene expression prediction. To address these
limitations, we propose M2OST, a many-to-one regression Transformer that can
accommodate the hierarchical structure of the pathology images via a decoupled
multi-scale feature extractor. Unlike traditional models that are trained with
one-to-one image-label pairs, M2OST uses multiple images from different levels
of the digital pathology image to jointly predict the gene expressions in their
common corresponding spot. Built upon our many-to-one scheme, M2OST can be
easily scaled to fit different numbers of inputs, and its network structure
inherently incorporates nearby inter-spot features, enhancing regression
performance. We have tested M2OST on three public ST datasets and the
experimental results show that M2OST can achieve state-of-the-art performance
with fewer parameters and floating-point operations (FLOPs). The code is
available at: https://github.com/Dootmaan/M2OST.
comment: Accepted by AAAI 2025
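A minimal many-to-one sketch of the idea described above: patches of the same spot taken from several pyramid levels are encoded by decoupled per-level encoders, fused by a small Transformer, and regressed to one gene-expression vector. Module names, sizes, and the fusion scheme are assumptions, not the released M2OST code (linked above).

```python
import torch
import torch.nn as nn

class ManyToOneSTSketch(nn.Module):
    """Encode patches from several WSI pyramid levels of the same spot and
    regress one gene-expression vector for that spot."""
    def __init__(self, num_levels=3, dim=128, num_genes=250):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(32, dim))
            for _ in range(num_levels)])                  # decoupled per-level encoders
        fuse_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(fuse_layer, num_layers=1)
        self.head = nn.Linear(dim, num_genes)

    def forward(self, patches):                           # list of (B, 3, H_l, W_l)
        tokens = torch.stack([enc(p) for enc, p in zip(self.encoders, patches)], dim=1)
        fused = self.fuse(tokens).mean(dim=1)             # (B, dim)
        return self.head(fused)                           # (B, num_genes)

# Three magnification levels of the same spot, coarser levels at lower resolution.
patches = [torch.randn(2, 3, 224, 224), torch.randn(2, 3, 112, 112), torch.randn(2, 3, 56, 56)]
pred = ManyToOneSTSketch()(patches)                       # (2, 250) predicted expressions
loss = nn.functional.mse_loss(pred, torch.randn(2, 250))
```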
♻ ☆ Double-Shot 3D Shape Measurement with a Dual-Branch Network for Structured Light Projection Profilometry
Mingyang Lei, Jingfan Fan, Long Shao, Hong Song, Deqiang Xiao, Danni Ai, Tianyu Fu, Ying Gu, Jian Yang
The structured light (SL)-based three-dimensional (3D) measurement techniques
with deep learning have been widely studied to improve measurement efficiency,
among which fringe projection profilometry (FPP) and speckle projection
profilometry (SPP) are two popular methods. However, they generally use a
single projection pattern for reconstruction, resulting in fringe order
ambiguity or poor reconstruction accuracy. To alleviate these problems, we
propose a parallel dual-branch Convolutional Neural Network (CNN)-Transformer
network (PDCNet), to take advantage of convolutional operations and
self-attention mechanisms for processing different SL modalities. Within
PDCNet, a Transformer branch is used to capture global perception in the fringe
images, while a CNN branch is designed to collect local details in the speckle
images. To fully integrate complementary features, we design a double-stream
attention aggregation module (DAAM) that consists of a parallel attention
subnetwork for aggregating multi-scale spatial structure information. This
module dynamically retains local and global representations to the greatest
extent possible. Moreover, an adaptive mixture density head with a bimodal
Gaussian distribution is proposed for learning a representation that is precise
near discontinuities. Compared to the standard disparity regression strategy,
this adaptive mixture head effectively improves performance at object
boundaries. Extensive experiments demonstrate that our method reduces fringe
order ambiguity while producing high-accuracy results on our self-collected datasets.
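The sketch below illustrates only the bimodal mixture-density idea mentioned above: a per-pixel head predicts two means, two scales, and a mixing weight, trained with the mixture negative log-likelihood. The parameterization and names are assumptions; the dual-branch network and the DAAM are not shown.

```python
import math
import torch
import torch.nn as nn

class BimodalMixtureHead(nn.Module):
    """Per-pixel two-component Gaussian mixture over the regressed quantity
    (e.g., disparity/phase): outputs means, scales, and a mixing weight."""
    def __init__(self, in_ch=64):
        super().__init__()
        self.out = nn.Conv2d(in_ch, 5, 1)   # mu1, mu2, log_sigma1, log_sigma2, logit_pi

    def forward(self, feats):
        mu1, mu2, ls1, ls2, logit_pi = self.out(feats).chunk(5, dim=1)
        return mu1, mu2, ls1.exp(), ls2.exp(), torch.sigmoid(logit_pi)

def bimodal_nll(target, mu1, mu2, s1, s2, pi, eps=1e-6):
    """Negative log-likelihood of the target under the predicted bimodal mixture."""
    def log_gauss(mu, s):
        return -0.5 * ((target - mu) / s) ** 2 - s.log() - 0.5 * math.log(2 * math.pi)
    log_p = torch.logsumexp(torch.stack([
        (pi + eps).log() + log_gauss(mu1, s1),
        (1 - pi + eps).log() + log_gauss(mu2, s2)]), dim=0)
    return -log_p.mean()

feats = torch.randn(1, 64, 32, 32)
target = torch.randn(1, 1, 32, 32)
mu1, mu2, s1, s2, pi = BimodalMixtureHead()(feats)
loss = bimodal_nll(target, mu1, mu2, s1, s2, pi)
```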
♻ ☆ Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
This paper presents Diffusion Forcing, a new training paradigm where a
diffusion model is trained to denoise a set of tokens with independent
per-token noise levels. We apply Diffusion Forcing to sequence generative
modeling by training a causal next-token prediction model to generate one or
several future tokens without fully diffusing past ones. Our approach is shown
to combine the strengths of next-token prediction models, such as
variable-length generation, with the strengths of full-sequence diffusion
models, such as the ability to guide sampling to desirable trajectories. Our
method offers a range of additional capabilities, such as (1) rolling-out
sequences of continuous tokens, such as video, with lengths past the training
horizon, where baselines diverge and (2) new sampling and guiding schemes that
uniquely profit from Diffusion Forcing's variable-horizon and causal
architecture, and which lead to marked performance gains in decision-making and
planning tasks. In addition to its empirical success, our method is proven to
optimize a variational lower bound on the likelihoods of all subsequences of
tokens drawn from the true joint distribution. Project website:
https://boyuan.space/diffusion-forcing
comment: Project website: https://boyuan.space/diffusion-forcing
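The core training idea lends itself to a short sketch: sample an independent diffusion timestep per token, noise each token to its own level, and train a causal denoiser with a standard MSE objective. The toy noise schedule, model, and shapes below are illustrative assumptions, not the authors' implementation (see the project website above).

```python
import torch
import torch.nn as nn

class CausalDenoiser(nn.Module):
    """Causal Transformer that predicts per-token noise from noisy tokens and
    their (independent) noise levels."""
    def __init__(self, dim=64, num_steps=1000):
        super().__init__()
        self.level_emb = nn.Embedding(num_steps, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy, t):                       # noisy: (B, L, D), t: (B, L)
        L = noisy.size(1)
        causal = torch.full((L, L), float("-inf")).triu(1)   # additive causal mask
        h = self.backbone(noisy + self.level_emb(t), mask=causal)
        return self.out(h)

def diffusion_forcing_step(model, x0, alphas_cumprod):
    """One training step with an independent noise level per token."""
    B, L, D = x0.shape
    t = torch.randint(0, len(alphas_cumprod), (B, L))         # per-token timestep
    a = alphas_cumprod[t].unsqueeze(-1)                       # (B, L, 1)
    noise = torch.randn_like(x0)
    noisy = a.sqrt() * x0 + (1 - a).sqrt() * noise            # noise each token to its own level
    pred = model(noisy, t)
    return nn.functional.mse_loss(pred, noise)

model = CausalDenoiser()
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)            # toy noise schedule
loss = diffusion_forcing_step(model, torch.randn(2, 16, 64), alphas_cumprod)
```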
♻ ☆ Object-level Geometric Structure Preserving for Natural Image Stitching
The topic of stitching images with globally natural structures holds
paramount significance, with two main goals: pixel-level alignment and
distortion prevention. The existing approaches exhibit the ability to align
well, yet fall short in maintaining object structures. In this paper, we
endeavour to safeguard the overall OBJect-level structures within images based
on a Global Similarity Prior (OBJ-GSP), while preserving good alignment
performance. Our approach leverages semantic segmentation models such as the
Segment Anything Model family to extract the contours of any objects in a
scene. Triangular meshes are employed in image transformation to protect the
overall shapes of objects within images. The balance between alignment and
distortion prevention is achieved by allowing the object meshes to strike a
balance between similarity and projective transformation. We also demonstrate
that object-level semantic information is necessary in low-altitude aerial
image stitching. Additionally, we propose StitchBench, the largest image
stitching benchmark with the most diverse scenarios. Extensive experimental
results demonstrate that OBJ-GSP outperforms existing methods in both pixel
alignment and shape preservation. Code and dataset are publicly available at
\url{https://github.com/RussRobin/OBJ-GSP}.
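The sketch below covers only the contour-extraction step that the abstract attributes to Segment Anything: generate masks automatically and trace their outer contours with OpenCV. The checkpoint path and model type are placeholders, OpenCV 4's findContours return convention is assumed, and the mesh-based warping itself is not shown.

```python
import cv2
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

def object_contours(image_bgr, checkpoint="sam_vit_b.pth", model_type="vit_b"):
    """Run SAM's automatic mask generator, then trace each mask's outer contour."""
    sam = sam_model_registry[model_type](checkpoint=checkpoint)
    masks = SamAutomaticMaskGenerator(sam).generate(
        cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    contours = []
    for m in masks:                                   # each m["segmentation"] is a bool mask
        seg = m["segmentation"].astype(np.uint8)
        cnts, _ = cv2.findContours(seg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        contours.extend(cnts)
    return contours                                   # polygons that could guide object meshes

# contours = object_contours(cv2.imread("left_view.jpg"))  # placeholder input image
```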
♻ ☆ Classification of the lunar surface pattern by AI architectures: Does AI see a rabbit in the Moon?
In Asian countries, there is a tradition that a rabbit, known as the Moon
rabbit, lives on the Moon. Typically, two reasons are mentioned for the origin
of this tradition. The first reason is that the color pattern of the lunar
surface resembles the shape of a rabbit. The second reason is that both the
Moon and rabbits are symbols of fertility, as the Moon appears and disappears
(i.e., waxing and waning) cyclically and rabbits are known for their high
fertility. Considering the latter reason, is the color pattern of the lunar
surface not similar to a rabbit? Here, the similarity between rabbit and the
lunar surface pattern was evaluated using seven AI architectures. In the test
conducted with Contrastive Language-Image Pre-Training (CLIP), which can
classify images based on given words, it was assumed that people frequently
observe the Moon in the early evening. Under this condition, the lunar surface
pattern was found to be more similar to a rabbit than a face in low-latitude
regions, while it could also be classified as a face as the latitude increases.
This result is consistent with the facts that the oldest literature about the
Moon rabbit was written in India and that a tradition of seeing a human face in
the Moon exists in Europe. In a 1000-class test using seven AI architectures,
ConvNeXt and CLIP sometimes classified the lunar surface pattern as a rabbit
with relatively high probabilities. Cultures are shaped by our attitudes toward
the environment. Both dynamic and static similarities may be essential for
stimulating our imagination.
comment: 20 pages, 5 figures, 2 tables, accepted for publication in AI &
Society
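The zero-shot comparison described above can be reproduced in spirit with a standard Hugging Face CLIP call, scoring a Moon photograph against "rabbit" and "human face" prompts. The image path and prompt wording are placeholders, and the study's latitude-dependent orientation would roughly correspond to rotating the image before encoding.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("moon.jpg")                        # placeholder photo of the full Moon
prompts = ["a photo of a rabbit", "a photo of a human face"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

for p, prob in zip(prompts, probs[0].tolist()):
    print(f"{p}: {prob:.3f}")
```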
♻ ☆ Rethinking Alignment and Uniformity in Unsupervised Semantic Segmentation AAAI23
Unsupervised image semantic segmentation (UISS) aims to match low-level visual
features with semantic-level representations without external supervision. In
this paper, we analyze the critical properties of UISS models from the
perspective of feature alignment and feature uniformity, and we compare UISS
with image-wise representation learning. Based on this analysis, we argue that
existing MI-based methods in UISS suffer from representation collapse.
Accordingly, we propose a robust network called the Semantic Attention Network
(SAN), in which a new Semantic Attention (SEAT) module generates pixel-wise and
semantic features dynamically. Experimental results on multiple semantic
segmentation benchmarks show that our unsupervised segmentation framework
excels at capturing semantic representations, outperforming all unpretrained
and even several pretrained methods.
comment: AAAI23
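For reference, the alignment and uniformity measures commonly used in this kind of analysis (following Wang and Isola's formulation for contrastive representations) can be written in a few lines; whether the paper uses these exact definitions is an assumption.

```python
import torch

def alignment(z1, z2, alpha=2):
    """Expected distance between embeddings of positive pairs (lower is better)."""
    return (z1 - z2).norm(dim=1).pow(alpha).mean()

def uniformity(z, t=2):
    """Log of the average Gaussian potential between all pairs (lower is better)."""
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()

# L2-normalized embeddings stand in for pixel/semantic features from a UISS model.
z1 = torch.nn.functional.normalize(torch.randn(256, 64), dim=1)
z2 = torch.nn.functional.normalize(z1 + 0.1 * torch.randn(256, 64), dim=1)
print(alignment(z1, z2).item(), uniformity(z1).item())
```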